kgrams (version 0.1.0)

tknz_sent: Sentence tokenizer

Description

Extract sentences from a batch of text lines.

Usage

tknz_sent(input, EOS = "[.?!:;]+", keep_first = FALSE)

Arguments

input

a character vector.

EOS

a regular expression matching an End-Of-Sentence delimiter.

keep_first

TRUE or FALSE. Should the first character of each match be appended to the returned sentence, separated from it by a space?

Value

a character vector, each entry of which corresponds to a single sentence.

Details

tknz_sent() splits text into sentences at the delimiters matched by the regular expression EOS (by default, runs of the single characters . ? ! : ;). Specifically, when an EOS token is found, the next sentence begins at the first position in the input string that contains neither an EOS token nor white space, so that entries like "Hi there!!!" or "Hello . . ." are each recognized as a single sentence.
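The splitting rule described above can be approximated in base R. This is a minimal sketch, not the kgrams implementation: the helper name split_sentences_sketch() is hypothetical, it does not handle keep_first, and it assumes that surrounding white space is trimmed from each sentence.

```r
# Hypothetical helper, NOT part of kgrams: a base-R approximation of the
# splitting rule described in Details (keep_first is not handled).
split_sentences_sketch <- function(input, EOS = "[.?!:;]+") {
  # Treat a run of EOS matches, possibly interleaved with white space,
  # as a single sentence boundary (e.g. "!!!" or ". . .").
  boundary <- paste0("\\s*(?:", EOS, "\\s*)+")
  # Split every input line at the boundaries and flatten the result.
  pieces <- unlist(strsplit(input, boundary, perl = TRUE))
  # Assumption: trim surrounding white space and drop empty pieces.
  pieces <- trimws(pieces)
  pieces[nzchar(pieces)]
}

split_sentences_sketch("Hi there!!! Hello . . . Bye")
# "Hi there" "Hello" "Bye"
```

Because each element of the input is split separately, text from different elements never ends up in the same sentence in this sketch.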

If keep_first is FALSE, the delimiters are stripped from the returned sentences, which means that all delimiters are treated symmetrically.

In the absence of any EOS delimiter, tknz_sent() returns the input as is, since parts of text corresponding to different entries of the vector input are understood as parts of separate sentences.

Examples

# NOT RUN {
tknz_sent("Hi there! I'm using `sbo`.")
# }