
quanteda (version 0.9.7-17)

segment: segment texts into component elements

Description

Segment text(s) into tokens, sentences, paragraphs, or other sections. segment works on a character vector or corpus object, and allows the delimiters to be user-defined. See Details.

Usage

segment(x, ...)
"segment"(x, what = c("tokens", "sentences", "paragraphs", "tags", "other"), delimiter = ifelse(what == "tokens", " ", ifelse(what == "sentences", "[.!?:;]", ifelse(what == "paragraphs", "\\n{2}", ifelse(what == "tags", "##\\w+\\b", NULL)))), valuetype = c("regex", "fixed", "glob"), perl = FALSE, ...)
"segment"(x, what = c("tokens", "sentences", "paragraphs", "tags", "other"), delimiter = ifelse(what == "tokens", " ", ifelse(what == "sentences", "[.!?:;]", ifelse(what == "paragraphs", "\\n{2}", ifelse(what == "tags", "##\\w+\\b", NULL)))), valuetype = c("regex", "fixed", "glob"), perl = FALSE, keepdocvars = TRUE, ...)

Arguments

x
text or corpus object to be segmented
...
provides additional arguments passed to tokenize, if what = "tokens" is used
what
unit of segmentation. Current options are "tokens" (default), "sentences", "paragraphs", "tags", and "other". Segmenting on "other" allows segmentation of a text on any user-defined value, and must be accompanied by the delimiter argument. Segmenting on "tags" performs the same function but preserves the tags as a document variable in the segmented corpus.
delimiter
delimiter defined as a regex for segmentation. Each type has its own default, except "other", which requires a value to be specified.
valuetype
how to interpret the delimiter: "fixed" for exact matching; "regex" for regular expressions; or "glob" for "glob"-style wildcard patterns
perl
logical. Should Perl-compatible regular expressions be used?
keepdocvars
if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars in the segmented corpus. Dropping the docvars might be useful in order to conserve space or if these are not desired for the segmented corpus.
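As a minimal sketch of the "other" option (the input text here is invented for illustration): because "other" has no default delimiter, one must be supplied explicitly, and it is interpreted according to valuetype ("regex" by default).

```r
library(quanteda)

# Segment on a user-defined delimiter; "|" is a regex metacharacter,
# so it must be escaped when valuetype = "regex" (the default)
txt <- "Section A|Section B|Section C"
segment(txt, what = "other", delimiter = "\\|")
```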

Value

A list of segmented texts, with each element of the list corresponding to one of the original texts.

Details

Tokens are delimited by separators (a single space, by default). For sentences, the delimiter can be defined by the user. The default for sentences includes ., !, ?, plus ; and :. For paragraphs, the default is two newline characters, although this could be changed to a single newline by changing the value of delimiter to "\\n{1}", which is the R version of the regex for one newline character. (You might need this if the document was created in a word processor, for instance, and the lines were wrapped in the window rather than being hard-wrapped with a newline character.)
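The effect of the single-newline delimiter can be sketched as follows (the text is invented for illustration):

```r
library(quanteda)

softwrapped <- "A line wrapped\nby the editor.\n\nA real second paragraph."
# Default: paragraphs are split only on blank lines (two newlines),
# so the soft-wrapped lines stay together as one paragraph
segment(softwrapped, what = "paragraphs")
# Split on every newline instead, e.g. for soft-wrapped documents
segment(softwrapped, what = "paragraphs", delimiter = "\\n{1}")
```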

Examples

# same as tokenize()
identical(tokenize(ukimmigTexts), segment(ukimmigTexts))

# segment into paragraphs
segment(ukimmigTexts[3:4], "paragraphs")

# segment a text into sentences
segmentedChar <- segment(ukimmigTexts, "sentences")
segmentedChar[2]
testCorpus <- corpus(c("##INTRO This is the introduction. 
                       ##DOC1 This is the first document.  
                       Second sentence in Doc 1.  
                       ##DOC3 Third document starts here.  
                       End of third document.",
                      "##INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
# add a docvar
testCorpus[["serialno"]] <- paste0("textSerial", 1:ndoc(testCorpus))
testCorpusSeg <- segment(testCorpus, "tags")
summary(testCorpusSeg)
texts(testCorpusSeg)
# segment a corpus into sentences
segmentedCorpus <- segment(corpus(ukimmigTexts), "sentences")
identical(ndoc(segmentedCorpus), length(unlist(segmentedChar)))
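As a further sketch (not part of the original examples), valuetype = "fixed" treats the delimiter literally rather than as a regex, which avoids escaping when the delimiter contains regex metacharacters:

```r
library(quanteda)

# "." is a regex metacharacter; with valuetype = "fixed" it is
# matched as a literal dot, with no escaping needed
segment("a.b.c", what = "other", delimiter = ".", valuetype = "fixed")
```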
