
quanteda (version 0.9.9-3)

corpus_segment: segment texts into component elements

Description

Segment corpus text(s) or a character vector into tokens, sentences, paragraphs, or other sections. corpus_segment works on a corpus object and char_segment on a character vector; both allow the delimiters to be user-defined. This is useful for breaking the texts of a corpus into smaller documents based on sentences, or based on a user-defined "tag" pattern. See Details.
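
For instance, a minimal sketch of sentence-level segmentation (the short text below is invented for illustration):

corp <- corpus("Spain is sunny. It rains in the UK. Berlin is cold.")
# each sentence becomes its own document in the segmented corpus
texts(corpus_segment(corp, what = "sentences"))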

Usage

corpus_segment(x, what = c("sentences", "paragraphs", "tokens", "tags", "other"),
  delimiter = switch(what, paragraphs = "\\n{2}", tags = "##\\w+\\b", NULL),
  valuetype = c("regex", "fixed", "glob"), keepdocvars = TRUE, ...)

char_segment(x, what = c("sentences", "paragraphs", "tokens", "tags", "other"),
  delimiter = switch(what, paragraphs = "\\n{2}", tags = "##\\w+\\b", NULL),
  valuetype = c("regex", "fixed", "glob"), keepdocvars = TRUE, ...)

Arguments

x
corpus (for corpus_segment) or character object (for char_segment) whose texts will be segmented
what
unit of segmentation. Current options are "sentences" (the default), "paragraphs", "tokens", "tags", and "other". Segmenting on "other" allows segmentation of a text on any user-defined value, and must be accompanied by the delimiter argument; segmenting on "tags" performs the same function but preserves the tags as a document variable in the segmented corpus.
delimiter
delimiter defined as a regular expression for segmentation; only relevant for what = "paragraphs" (where the default is two newlines) and what = "tags" (where the default is a tag preceded by two pound or "hash" signs, ##). The delimiter has no effect on segmentation into tokens or sentences. An illustration follows this argument list.
valuetype
how to interpret the delimiter pattern: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
keepdocvars
(for corpus objects) if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars in the segmented corpus. Dropping the docvars might be useful to conserve space, or if they are not wanted in the segmented corpus.
...
additional arguments passed to tokens; relevant only when what = "tokens"
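
As a sketch of how what = "other", delimiter, and valuetype interact (the text and the "PART" delimiter below are invented for illustration):

txt <- "PART 1: some text here. PART 2: some more text here."
# split wherever the user-defined pattern "PART <digit>:" occurs
char_segment(txt, what = "other", delimiter = "PART \\d:", valuetype = "regex")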

Value

A corpus of segmented texts.

Details

Tokens are delimited by separators. For tokens and sentences, these are determined by the tokenizer behaviour in tokens. For paragraphs, the default is two newlines, although this can be changed to a single newline by setting delimiter to "\\n{1}", the R form of the regular expression for one newline character. (You might need this if the document was created in a word processor, for instance, and the lines were wrapped in the window rather than being hard-wrapped with a newline character.)
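
For example, a brief sketch contrasting the default two-newline delimiter with a single-newline delimiter (the text is invented for illustration):

txt <- "First paragraph.\nStill the first paragraph.\n\nSecond paragraph."
char_segment(txt, what = "paragraphs")                        # default "\\n{2}": splits only at blank lines
char_segment(txt, what = "paragraphs", delimiter = "\\n{1}")  # splits at every newline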

Examples

# create a corpus of documents marked up with ## tags
testCorpus <- corpus(c("##INTRO This is the introduction.
                        ##DOC1 This is the first document.  Second sentence in Doc 1.
                        ##DOC3 Third document starts here.  End of third document.",
                       "##INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
# add a docvar
testCorpus[["serialno"]] <- paste0("textSerial", 1:ndoc(testCorpus))
testCorpusSeg <- corpus_segment(testCorpus, "tags")
summary(testCorpusSeg)
texts(testCorpusSeg)
# segment a corpus into sentences
segmentedCorpus <- corpus_segment(corpus(data_char_ukimmig2010), "sentences")
summary(segmentedCorpus)

# segmenting into tokens gives the same result as tokens()
identical(as.character(tokens(data_char_ukimmig2010)), 
          as.character(char_segment(data_char_ukimmig2010, what = "tokens")))

# segment into paragraphs
char_segment(data_char_ukimmig2010[3:4], "paragraphs")

# segment a text into sentences
segmentedChar <- char_segment(data_char_ukimmig2010, "sentences")
segmentedChar[3]
