
quanteda (version 0.99)

corpus_segment: segment texts into component elements

Description

Segment corpus text(s) or a character vector into tokens, sentences, paragraphs, or other sections. corpus_segment works on a corpus object and char_segment on a character vector, and both allow the delimiters to be user-defined. This is useful for breaking the texts of a corpus into smaller documents based on sentences, or based on a user-defined "tag" pattern. See Details.

Usage

corpus_segment(x, what = c("sentences", "paragraphs", "tokens", "tags",
  "other"), delimiter = NULL, valuetype = c("regex", "fixed", "glob"),
  omit_empty = TRUE, use_docvars = TRUE, ...)

char_segment(x, what = c("sentences", "paragraphs", "tokens", "tags",
  "other"), delimiter = NULL, valuetype = c("regex", "fixed", "glob"),
  omit_empty = TRUE, use_docvars = TRUE, ...)

Arguments

x

character or corpus object whose texts will be segmented

what

unit of segmentation. Current options are "sentences" (default), "paragraphs", "tokens", "tags", and "other".

Segmenting on "other" allows segmentation of a text on any user-defined value, and must be accompanied by the delimiter argument. Segmenting on "tags" performs the same function but preserves the tags as a document variable in the segmented corpus.

delimiter

delimiter defined as a regex for segmentation; only relevant for what = "paragraphs" (where the default is two newlines), "tags" (where the default is a tag preceded by two pound or "hash" signs ##), and "other".

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
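To see how a "glob"-style wildcard relates to a regular expression, base R's utils::glob2rx() (a base-R illustration only, not part of quanteda) converts one form into the other:

```r
# base R only: translate glob wildcards into regular expressions
utils::glob2rx("##*")    # "^##"        -- anchored prefix match
utils::glob2rx("*.txt")  # "^.*\\.txt$" -- any string ending in ".txt"
```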

omit_empty

if TRUE, empty texts are removed

use_docvars

(for corpus objects only) if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars in the segmented corpus. Dropping the docvars might be useful in order to conserve space or if these are not desired for the segmented corpus.

...

provides additional arguments passed to tokens, if what = "tokens" is used

Value

corpus_segment returns a corpus of segmented texts, with a tag docvar if what = "tags".

char_segment returns a character vector of segmented texts.

Using delimiters

One of the most common uses of corpus_segment is to partition a corpus into sub-documents using tags. By default, the tag value is any word that begins with a double "hash" sign (##) and is followed by whitespace. This can be modified, but be careful to keep the syntax for the trailing word boundary (\\b).
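As a base-R sketch of what that default pattern matches (gregexpr() only; quanteda is not needed for the illustration):

```r
# match the default tag pattern: "##", word characters, word boundary
txt <- "##INTRO This is the introduction. ##DOC1 First document."
regmatches(txt, gregexpr("##\\w+\\b", txt))[[1]]
# [1] "##INTRO" "##DOC1"
```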

The default values for delimiter are, according to valuetype:

paragraphs

"\\n{2}", regular expression meaning two newlines. If you wish to define a paragaph as a single newline, change the 2 to a 1.

tags

"##\\w+\\b", a regular expression meaning two "hash" characters followed by any number of word characters followed by a word boundary (a whitespace or the end of the text).

other

No default; user must supply one.

tokens, sentences

Delimiters do not apply to these, and a warning will be issued if you attempt to supply one.

Delimiters may be defined for the other valuetypes, but these may produce unexpected results; for example, a "glob" expression has no way to express a word boundary.

Details

Tokens are delimited by separators. For tokens and sentences, these are determined by the tokenizer behaviour in tokens.

For paragraphs, the default is two newline characters, although this can be changed to a single newline by setting delimiter to "\\n{1}", which is the R version of the regex for one newline character. (You might need this if the document was created in a word processor, for instance, and the lines were wrapped in the window rather than being hard-wrapped with a newline character.)
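The difference between the two delimiters is easy to see with base R's strsplit() alone (a sketch of the splitting behaviour, not of corpus_segment itself):

```r
txt <- "First paragraph.\n\nSecond paragraph.\nStill part of the second."
strsplit(txt, "\\n{2}")[[1]]  # 2 pieces: only the blank line separates
strsplit(txt, "\\n{1}")[[1]]  # 4 pieces: every newline splits, leaving one
                              # empty string (cf. the omit_empty argument)
```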

See Also

corpus_reshape, tokens

Examples

## segmenting a corpus

testCorpus <- 
corpus(c("##INTRO This is the introduction.
          ##DOC1 This is the first document.  Second sentence in Doc 1.
          ##DOC3 Third document starts here.  End of third document.",
         "##INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
# add a docvar
testCorpus[["serialno"]] <- paste0("textSerial", 1:ndoc(testCorpus))
testCorpusSeg <- corpus_segment(testCorpus, "tags")
summary(testCorpusSeg)
texts(testCorpusSeg)
# segment a corpus into sentences
segmentedCorpus <- corpus_segment(corpus(data_char_ukimmig2010), "sentences")
summary(segmentedCorpus)

## segmenting a character object

# segmenting into tokens gives the same result as tokens()
identical(as.character(tokens(data_char_ukimmig2010)), 
          as.character(char_segment(data_char_ukimmig2010, what = "tokens")))

# segment into paragraphs
char_segment(data_char_ukimmig2010[3:4], "paragraphs")

# segment a text into sentences
segmentedChar <- char_segment(data_char_ukimmig2010, "sentences")
segmentedChar[3]
