tokens_segment: Segment tokens object by patterns

Description

Segment tokens by splitting on a pattern match. This is useful for breaking the tokenized texts into smaller document units, based on a regular pattern or a user-supplied annotation. While it normally makes more sense to do this at the corpus level (see corpus_segment), tokens_segment provides the option to perform this operation on tokens.

Usage

tokens_segment(x, pattern, valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, extract_pattern = FALSE,
  pattern_position = c("before", "after"), use_docvars = TRUE)

Arguments

x

tokens object whose token elements will be segmented

pattern

a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore case when matching, if TRUE

extract_pattern

remove matched patterns from the texts and save in docvars, if TRUE

pattern_position

either "before" or "after", depending on whether the pattern precedes the text (as with a tag) or follows the text (as with punctuation delimiters)

use_docvars

if TRUE, repeat the docvar values for each segmented document; if FALSE, drop the docvars from the segmented tokens object. Dropping the docvars can be useful to conserve space or when they are not needed after segmentation.
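
As a rough illustration of use_docvars, the sketch below segments a one-document tokens object that carries a single docvar. The text and the author docvar are made up for this example and are not part of the package documentation.

```r
library(quanteda)

# Hypothetical one-document corpus with a made-up docvar named "author"
corp <- corpus(c(d1 = "First sentence. Second sentence."),
               docvars = data.frame(author = "A"))
toks_dv <- tokens(corp)

# use_docvars = TRUE: each segment should inherit the docvars of its source document
seg_keep <- tokens_segment(toks_dv, ".", valuetype = "fixed",
                           pattern_position = "after", use_docvars = TRUE)

# use_docvars = FALSE: the segments are returned without the original docvars
seg_drop <- tokens_segment(toks_dv, ".", valuetype = "fixed",
                           pattern_position = "after", use_docvars = FALSE)

docvars(seg_keep)
docvars(seg_drop)
```

Comparing the two docvars() results shows whether the document-level variables were carried over to the segments.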

Value

tokens_segment returns a tokens object whose documents have been split by patterns

Examples

# NOT RUN {
library(quanteda)
txts <- "Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for
it shall arrive, I shall endeavor to express the high sense I entertain of
this distinguished honor."
toks <- tokens(txts)

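# split after sentence-ending punctuation, listed explicitly as fixed tokens
toks_punc <- tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
                            pattern_position = "after")

# alternative: match sentence-terminating punctuation with a Unicode property
# regex; note that this overwrites the previous toks_punc
toks_punc <- tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
                            extract_pattern = FALSE,
                            pattern_position = "after")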
# }
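
The examples above only use pattern_position = "after". The following is a minimal sketch of the "before" case, where a tag token precedes the text it labels, combined with extract_pattern = TRUE. The tokens and the TAG_ prefix are hypothetical and chosen only for illustration. The docvar name pattern used on the last line is an assumption about where the extracted tags are stored and may differ between package versions.

```r
library(quanteda)

# Hypothetical pre-tokenized document in which each section starts with a tag token
toks_tag <- as.tokens(list(doc1 = c("TAG_INTRO", "Hello", "world",
                                    "TAG_BODY", "More", "text", "here")))

# Start a new segment at every token matching the glob pattern "TAG_*";
# extract_pattern = TRUE removes the tag tokens from the segments
segs <- tokens_segment(toks_tag, "TAG_*", valuetype = "glob",
                       extract_pattern = TRUE,
                       pattern_position = "before")

ndoc(segs)     # number of segments produced
ntoken(segs)   # tokens remaining in each segment once the tags are removed

# Assumption: the extracted tags are saved in a docvar named "pattern"
docvars(segs)
```

With extract_pattern = FALSE the tag tokens would instead stay at the start of each segment.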