tokens_select: select or remove tokens from a tokens object

Description

This function selects or discards tokens from a tokens object. tokens_remove(x, features) is a shortcut for tokens_select(x, features, selection = "remove"). The most common use of tokens_remove is to eliminate stop words from a text or text-based object, while the most common use of tokens_select is to keep only the features that match a list of patterns, such as regular expressions or the values of a dictionary.
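As a minimal sketch of the shortcut described above, the two calls below are intended to be equivalent. The sentence and the feature list are made up for illustration, and the second argument is passed by position.

```r
library(quanteda)

# Illustrative input, not taken from the package examples
toks <- tokens("The quick brown fox jumps over the lazy dog")

# tokens_remove() is described as shorthand for selection = "remove"
removed_a <- tokens_remove(toks, c("the", "over"))
removed_b <- tokens_select(toks, c("the", "over"), selection = "remove")

# Compare the remaining tokens from both calls
identical(as.list(removed_a), as.list(removed_b))
```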

Usage

tokens_select(x, features, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  padding = FALSE, verbose = quanteda_options("verbose"))
tokens_remove(x, features, valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, padding = FALSE,
  verbose = quanteda_options("verbose"))

Arguments

x

tokens object whose token elements will be selected

features

either a character vector of features to be selected, or a dictionary class object whose values (not keys) will provide the features to be selected.

selection

whether to "keep" or "remove" the features

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore case when matching, if TRUE

padding

(only for tokenizedTexts objects) if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected features, for instance if a window of adjacency needs to be computed.

verbose

if TRUE, print messages about how many features were removed
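The following sketch illustrates how the valuetype, case_insensitive and padding arguments described above might be combined. The sentence and the patterns are invented for illustration, and the second argument is passed by position.

```r
library(quanteda)

# Illustrative input, not taken from the package examples
toks <- tokens("Taxes and taxation were debated in the Senate")

# "glob": * stands for any run of characters
tokens_select(toks, "tax*", valuetype = "glob")

# "regex": a comparable pattern written as a regular expression
tokens_select(toks, "^tax", valuetype = "regex")

# "fixed": the pattern is matched exactly, with no wildcards
tokens_select(toks, "taxes", valuetype = "fixed")

# With case_insensitive = FALSE, lower-case "senate" is compared
# case-sensitively against the capitalised token "Senate"
tokens_select(toks, "senate", valuetype = "fixed", case_insensitive = FALSE)

# padding = TRUE leaves an empty string where each removed token was,
# so the positions of the remaining tokens are preserved
tokens_remove(toks, c("and", "were", "in", "the"), padding = TRUE)
```

Keeping the padding is mainly useful when a later step depends on the original token positions, such as computing a window of adjacency.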

Value

a tokens object with the matching features kept or removed, according to selection

Examples


library(quanteda)

## for tokenized texts
# a named character vector, one element per text
txt <- c(wash1 = "Fellow citizens, I am again called upon by the voice of my country to
                  execute the functions of its Chief Magistrate.",
         wash2 = "When the occasion proper for it shall arrive, I shall endeavor to express
                  the high sense I entertain of this distinguished honor.")
# tokenize without punctuation, then drop English stop words
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
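A dictionary can also supply the features, in which case its values rather than its keys are matched. The sketch below assumes a small hypothetical dictionary (the keys "country" and "office" and their values are invented for illustration) and a shortened input sentence.

```r
library(quanteda)

# Illustrative input
toks <- tokens("Fellow citizens, I am again called upon by the voice of my country",
               remove_punct = TRUE)

# Hypothetical dictionary: only the values are used as features
dict <- dictionary(list(country = c("country", "citizens"),
                        office  = c("magistrate", "honor")))

# Keep only the tokens that match a dictionary value
tokens_select(toks, dict)
```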
