
quanteda (version 0.9.9-50)

tokens_select: select or remove tokens from a tokens object


This function selects or discards tokens from a tokens object. tokens_remove(x, features) is a shortcut for tokens_select(x, features, selection = "remove"). The most common use of tokens_remove is to eliminate stop words from a text or text-based object, while the most common use of tokens_select is to keep only the features that match a list of patterns (such as regular expressions), including those supplied by a dictionary.
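The relationship between the two functions can be checked directly. The following is a minimal sketch, assuming quanteda is installed; the sentence and the word lists are made up for illustration, and the second argument (features in the signature documented here) is passed positionally rather than by name.

```r
library(quanteda)

# Illustrative input, not taken from the package documentation
toks <- tokens("The quick brown fox jumps over the lazy dog")

# The second argument (`features` in this version) is passed positionally
removed_a <- tokens_remove(toks, c("the", "over"))
removed_b <- tokens_select(toks, c("the", "over"), selection = "remove")

# Both calls should leave the same tokens behind
identical(as.list(removed_a), as.list(removed_b))

# selection = "keep" retains only the matching tokens instead
kept <- tokens_select(toks, c("quick", "lazy"), selection = "keep")
as.list(kept)
```

Because case_insensitive defaults to TRUE, the capitalised "The" at the start of the sentence is matched by the lower-case pattern "the".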


tokens_select(x, features, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  padding = FALSE, verbose = quanteda_options("verbose"))

tokens_remove(x, features, valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, padding = FALSE,
  verbose = quanteda_options("verbose"))


x: tokens object whose token elements will be selected
features: either a character vector of features to be selected, or a dictionary class object whose values (not keys) will provide the features to be selected
selection: whether to "keep" or "remove" the features
valuetype: how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: if TRUE, ignore case when matching
padding: (only for tokenizedTexts objects) if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected features, for instance if a window of adjacency needs to be computed.
verbose: if TRUE, print messages about how many features were removed
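The effect of valuetype and padding is easiest to see on a small input. The sketch below assumes quanteda is installed; the sentence and patterns are invented for illustration, and patterns are again passed positionally as the second argument.

```r
library(quanteda)

# Illustrative input, not taken from the package documentation
toks <- tokens("Taxes and taxation tax the taxpayer")

# "glob": "*" stands for any run of characters, so this matches every
# token that starts with "tax" (case is ignored by default)
glob_kept <- tokens_select(toks, "tax*", valuetype = "glob")

# "regex": the pattern is a regular expression, here anchored so that
# only "tax" or "taxes" match in full
regex_kept <- tokens_select(toks, "^tax(es)?$", valuetype = "regex")

# "fixed": the pattern must equal the whole token
fixed_kept <- tokens_select(toks, "tax", valuetype = "fixed")

# padding = TRUE leaves an empty string in place of each removed token,
# so the remaining tokens keep their original positions
padded <- tokens_remove(toks, c("and", "the"), padding = TRUE)

as.list(glob_kept)
as.list(regex_kept)
as.list(fixed_kept)
as.list(padded)
```

With padding, the result has the same length as the input, which is what makes position-based operations such as adjacency windows possible after removal.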


a tokens object with the specified features kept or removed


## for tokenized texts
library(quanteda)

# name the two documents through a named character vector
txt <- c(wash1 = "Fellow citizens, I am again called upon by the voice of my country to
                  execute the functions of its Chief Magistrate.",
         wash2 = "When the occasion proper for it shall arrive, I shall endeavor to express
                  the high sense I entertain of this distinguished honor.")

# tokenize without punctuation, then drop English stop words
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
