
quanteda (version 1.5.1)

tokens_replace: Replace tokens in a tokens object

Description

Substitute token types based on vectorized one-to-one matching. This function was created for lemmatization or user-defined stemming. It supports substitution of multi-word features by multi-word features, but substitution is fastest when pattern and replacement are character vectors and valuetype = "fixed", because in that case the function only substitutes the types of tokens. To replace dictionary values, use tokens_lookup with exclusive = FALSE.
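
As a rough illustration of the dictionary case mentioned above, the sketch below uses tokens_lookup with exclusive = FALSE on a made-up sentence. The sentence, the dictionary key "tax", and the object names are assumptions for illustration only and do not come from the package documentation.

```r
library(quanteda)

# Hypothetical input text and dictionary, for illustration only
toks_demo <- tokens("Ministers debated taxes and taxation")
dict_demo <- dictionary(list(tax = c("taxes", "taxation")))

# exclusive = FALSE keeps unmatched tokens and replaces matched ones
# with the dictionary key instead of dropping everything else
toks_dict <- tokens_lookup(toks_demo, dict_demo, exclusive = FALSE)
print(toks_dict)
```

Unmatched tokens stay in place, so the result has the same number of tokens as the input.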

Usage

tokens_replace(x, pattern, replacement, valuetype = "glob",
  case_insensitive = TRUE, verbose = quanteda_options("verbose"))

Arguments

x

tokens object whose token elements will be replaced

pattern

a character vector or list of character vectors. See pattern for more details.

replacement

a character vector or (if pattern is a list) list of character vectors of the same length as pattern

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore case when matching, if TRUE

verbose

print status messages if TRUE
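
The interaction of valuetype = "fixed" and case_insensitive described above can be sketched on a small made-up example. The input string and object names here are hypothetical; only the argument names come from the usage shown on this page.

```r
library(quanteda)

# Hypothetical three-token input, for illustration only
toks_case <- tokens("Focus focus FOCUSED")

# Default case_insensitive = TRUE: "focused" also matches "FOCUSED"
toks_ci <- tokens_replace(toks_case, "focused", "focus",
                          valuetype = "fixed")

# case_insensitive = FALSE: only an exact lower-case "focused" would match
toks_cs <- tokens_replace(toks_case, "focused", "focus",
                          valuetype = "fixed", case_insensitive = FALSE)

print(toks_ci)
print(toks_cs)
```

Exact, case-sensitive matching is the strictest setting, which is why the stemming example further down sets case_insensitive = FALSE when mapping every type to its stem.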

See Also

tokens_lookup

Examples

# NOT RUN {
library(quanteda)

# tokenize the built-in Irish budget speeches corpus, dropping punctuation
toks1 <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)

# lemmatization
infle <- c("foci", "focus", "focused", "focuses", "focusing", "focussed", "focusses")
lemma <- rep("focus", length(infle))
toks2 <- tokens_replace(toks1, infle, lemma, valuetype = "fixed")
kwic(toks2, "focus*")

# stemming
type <- types(toks1)
stem <- char_wordstem(type, "porter")
toks3 <- tokens_replace(toks1, type, stem, valuetype = "fixed", case_insensitive = FALSE)
identical(toks3, tokens_wordstem(toks1, "porter"))

# multi-multi substitution
toks4 <- tokens_replace(toks1,
                        phrase("Minister Deputy Lenihan"),
                        phrase("Minister Deputy Conor Lenihan"))
kwic(toks4, phrase("Minister Deputy Conor Lenihan"))
# }
