quanteda (version 4.2.0)

tokens_wordstem: Stem the terms in an object

Description

Apply a stemmer to words. These functions wrap SnowballC::wordStem so that it can be called without loading the entire SnowballC package. wordStem uses Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.

Usage

tokens_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  verbose = quanteda_options("verbose")
)

char_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  check_whitespace = TRUE
)

dfm_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  verbose = quanteda_options("verbose")
)

Value

tokens_wordstem() returns a tokens object whose word types have been stemmed.

char_wordstem() returns a character object whose word types have been stemmed.

dfm_wordstem() returns a dfm object whose word types (features) have been stemmed, and recombined to consolidate features made equivalent because of stemming.
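The consolidation described for dfm_wordstem() can be checked by comparing a dfm before and after stemming. This is a minimal sketch with made-up input text; it assumes the default stemmer language is English and relies on the quanteda helpers nfeat(), ntoken() and featnames().

```r
library(quanteda)

# illustrative input, not taken from the package documentation
toks <- tokens(c(one = "taxing taxes taxed", two = "my tax return"))
orig <- dfm(toks)
stemmed <- dfm_wordstem(orig)

# number of features (columns) before and after stemming
nfeat(orig)
nfeat(stemmed)

# features that share a stem are merged, so the total token count is kept
sum(ntoken(orig)) == sum(ntoken(stemmed))

# inspect the stemmed feature names
featnames(stemmed)
```

The feature count can only stay the same or shrink, because columns with the same stem are summed rather than dropped.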

Arguments

x

a character, tokens, or dfm object whose words are to be stemmed. If tokenized texts, the tokenization must be word-based.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes)

verbose

if TRUE, print the number of tokens and documents before and after the function is applied. The token count excludes padding.

check_whitespace

logical; if TRUE, stop with an error when trying to stem inputs containing whitespace
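The language and check_whitespace arguments can be tried out as follows. This is a hedged sketch: the German words and the "winning streak" string are made-up inputs, and the condition raised by the whitespace check is caught generically rather than assuming its exact type or message.

```r
library(quanteda)

# language names recognized by the Snowball stemmer
SnowballC::getStemLanguages()

# a language can be given by name or by ISO-639 code
toks_de <- tokens("Kinder Kindern spielen spielten")
by_name <- tokens_wordstem(toks_de, language = "german")
by_code <- tokens_wordstem(toks_de, language = "de")
identical(as.list(by_name), as.list(by_code))

# with the default check_whitespace = TRUE, untokenized input is rejected;
# capture the message instead of halting the script
res <- tryCatch(
  char_wordstem("winning streak"),
  error = function(e) conditionMessage(e),
  warning = function(w) conditionMessage(w)
)
res

# with the check disabled, the input is not rejected for containing whitespace
char_wordstem("winning streak", check_whitespace = FALSE)
```

Disabling the check is rarely what you want. Tokenizing first with tokens() and then calling tokens_wordstem() stems each word separately.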

References

https://snowballstem.org/

http://www.iso.org/iso/home/standards/language_codes.htm for the ISO-639 language codes

Examples

library(quanteda)

# example applied to tokens
txt <- c(one = "eating eater eaters eats ate",
         two = "taxing taxes taxed my tax return")
th <- tokens(txt)
tokens_wordstem(th)

# simple example
char_wordstem(c("win", "winning", "wins", "won", "winner"))

# example applied to a dfm
(origdfm <- dfm(tokens(txt)))
dfm_wordstem(origdfm)
