
This function extracts the stem of each word in the given vector.
wordStem(words, language = "porter")
words: a character vector of words whose stems are to be extracted.
language: the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).
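As a small sketch of the two ways to name a language, the following assumes the SnowballC package is installed and that "en" and "eng" are the ISO-639 codes corresponding to "english". The example words are arbitrary.

```r
library(SnowballC)

# Arbitrary example words; any character vector would do.
w <- c("win", "winning", "winner")

# Select the stemmer by full language name...
by_name <- wordStem(w, language = "english")

# ...or by its two- or three-letter ISO-639 code (assumed: "en", "eng").
by_iso2 <- wordStem(w, language = "en")
by_iso3 <- wordStem(w, language = "eng")

# All three calls should select the same stemmer.
identical(by_name, by_iso2)
identical(by_name, by_iso3)

# Full names accepted by the installed version.
getStemLanguages()
```

Calling getStemLanguages() first is a reasonable habit, since the set of supported languages depends on the installed version of the package.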
A character vector of the same length as the input vector, in which each element is the stem of the corresponding input word. Elements of the vector are converted to UTF-8 encoding before the stemming is performed, and the returned elements are marked as such when they contain non-ASCII characters.
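The shape of the return value can be checked with a short sketch like the one below. It assumes the SnowballC package is installed; the input vector is made up for illustration.

```r
library(SnowballC)

# Arbitrary input with a missing value in the middle.
x <- c("winning", NA, "winners")
stems <- wordStem(x, language = "english")

# One output element per input element, in the same order.
length(stems) == length(x)

# The missing value comes back as NA; the other elements are stemmed.
is.na(stems)

# Inspect the declared encoding of each returned element.
Encoding(stems)
```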
This uses Dr. Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.
See http://www.loc.gov/standards/iso639-2/php/code_list.php for a list of ISO-639 language codes.
# Simple example
wordStem(c("win", "winning", "winner"))

# Test the supplied vocabulary
for (lang in getStemLanguages()) {
  # Each file provides `voc`: words in voc[[1]], expected stems in voc[[2]]
  load(system.file("words", paste0(lang, ".RData"), package = "SnowballC"))
  stopifnot(all(wordStem(voc[[1]], lang) == voc[[2]]))
}

# A missing value is returned as NA
stopifnot(is.na(wordStem(NA)))