R interfaces to Weka tokenizers.
AlphabeticTokenizer(x, control = NULL)
NGramTokenizer(x, control = NULL)
WordTokenizer(x, control = NULL)
A character vector with the tokenized strings.
a character vector with strings to be tokenized.
an object of class Weka_control, or a character vector of control options, or NULL (default).
Available options can be obtained on-line using the Weka Option Wizard WOW, or from the Weka documentation.
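For instance, the options understood by a particular tokenizer can be listed interactively; a minimal sketch, assuming the RWeka package is installed and Weka is available:

```r
library(RWeka)

## Print the control options accepted by NGramTokenizer,
## e.g. its minimum and maximum n-gram size settings.
WOW("NGramTokenizer")
```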
AlphabeticTokenizer
is an alphabetic string tokenizer, where
tokens are formed only from contiguous alphabetic sequences.
NGramTokenizer
splits strings into \(n\)-grams with given
minimal and maximal numbers of grams.
WordTokenizer
is a simple word tokenizer.
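A short usage sketch of the three tokenizers, assuming RWeka is installed (the sample sentence is illustrative only):

```r
library(RWeka)

x <- "The quick brown fox can't jump over 2 lazy dogs."

## Simple word tokenization.
WordTokenizer(x)

## Tokens are formed only from contiguous alphabetic sequences,
## so "can't" is split into two tokens and "2" yields no token.
AlphabeticTokenizer(x)

## Word n-grams of lengths 2 up to 3, set via a Weka_control object.
NGramTokenizer(x, Weka_control(min = 2, max = 3))
```

Each call returns a character vector of tokens, as described under Value above.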