An n-gram tokenizer with identical output to the NGramTokenizer
function from the RWeka package.
Usage
ngram_asweka(str, min = 2, max = 2, sep = " ")
Value
A character vector of n-grams, grouped in blocks of decreasing n (the
longest n-grams first). Within each block, n-grams appear in the order
they occur in the input. The output matches that of RWeka's n-gram
tokenizer.
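The ordering described above can be illustrated with a small base R
sketch. The helper name ngram_blocks_sketch is hypothetical, it splits
on single spaces only, and it is not the package's implementation; it
only mimics the "decreasing blocks of n" layout.

```r
# Hypothetical helper, not part of any package: lists n-grams in blocks of
# decreasing n, left to right within each block. Splits on single spaces only.
ngram_blocks_sketch <- function(str, min = 2, max = 2) {
  stopifnot(min >= 1, min <= max)
  words <- strsplit(str, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]          # drop empty tokens from repeated spaces
  out <- character(0)
  for (n in seq(max, min, by = -1)) {    # longest n-grams first
    if (n > length(words)) next          # no n-grams of this length exist
    starts <- seq_len(length(words) - n + 1)
    block <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, block)
  }
  out
}

ngram_blocks_sketch("A B A C", min = 2, max = 3)
# "A B A" "B A C" "A B" "B A" "A C"
```

Here the two 3-grams come first, followed by the three 2-grams, each
block in input order.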
Arguments
str
The input text.
min, max
The minimum and maximum values of 'n' (as in 'n-gram') to generate.
sep
A set of separator characters for the "words". See details for
information about how this works; it behaves a little differently
from the sep arguments of other R functions.
Details
This n-gram tokenizer behaves similarly to the tokenizer in RWeka, in
both its inputs and its return value. Unlike the tokenizer ngram(),
whose return is a special class of external pointers, this function
returns an ordinary vector, which can therefore be serialized via
save() or saveRDS().
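As a minimal sketch of the serialization point, the following saves
the result with saveRDS() and reads it back. It assumes the function is
provided by the 'ngram' package (inferred from the reference to ngram()
on this page) and is skipped if that package is not installed; the
input string and the variable names are illustrative only.

```r
# Assumption: ngram_asweka() is exported by the 'ngram' package.
roundtrip_ok <- NA  # stays NA if the package is not available

if (requireNamespace("ngram", quietly = TRUE)) {
  grams <- ngram::ngram_asweka("A B A C A B B", min = 2, max = 3)

  # The result is a plain vector, so it can be written to disk directly.
  path <- tempfile(fileext = ".rds")
  saveRDS(grams, path)
  restored <- readRDS(path)
  unlink(path)

  # The restored vector should be identical to the original.
  roundtrip_ok <- identical(grams, restored)
}

roundtrip_ok
```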