quanteda (version 0.8.2-1)

tokenize: tokenize a set of texts

Usage

tokenize(x, ...)

## S3 method for class 'character':
tokenize(x, what = c("word", "sentence", "character",
  "fastestword", "fasterword"), removeNumbers = FALSE,
  removePunct = FALSE, removeSeparators = TRUE,
  removeTwitter = FALSE, ngrams = 1, concatenator = "_",
  simplify = FALSE, verbose = FALSE, ...)

## S3 method for class 'corpus': tokenize(x, ...)

Arguments

x
The text(s) or corpus to be tokenized
...
additional arguments not used
what
the unit for splitting the text, defaulting to "word". Available alternatives are "sentence", "character", "fastestword", and "fasterword" (the last two are faster variants of "word" that split on whitespace only). See stringi-search-boundaries.
removeNumbers
remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
removePunct
remove all punctuation
removeSeparators
remove separators and separator characters (spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode "Separator" category) when removePunct = FALSE. Only applicable for what = "character" (when you probably want it to be FALSE) and for what = "word" (when you probably want it to be TRUE).
removeTwitter
remove Twitter characters @ and #; set to TRUE if you wish to eliminate these. Defaults to FALSE.

ngrams
integer vector of the n for n-grams, defaulting to 1 (unigrams). For bigrams, for instance, use 2; for bigrams and unigrams together, use 1:2.
concatenator
character used to join the components of n-gram tokens, defaulting to "_" (underscore).
simplify
if TRUE, return a character vector of tokens rather than a list of character vectors.
verbose
if TRUE, print timing messages to the console.
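Examples

The following sketch illustrates the options above; it assumes the quanteda package (version 0.8.x, where tokenize() is still the exported tokenizer) is installed, and the exact printed form of the result depends on that version's tokenizedTexts class.

```r
# Usage sketch, not run here: requires install.packages("quanteda"),
# version 0.8.x, where tokenize() is the exported tokenizer.
library(quanteda)

txt <- "Today is sunny: 2day we code! #rstats @user"

# Default word tokenization: punctuation kept as separate tokens
tokenize(txt)

# Drop punctuation and purely numeric tokens;
# "2day" survives removeNumbers because it merely starts with a digit
tokenize(txt, removePunct = TRUE, removeNumbers = TRUE)

# Sentence tokenization
tokenize("First sentence. Second sentence.", what = "sentence")

# Bigrams, joined with the default concatenator "_"
tokenize("a b c d", ngrams = 2)
# e.g. "a_b" "b_c" "c_d"

# Unigrams and bigrams together
tokenize("a b c d", ngrams = 1:2)
```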