quanteda (version 0.8.2-1)

tokenize: tokenize a set of texts

Usage

tokenize(x, ...)

## S3 method for class 'character':
tokenize(x, what = c("word", "sentence", "character",
  "fastestword", "fasterword"), removeNumbers = FALSE,
  removePunct = FALSE, removeSeparators = TRUE,
  removeTwitter = FALSE, ngrams = 1, concatenator = "_",
  simplify = FALSE, verbose = FALSE, ...)

## S3 method for class 'corpus': tokenize(x, ...)

Arguments

x
The text(s) or corpus to be tokenized
...
additional arguments not used
what
the unit for splitting the text, defaulting to "word". Available alternatives are "sentence", "character", "fastestword", and "fasterword" (the last two are faster variants of "word" that split on whitespace only). See stringi-search-boundaries.
removeNumbers
remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
removePunct
remove all punctuation
removeSeparators
remove separators and separator characters (spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode "Separator" category) when removePunct = FALSE. Only applicable for what = "character" (when you probably want it to be FALSE) and for what = "word" (when you probably want it to be TRUE).
removeTwitter
remove Twitter characters @ and #; set to TRUE if you wish to eliminate these. Defaults to FALSE.

ngrams
integer vector of the n for n-grams, defaulting to 1 (unigrams). For bigrams, for instance, use 2; for bigrams and unigrams together, use 1:2.
concatenator
character used to join the components of n-gram tokens, defaulting to "_" (underscore).
simplify
if TRUE, return a character vector of tokens rather than a list of character vectors.
verbose
if TRUE, print timing messages to the console.
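Examples

The following sketch illustrates the options above; it assumes the quanteda package (version 0.8.x, where tokenize() is still the exported tokenizer) is installed, and the exact printed form of the result depends on that version's tokenizedTexts class.

```r
# Usage sketch, not run here: requires install.packages("quanteda"),
# version 0.8.x, where tokenize() is the exported tokenizer.
library(quanteda)

txt <- "Today is sunny: 2day we code! #rstats @user"

# Default word tokenization: punctuation kept as separate tokens
tokenize(txt)

# Drop punctuation and purely numeric tokens;
# "2day" survives removeNumbers because it merely starts with a digit
tokenize(txt, removePunct = TRUE, removeNumbers = TRUE)

# Sentence tokenization
tokenize("First sentence. Second sentence.", what = "sentence")

# Bigrams, joined with the default concatenator "_"
tokenize("a b c d", ngrams = 2)
# e.g. "a_b" "b_c" "c_d"

# Unigrams and bigrams together
tokenize("a b c d", ngrams = 1:2)
```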