Tally bag-of-words ngram features
ngramTokens(
texts,
wstem = "all",
ngrams = 1,
language = "english",
punct = TRUE,
stop.words = TRUE,
number.words = TRUE,
per.100 = FALSE,
overlap = 1,
sparse = 0.995,
verbose = FALSE,
vocabmatch = NULL,
num.mc.cores = 1
)
Returns a matrix of feature counts.
texts: character Vector of texts to featurize.
wstem: character Which words should be stemmed? Defaults to "all".
ngrams: numeric Vector of ngram lengths to be included. Default is 1 (i.e. unigrams only).
language: character Language for stemming. Default is "english".
punct: logical Should punctuation be kept as tokens? Default is TRUE.
stop.words: logical Should stop words be kept? Default is TRUE.
number.words: logical Should numbers be kept as words? Default is TRUE.
per.100: logical Should counts be expressed as frequency per 100 words? Default is FALSE.
overlap: numeric Threshold (as cosine distance) for including ngrams that constitute other included phrases. Default is 1 (i.e. all ngrams included).
sparse: numeric Maximum feature sparsity for inclusion (1 = include all features). Default is 0.995.
verbose: logical Should the package report token counts after each ngram level? Useful for long-running code. Default is FALSE.
vocabmatch: matrix A previous token count matrix whose vocabulary the new matrix should be coerced to match. Default is NULL (i.e. no token matching).
num.mc.cores: numeric Number of cores for parallel processing; see parallel::detectCores(). Default is 1.
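As a sketch of how vocabmatch can align a new corpus to a training vocabulary (the train/test split below is purely illustrative, assuming the doc2concrete package and its bundled feedback_dat data are loaded):

```r
library(doc2concrete)

# Illustrative split of the bundled feedback texts
train_texts <- feedback_dat$feedback[1:100]
test_texts  <- feedback_dat$feedback[101:150]

# Featurize the training texts
train_mat <- ngramTokens(train_texts, ngrams = 1)

# Coerce the test matrix to the training vocabulary, so that
# columns line up with any model fit on train_mat
test_mat <- ngramTokens(test_texts, ngrams = 1, vocabmatch = train_mat)

identical(colnames(train_mat), colnames(test_mat))  # should be TRUE
```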
This function produces ngram featurizations of text using the quanteda package. It complements the doc2concrete function by demonstrating how to build a feature set for training a new detection algorithm in other contexts.
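For instance, one possible sketch of training a new detection model on these features (the glmnet package and the binary outcome y are assumptions for illustration, not part of this package):

```r
library(doc2concrete)
library(glmnet)

# Build a unigram + bigram feature matrix from the bundled texts
X <- ngramTokens(feedback_dat$feedback, ngrams = 1:2)

# Hypothetical binary outcome aligned with the texts,
# standing in for whatever label you want to detect
y <- rbinom(nrow(X), 1, 0.5)

# Cross-validated LASSO as one possible detection algorithm
model <- cv.glmnet(as.matrix(X), y, family = "binomial")
```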
# Unigram features only
dim(ngramTokens(feedback_dat$feedback, ngrams = 1))
# Unigrams, bigrams, and trigrams
dim(ngramTokens(feedback_dat$feedback, ngrams = 1:3))