Tally bag-of-words ngram features
ngramTokens(
texts,
wstem = "all",
ngrams = 1,
language = "english",
punct = TRUE,
stop.words = TRUE,
number.words = TRUE,
per.100 = FALSE,
overlap = 1,
sparse = 0.995,
verbose = FALSE,
vocabmatch = NULL,
num.mc.cores = 1
)
Returns a matrix of feature counts.
texts: character Vector of texts to featurize.
wstem: character Which words should be stemmed? Defaults to "all".
ngrams: numeric Vector of ngram lengths to be included. Default is 1 (i.e. unigrams only).
language: character Language for stemming. Default is "english".
punct: logical Should punctuation be kept as tokens? Default is TRUE.
stop.words: logical Should stop words be kept? Default is TRUE.
number.words: logical Should numbers be kept as words? Default is TRUE.
per.100: logical Should counts be expressed as frequency per 100 words? Default is FALSE.
overlap: numeric Threshold (as cosine distance) for including ngrams that constitute other included phrases. Default is 1 (i.e. all ngrams included).
sparse: numeric Maximum feature sparsity for inclusion (1 = include all features). Default is 0.995.
verbose: logical Should the package report token counts after each ngram level? Useful for long-running code. Default is FALSE.
vocabmatch: matrix A previous token count matrix whose vocabulary the new matrix should be coerced to match. Default is NULL (i.e. no token matching).
num.mc.cores: numeric Number of cores for parallel processing; see parallel::detectCores(). Default is 1.
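As a sketch of how vocabmatch can align a new corpus to a training vocabulary (the train/test split below is purely illustrative, assuming the doc2concrete package and its bundled feedback_dat data are loaded):

```r
library(doc2concrete)

# Illustrative split of the bundled feedback texts
train_texts <- feedback_dat$feedback[1:100]
test_texts  <- feedback_dat$feedback[101:150]

# Featurize the training texts
train_mat <- ngramTokens(train_texts, ngrams = 1)

# Coerce the test matrix to the training vocabulary, so that
# columns line up with any model fit on train_mat
test_mat <- ngramTokens(test_texts, ngrams = 1, vocabmatch = train_mat)

identical(colnames(train_mat), colnames(test_mat))  # should be TRUE
```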
This function produces ngram featurizations of text using the quanteda package. It complements the doc2concrete function by demonstrating how to build a feature set for training a new detection algorithm in other contexts.
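For instance, one possible sketch of training a new detection model on these features (the glmnet package and the binary outcome y are assumptions for illustration, not part of this package):

```r
library(doc2concrete)
library(glmnet)

# Build a unigram + bigram feature matrix from the bundled texts
X <- ngramTokens(feedback_dat$feedback, ngrams = 1:2)

# Hypothetical binary outcome aligned with the texts,
# standing in for whatever label you want to detect
y <- rbinom(nrow(X), 1, 0.5)

# Cross-validated LASSO as one possible detection algorithm
model <- cv.glmnet(as.matrix(X), y, family = "binomial")
```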
# Unigram features only
dim(ngramTokens(feedback_dat$feedback, ngrams = 1))
# Unigrams, bigrams, and trigrams
dim(ngramTokens(feedback_dat$feedback, ngrams = 1:3))