tokens_trim: Trim tokens using frequency threshold-based feature selection

Description

Returns a tokens object reduced in size based on document and term frequency, usually in terms of a minimum frequency, but may also be in terms of maximum frequencies. Setting a combination of minimum and maximum frequencies will select features based on a range.

Usage

tokens_trim(
  x,
  min_termfreq = NULL,
  max_termfreq = NULL,
  termfreq_type = c("count", "prop", "rank", "quantile"),
  min_docfreq = NULL,
  max_docfreq = NULL,
  docfreq_type = c("count", "prop", "rank", "quantile"),
  padding = FALSE,
  verbose = quanteda_options("verbose")
)

Value

A tokens object with reduced size.

Arguments

x: a dfm object
min_termfreq, max_termfreq: minimum/maximum values of feature frequencies across all documents, below/above which features will be removed
termfreq_type: how min_termfreq and max_termfreq are interpreted. "count" sums the frequencies; "prop" divides the term frequencies by the total sum; "rank" is matched against the inverted ranking of features in terms of overall frequency, so that 1, 2, ... are the highest and second highest frequency features, and so on; "quantile" sets the cutoffs according to the quantiles (see quantile()) of term frequencies.
min_docfreq, max_docfreq: minimum/maximum values of a feature's document frequency, below/above which features will be removed
docfreq_type: specify how min_docfreq and max_docfreq are interpreted. "count" is the same as [docfreq](x, scheme = "count"); "prop" divides the document frequencies by the total sum; "rank" is matched against the inverted ranking of document frequency, so that 1, 2, ... are the features with the highest and second highest document frequencies, and so on; "quantile" sets the cutoffs according to the quantiles (see quantile()) of document frequencies.
padding: if TRUE, leave an empty string where the removed tokens previously existed.
verbose: print messages

Examples

Run this code

toks <- tokens(data_corpus_inaugural)

# keep only words occurring >= 10 times and in >= 2 documents
tokens_trim(toks, min_termfreq = 10, min_docfreq = 2, padding = TRUE)

# keep only words occurring >= 10 times and no more than 90% of the documents
tokens_trim(toks, min_termfreq = 10, max_docfreq = 0.9, docfreq_type = "prop",
            padding = TRUE)

Run the code above in your browser using DataLab

Description

Usage

Value

Arguments

See Also

Examples