frequentTerms: List most frequent terms of a corpus

Description

List terms with the highest number of occurrences in the document-term matrix of a corpus, possibly grouped by the levels of a variable.

Usage

frequentTerms(dtm, variable = NULL, n = 25)

Value

If variable = NA, one matrix with columns “Global” and “Global %” (see below). Otherwise, a list of matrices, one for each level of the variable, with seven columns:

“% Term/Level”: the term's occurrences in the level, as a percentage of all term occurrences in the level.
“% Level/Term”: the percentage of the term's occurrences that appear in the level (rather than in other levels).
“Global %”: the term's occurrences in the corpus, as a percentage of all term occurrences in the corpus.
“Level”: the number of occurrences of the term in the level (“internal”).
“Global”: the number of occurrences of the term in the corpus.
“t value”: the quantile of a normal distribution corresponding to the probability “Prob.”.
“Prob.”: the probability of observing such an extreme (high or low) number of occurrences of the term in the level, under a hypergeometric distribution.
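
The column definitions above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy matrix, the grouping vector and the level "A" are hypothetical, and the way “Prob.” and “t value” are derived here (a one-sided hypergeometric tail, then a signed normal quantile) is an assumption about how the description could be read.

```r
# Toy document-term counts (hypothetical): 4 documents x 3 terms.
dtm <- matrix(c(2, 0, 1,
                1, 1, 0,
                0, 3, 1,
                0, 2, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("apple", "pear", "plum")))
variable <- c("A", "A", "B", "B")  # one value per row of dtm

# Term counts summed over the documents of each level, and over the corpus.
by_level <- rowsum(dtm, group = variable)
global <- colSums(dtm)

level <- "A"
level_counts <- by_level[level, ]
n_level <- sum(level_counts)  # all term occurrences in the level
n_total <- sum(global)        # all term occurrences in the corpus

# Assumption: "Prob." is the tail probability on the side of the observed
# deviation from the expected count, under a hypergeometric distribution.
expected <- global * n_level / n_total
over <- level_counts >= expected
prob <- ifelse(over,
               phyper(level_counts - 1, global, n_total - global, n_level,
                      lower.tail = FALSE),
               phyper(level_counts, global, n_total - global, n_level))

# Assumption: "t value" is the normal quantile of "Prob.", signed so that
# over-represented terms get positive values.
t_value <- ifelse(over, -qnorm(prob), qnorm(prob))

stats <- cbind("% Term/Level" = 100 * level_counts / n_level,
               "% Level/Term" = 100 * level_counts / global,
               "Global %"     = 100 * global / n_total,
               "Level"        = level_counts,
               "Global"       = global,
               "t value"      = t_value,
               "Prob."        = prob)

# Most frequent terms in the level first, as the function's purpose suggests.
stats <- stats[order(stats[, "Level"], decreasing = TRUE), , drop = FALSE]
print(round(stats, 3))
```

In this toy corpus every occurrence of "apple" falls in level "A", so its “% Level/Term” is 100 while its “% Term/Level” is 60 (3 of the 5 term occurrences in the level).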

Arguments

dtm: a document-term matrix.
variable: a vector whose length is the number of rows of dtm, or NULL to report most frequent terms by document; use NA to report most frequent terms in the whole corpus.
n: the number of terms to report for each level.
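
A hedged usage sketch for the three forms of the variable argument follows. It assumes that dtm can be built with DocumentTermMatrix() from the tm package and that the package providing frequentTerms() is already attached; the texts and the grouping vector are hypothetical. The calls are guarded so that the snippet does nothing when either assumption does not hold.

```r
# Hypothetical corpus: four short documents and one grouping value per document.
texts <- c("apple pear apple",
           "pear plum pear",
           "carrot leek carrot",
           "leek onion leek")
variable <- c("fruit", "fruit", "vegetable", "vegetable")

# Assumption: tm is installed and frequentTerms() is available in the session.
if (requireNamespace("tm", quietly = TRUE) && exists("frequentTerms")) {
  corpus <- tm::VCorpus(tm::VectorSource(texts))
  dtm <- tm::DocumentTermMatrix(corpus)

  # A vector with one value per row of dtm: one matrix per level.
  by_level <- frequentTerms(dtm, variable, n = 5)

  # NA: a single matrix for the whole corpus.
  whole <- frequentTerms(dtm, NA, n = 5)

  # NULL (the default): most frequent terms by document.
  by_document <- frequentTerms(dtm, n = 5)

  print(by_level)
}
```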

Author

Milan Bouchet-Valat

Details