termFrequencies: Frequency of chosen terms in the corpus

Description

Report the number of occurrences of the chosen terms in the document-term matrix of a corpus, possibly grouped by the levels of a variable.

Usage

termFrequencies(dtm, terms, variable = NULL, n = 25, by.term = FALSE)

Value

If variable = NA, a matrix with two columns, “Global” and “Global %” (see below). Otherwise, an array with seven columns:

“% Term/Level”: the percentage of all term occurrences in the level that are occurrences of the term.
“% Level/Term”: the percentage of the term's occurrences that appear in the level (rather than in other levels).
“Global %”: the percentage of all term occurrences in the corpus that are occurrences of the term.
“Global”: the number of occurrences of the term in the corpus.
“Level”: the number of occurrences of the term in the level (“internal”).
“t value”: the quantile of a normal distribution corresponding to the probability “Prob.”.
“Prob.”: the probability of observing such an extreme (high or low) number of occurrences of the term in the level, under a hypergeometric distribution.
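
The seven columns above can be illustrated with a minimal base R sketch for a single term. This is not the package's implementation: the helper name term_stats, the toy data, the way the two tails are combined into “Prob.”, and the sign convention for “t value” are all assumptions made for illustration.

```r
# Hypothetical helper (not part of the package): computes the seven statistics
# for one term from a plain numeric matrix. A sparse document-term matrix
# would first need to be converted with as.matrix().
term_stats <- function(dtm, term, variable) {
  level <- rowsum(dtm, variable)  # occurrences per level (levels x terms)
  x <- level[, term]              # "Level": occurrences of the term in each level
  k <- rowSums(level)             # all term occurrences in each level
  K <- sum(dtm[, term])           # "Global": occurrences of the term in the corpus
  N <- sum(dtm)                   # all term occurrences in the corpus

  # Hypergeometric tails: the k occurrences of a level are treated as a draw
  # from N occurrences, K of which are the term of interest.
  p_low  <- phyper(x, K, N - K, k)                          # P(X <= x)
  p_high <- phyper(x - 1, K, N - K, k, lower.tail = FALSE)  # P(X >= x)

  # Assumption: "Prob." is the smaller tail, and "t value" is the matching
  # normal quantile, taken as -qnorm(prob) when the upper tail is the smaller.
  prob <- pmin(p_low, p_high)
  t_value <- ifelse(p_high < p_low, -qnorm(prob), qnorm(prob))

  cbind("% Term/Level" = 100 * x / k,
        "% Level/Term" = 100 * x / K,
        "Global %"     = 100 * K / N,
        "Level"        = x,
        "Global"       = K,
        "t value"      = t_value,
        "Prob."        = prob)
}

# Toy data: 4 documents, 3 terms, and a two-level grouping variable.
dtm <- matrix(c(2, 0, 1,
                1, 3, 0,
                0, 1, 4,
                3, 0, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("oil", "price", "market")))
variable <- factor(c("A", "A", "B", "B"))

res <- term_stats(dtm, "oil", variable)
print(round(res, 3))
```

Each row of the result corresponds to one level of variable. The values returned by termFrequencies itself may differ wherever the assumptions above do not match the package.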

Arguments

dtm: a document-term matrix.
terms: one or more terms, i.e. column names of dtm.
variable: a vector whose length is the number of rows of dtm, or NULL to report most frequent terms by document; use NA to report most frequent terms in the whole corpus.
n: the number of terms to report for each level.
by.term: whether the third dimension of the array should be terms instead of levels.

Author

Milan Bouchet-Valat

Details