Detects collocations from texts or a corpus, returning a data.frame of
collocations and their scores, sorted in descending order of the association
measure. By default (spanPunct = FALSE), words separated by punctuation
delimiters are not counted as adjacent and hence are not eligible to be
collocations.
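The adjacency rule can be illustrated with a minimal base R sketch. The helper `adjacent_pairs()` and its `span_punct` argument are hypothetical names introduced only for this illustration; they are not part of the package and do not reproduce its tokenizer.

```r
# Hypothetical helper (not part of the package): list adjacent word pairs,
# optionally refusing to pair words that are separated by punctuation.
adjacent_pairs <- function(text, span_punct = FALSE) {
  # split into word tokens and single punctuation tokens
  toks <- regmatches(text, gregexpr("[[:alnum:]]+|[[:punct:]]", text))[[1]]
  if (span_punct) {
    # drop punctuation so that the words on either side become adjacent
    toks <- toks[!grepl("^[[:punct:]]$", toks)]
  }
  if (length(toks) < 2) return(character(0))
  first <- toks[-length(toks)]
  second <- toks[-1]
  # keep only pairs in which both members are words
  keep <- !grepl("^[[:punct:]]$", first) & !grepl("^[[:punct:]]$", second)
  paste(first[keep], second[keep])
}

txt <- "New York, London and Paris"
adjacent_pairs(txt)                     # "York London" is not a candidate
adjacent_pairs(txt, span_punct = TRUE)  # "York London" becomes a candidate
```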
collocations2(x, method = c("lr", "chi2", "pmi", "dice"), features = "*",
valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
min_count = 1, size = 2, ...)
method
association measure for detecting collocations. Let \(i\) index documents and \(j\) index features; \(n_{ij}\) refers to the observed counts and \(m_{ij}\) to the expected counts in a collocations frequency table of dimensions \((J - size + 1)^2\). Available measures are computed as:
"lr"
The likelihood ratio statistic \(G^2\), computed as: $$2 \sum_i \sum_j n_{ij} \log \frac{n_{ij}}{m_{ij}}$$
"chi2"
Pearson's \(\chi^2\) statistic, computed as: $$\sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$
"pmi"
point-wise mutual information score, computed as \(\log (n_{11} / m_{11})\)
"dice"
the Dice coefficient, computed as \(n_{11} / (n_{1.} + n_{.1})\)
"all"
returns all of the above
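As a worked illustration of the measures above, the following base R sketch scores a single bigram from a 2 x 2 table of observed counts. The helper `assoc_scores()` is a hypothetical name, and the expected counts are assumed to come from the usual independence model (row total times column total over the grand total); the source does not state how the function itself derives \(m_{ij}\).

```r
# Hypothetical helper (not part of the package): association scores for one
# bigram, given a 2 x 2 matrix of observed counts n where n[1, 1] counts
# "word1 followed by word2".
assoc_scores <- function(n) {
  # ASSUMPTION: expected counts under independence of rows and columns
  m <- outer(rowSums(n), colSums(n)) / sum(n)
  # cells with zero observed count contribute nothing to G^2
  g2_terms <- ifelse(n > 0, n * log(n / m), 0)
  c(
    lr   = 2 * sum(g2_terms),
    chi2 = sum((n - m)^2 / m),
    pmi  = log(n[1, 1] / m[1, 1]),
    # formula as documented here, read as n11 / (n1. + n.1)
    dice = n[1, 1] / (rowSums(n)[1] + colSums(n)[1])
  )
}

# observed counts: 8 joint occurrences in 1000 adjacent pairs
n <- matrix(c(8, 4, 2, 986), nrow = 2)
scores <- assoc_scores(n)
scores
```

Sorting candidate bigrams by any one of these scores in descending order gives the kind of ranking described at the top of this page.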
features
features to be selected for collocations
valuetype
how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive
if TRUE, ignore case when matching features
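The three matching modes and the case option can be sketched in base R as follows. `select_features()` is a hypothetical helper written for this illustration only; it approximates the described behaviour with `utils::glob2rx()` and `grepl()` and is not the package's own matching code.

```r
# Hypothetical helper (not part of the package): select features from a
# character vector according to a pattern and a matching mode.
select_features <- function(feats, pattern,
                            valuetype = c("glob", "regex", "fixed"),
                            case_insensitive = TRUE) {
  valuetype <- match.arg(valuetype)
  if (valuetype == "fixed") {
    # exact matching of the whole feature
    if (case_insensitive) {
      return(feats[tolower(feats) == tolower(pattern)])
    }
    return(feats[feats == pattern])
  }
  # translate a glob such as "unite*" into an anchored regular expression
  rx <- if (valuetype == "glob") utils::glob2rx(pattern) else pattern
  feats[grepl(rx, feats, ignore.case = case_insensitive)]
}

feats <- c("United", "unite", "states", "State")
select_features(feats, "unite*")                       # glob, case ignored
select_features(feats, "^state", valuetype = "regex")  # regular expression
select_features(feats, "State", valuetype = "fixed", case_insensitive = FALSE)
select_features(feats, "*")                            # "*" keeps everything
```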
min_count
exclude collocations that occur fewer than this many times
size
length of the collocation. Only bigram (n=2) and trigram (n=3) collocations are currently implemented. Can be c(2,3) (or 2:3) to return both bi- and tri-gram collocations.
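To make the roles of the collocation length and the minimum count concrete, here is a small base R sketch that counts adjacent bigrams and trigrams in a token vector and drops those below a threshold. `count_ngrams()` is a hypothetical helper, not the package's implementation, and it only counts candidates without scoring them.

```r
# Hypothetical helper (not part of the package): count adjacent n-grams of the
# requested sizes and keep those occurring at least min_count times.
count_ngrams <- function(toks, size = 2, min_count = 1) {
  stopifnot(all(size %in% 2:3))  # only bigrams and trigrams, as documented
  grams <- unlist(lapply(size, function(n) {
    if (length(toks) < n) return(character(0))
    starts <- seq_len(length(toks) - n + 1)
    vapply(starts, function(i) paste(toks[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  counts <- counts[counts >= min_count]
  sort(counts, decreasing = TRUE)
}

toks <- c("new", "york", "city", "and", "new", "york", "state")
count_ngrams(toks, size = 2)                    # all bigrams
count_ngrams(toks, size = 2:3, min_count = 2)   # bi- and trigrams seen twice
```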
...
additional parameters passed to tokens
a collocations class object: a specially classed data.table consisting of collocations, their frequencies, and the computed association measure(s).
McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.