collocations2: detect collocations from text

Description

Detects collocations in texts or a corpus, returning a data.frame of collocations and their scores, sorted in descending order of the association measure. By default (spanPunct = FALSE), words separated by punctuation delimiters are not counted as adjacent and hence are not eligible to be collocations.
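The adjacency rule described above can be illustrated with a minimal base R sketch. It does not use this package's tokenizer: the example text, the whitespace splitting, and the use of [[:punct:]] as the delimiter set are all assumptions made for illustration only.

```r
# Hypothetical input text
txt <- "New York, New York is a city. New Jersey is a state."

# Split at punctuation first, so that word pairs never span a delimiter
segments <- strsplit(tolower(txt), "[[:punct:]]+")[[1]]

# Form adjacent word pairs within each segment
bigrams <- unlist(lapply(segments, function(s) {
  w <- strsplit(trimws(s), "\\s+")[[1]]
  w <- w[nzchar(w)]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))

# Frequency of each candidate bigram, most frequent first
counts <- sort(table(bigrams), decreasing = TRUE)
print(counts)
```

Because the comma separates the first "York" from the second "New", the pair "york new" is never formed, which is the behaviour the default setting describes.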

Usage

collocations2(x, method = c("lr", "chi2", "pmi", "dice"), features = "*",
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  min_count = 1, size = 2, ...)

Arguments

x

a character, corpus, or tokens object

method

association measure for detecting collocations. Let $i$ index documents and $j$ index features; $n_{ij}$ denotes the observed counts and $m_{ij}$ the expected counts in a collocations frequency table of dimensions $(J - size + 1)^2$. The available measures are computed as:

"lr": The likelihood ratio statistic $G^2$, computed as: $$2 \sum_i \sum_j n_{ij} \log \frac{n_{ij}}{m_{ij}}$$
"chi2": Pearson's $\chi^2$ statistic, computed as: $$\sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$
"pmi": pointwise mutual information score, computed as $\log (n_{11}/m_{11})$
"dice": the Dice coefficient, computed as $n_{11}/(n_{1.} + n_{.1})$
"all": returns all of the above
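As a worked illustration of the formulas above, the following base R sketch computes each measure for a single candidate bigram from a 2x2 table. The counts are invented, and treating the expected counts $m_{ij}$ as those under independence (row total times column total over the grand total) is an assumption; this is not the function's internal implementation.

```r
# Hypothetical 2x2 table for a candidate bigram "w1 w2".
# Rows: first word is w1 / is not w1; columns: second word is w2 / is not w2.
n <- matrix(c( 30,   70,
              120, 9780),
            nrow = 2, byrow = TRUE)

# Expected counts under independence (assumed definition of m_ij)
m <- outer(rowSums(n), colSums(n)) / sum(n)

# Likelihood ratio statistic G^2; cells with n_ij = 0 contribute 0
lr <- 2 * sum(ifelse(n > 0, n * log(n / m), 0))

# Pearson's chi-squared statistic
chi2 <- sum((n - m)^2 / m)

# Pointwise mutual information for the (1, 1) cell
pmi <- log(n[1, 1] / m[1, 1])

# Dice coefficient as written on this page: n_11 / (n_1. + n_.1).
# Note: the more common textbook definition multiplies this by 2.
dice <- n[1, 1] / (sum(n[1, ]) + sum(n[, 1]))

scores <- c(lr = lr, chi2 = chi2, pmi = pmi, dice = dice)
print(round(scores, 4))
```

The chi-squared value can be cross-checked against stats::chisq.test(n, correct = FALSE), which computes the same Pearson statistic for a 2x2 table.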

features

features to be selected for collocations

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore the case when matching features if TRUE
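The three valuetype options and case-insensitive matching can be emulated in base R as in the sketch below. The feature vector and patterns are hypothetical, and converting globs with utils::glob2rx() is only one way to mimic glob matching; it is not necessarily how the function matches features internally.

```r
# Hypothetical feature vocabulary
feats <- c("United", "unite", "states", "State", "statesman")

# "glob": wildcard pattern, converted to a regular expression by glob2rx()
glob_hits <- feats[grepl(glob2rx("state*"), feats, ignore.case = TRUE)]

# "regex": the pattern is used directly as a regular expression
regex_hits <- feats[grepl("^unite", feats, ignore.case = TRUE)]

# "fixed": exact match on the whole feature, with case folded by hand
fixed_hits <- feats[tolower(feats) == tolower("state")]

print(list(glob = glob_hits, regex = regex_hits, fixed = fixed_hits))
```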

min_count

exclude collocations below this count

size

length of the collocation. Only bigram (size = 2) and trigram (size = 3) collocations are currently implemented. Can be c(2, 3) (or 2:3) to return both bigram and trigram collocations.

...

additional parameters passed to tokens

Value

a collocations class object: a specially classed data.table consisting of collocations, their frequencies, and the computed association measure(s).

References

McInnes, B. T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.
