collocations: detect collocations from text

Description

Detects collocations from texts or a corpus, returning a data.frame of collocations and their scores, sorted in descending order of the association measure. By default (punctuation = "dontspan"), words separated by punctuation delimiters are not counted as adjacent and hence are not eligible to be collocations.

Usage

collocations(x, method = c("lr", "chi2", "pmi", "dice", "all"), size = 2,
  n = NULL, tolower = TRUE, punctuation = c("dontspan", "ignore",
  "include"), ...)

Arguments

x

a character, corpus, or tokens object

method

association measure for detecting collocations. Let $i$ index documents and $j$ index features; $n_{ij}$ refers to the observed counts and $m_{ij}$ to the expected counts in a collocations frequency table of dimensions $(J - size + 1)^2$. Available measures are computed as:

"lr": the likelihood ratio statistic $G^2$, computed as: $$2 \sum_i \sum_j \left( n_{ij} \log \frac{n_{ij}}{m_{ij}} \right)$$
"chi2": Pearson's $\chi^2$ statistic, computed as: $$\sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$
"pmi": point-wise mutual information score, computed as $\log \frac{n_{11}}{m_{11}}$
"dice": the Dice coefficient, computed as $\frac{n_{11}}{n_{1.} + n_{.1}}$
"all": returns all of the above
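
The measures above can be illustrated on a single 2 x 2 table of counts. The following is a minimal base R sketch, not the package's implementation: the counts are made up for a hypothetical word pair, the expected counts are assumed to be the usual independence expectations (row total times column total over the grand total), and the Dice formula is read as $n_{11}$ divided by the sum $n_{1.} + n_{.1}$.

```r
# Hypothetical 2 x 2 table of counts for a candidate pair (w1, w2).
# Rows: first token is w1 / is not w1; columns: second token is w2 / is not w2.
n <- matrix(c(8,  2,
              4, 86),
            nrow = 2, byrow = TRUE)

# Expected counts under independence (an assumption about how m_ij is formed):
# (row total * column total) / grand total
m <- outer(rowSums(n), colSums(n)) / sum(n)

# "lr": likelihood ratio statistic G^2; empty cells contribute 0
lr <- 2 * sum(ifelse(n > 0, n * log(n / m), 0))

# "chi2": Pearson's chi-squared statistic
chi2 <- sum((n - m)^2 / m)

# "pmi": log ratio of observed to expected count in cell (1, 1)
pmi <- log(n[1, 1] / m[1, 1])

# "dice": n_11 over the sum of the first row and first column totals,
# following the formula as written in this documentation
dice <- n[1, 1] / (sum(n[1, ]) + sum(n[, 1]))

round(c(lr = lr, chi2 = chi2, pmi = pmi, dice = dice), 3)
```

All four scores grow as the observed count of the pair exceeds what independence would predict, which is why sorting on any of them pushes likely collocations to the top.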

size

length of the collocation. Only bigram (size = 2) and trigram (size = 3) collocations are currently implemented. Can be c(2,3) (or 2:3) to return both bi- and tri-gram collocations.
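
To make the effect of size concrete, here is a small base R sketch of how candidate sequences of length 2 and 3 could be enumerated from a token vector. The helper ngrams_of is hypothetical and is not a function of the package; it only shows which adjacent sequences become candidates for scoring.

```r
# Hypothetical helper: all runs of `size` adjacent tokens, joined by a space
ngrams_of <- function(toks, size) {
  if (length(toks) < size) return(character(0))
  starts <- seq_len(length(toks) - size + 1)
  vapply(starts,
         function(i) paste(toks[i:(i + size - 1)], collapse = " "),
         character(1))
}

toks <- c("this", "is", "software", "testing")

# Candidates for size = 2:3, i.e. bigrams followed by trigrams
cands <- unlist(lapply(2:3, ngrams_of, toks = toks))
cands
```

A token vector of length k yields k - size + 1 candidates for each size, so the example produces three bigrams and two trigrams.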

n

the number of collocations to return, sorted in descending order of the requested statistic, or $G^2$ if none is specified.

tolower

convert collocations to lower case if TRUE (default)

punctuation

how to handle tokens separated by punctuation characters. Options are:

dontspan: do not form collocations from tokens separated by punctuation characters (default)
ignore: ignore punctuation characters when forming collocations, meaning that collocations will include those separated by punctuation characters in the text. The punctuation characters themselves are not returned.
include: do not treat punctuation characters specially, meaning that collocations will include punctuation characters as tokens
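
The difference between the three options is easiest to see on adjacent pairs. The sketch below uses base R only and a hypothetical helper, pairs_by_policy, that is not part of the package. It assumes a simplistic tokenizer (runs of alphanumerics, plus single punctuation characters) and ignores any interaction with arguments passed on to tokens, such as remove_punct.

```r
# Hypothetical helper: adjacent token pairs under each punctuation policy
pairs_by_policy <- function(text,
                            punctuation = c("dontspan", "ignore", "include")) {
  punctuation <- match.arg(punctuation)
  # Simplistic tokenizer: words, and each punctuation character as its own token
  toks <- regmatches(text, gregexpr("[[:alnum:]]+|[[:punct:]]", text))[[1]]
  is_punct <- grepl("^[[:punct:]]$", toks)
  if (punctuation == "ignore") {
    # Drop punctuation first, so the words on either side become adjacent
    toks <- toks[!is_punct]
    is_punct <- rep(FALSE, length(toks))
  }
  if (length(toks) < 2) return(character(0))
  first  <- toks[-length(toks)]
  second <- toks[-1]
  keep <- rep(TRUE, length(first))
  if (punctuation == "dontspan") {
    # Discard every pair that touches a punctuation token
    keep <- !(is_punct[-length(toks)] | is_punct[-1])
  }
  paste(first[keep], second[keep])
}

txt <- "software testing, looking again"
pairs_by_policy(txt, "dontspan")  # "software testing" "looking again"
pairs_by_policy(txt, "ignore")    # "software testing" "testing looking" "looking again"
pairs_by_policy(txt, "include")   # "software testing" "testing ," ", looking" "looking again"
```

Only "ignore" produces "testing looking", because the comma is removed before pairs are formed; "dontspan" treats the comma as a boundary, and "include" keeps it as a token in its own right.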

...

additional parameters passed to tokens

Value

a collocations class object: a specially classed data.table consisting of collocations, their frequencies, and the computed association measure(s).

References

McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.

Examples

# NOT RUN {
txt <- c("This is software testing: looking for (word) pairs!  
         This [is] a software testing again. For.",
         "Here: this is more Software Testing, looking again for word pairs.")
collocations(txt, punctuation = "dontspan") # default
collocations(txt, punctuation = "dontspan", remove_punct = TRUE)  # includes "testing looking"
collocations(txt, punctuation = "ignore", remove_punct = TRUE)    # same as previous 
collocations(txt, punctuation = "include", remove_punct = FALSE)  # keep punctuation as tokens

collocations(txt, size = 2:3)
removeFeatures(collocations(txt, size = 2:3), stopwords("english"))

collocations("@textasdata We really, really love the #quanteda package - thanks!!")
collocations("@textasdata We really, really love the #quanteda package - thanks!!",
              remove_twitter = TRUE)

collocations(data_corpus_inaugural[49:57], n = 10)
collocations(data_corpus_inaugural[49:57], method = "all", n = 10)
collocations(data_corpus_inaugural[49:57], method = "chi2", size = 3, n = 10)
collocations(corpus_subset(data_corpus_inaugural, Year > 1980), method = "pmi", size = 3, n = 10)
# }