quanteda (version 0.9.9-50)

collocations: detect collocations from text

Description

Detects collocations in texts or a corpus, returning a data.table of collocations and their scores, sorted in descending order of the association measure. By default (punctuation = "dontspan"), words separated by punctuation delimiters are not counted as adjacent and hence are not eligible to be collocations.
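The adjacency rule described above can be illustrated without quanteda. The following is a minimal base-R sketch, not the package's implementation: bigrams_dontspan is a hypothetical helper, and its regular-expression tokenizer is an assumption that will not match quanteda's tokens() in every case.

```r
# Hypothetical helper (not part of quanteda): form lower-cased bigrams from
# one text, refusing to pair words that have punctuation between them.
bigrams_dontspan <- function(text) {
  text <- tolower(text)
  # Tokens are runs of alphanumerics or single punctuation characters
  toks <- regmatches(text, gregexpr("[[:alnum:]]+|[[:punct:]]", text))[[1]]
  is_punct <- grepl("^[[:punct:]]$", toks)
  # The segment id increases at every punctuation token
  segment <- cumsum(is_punct)
  words <- toks[!is_punct]
  seg <- segment[!is_punct]
  n <- length(words)
  if (n < 2) return(character(0))
  # Keep only neighbouring words that fall in the same segment
  same_segment <- seg[-n] == seg[-1]
  paste(words[-n][same_segment], words[-1][same_segment])
}

bigrams_dontspan("This is software testing: looking for (word) pairs!")
```

In this sketch "testing looking" is never formed because a colon separates the two words, which mirrors the default behaviour the description refers to.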

Usage

collocations(x, method = c("lr", "chi2", "pmi", "dice", "all"), size = 2,
  n = NULL, tolower = TRUE, punctuation = c("dontspan", "ignore",
  "include"), ...)

Arguments

x
a character, corpus, or tokens object
method
association measure for detecting collocations. Let \(i\) index documents and \(j\) index features; \(n_{ij}\) denotes the observed counts and \(m_{ij}\) the expected counts in a collocations frequency table of dimensions \((J - size + 1)^2\). The available measures are:
"lr"
The likelihood ratio statistic \(G^2\), computed as: $$2 \sum_i \sum_j n_{ij} \log \frac{n_{ij}}{m_{ij}}$$
"chi2"
Pearson's \(\chi^2\) statistic, computed as: $$\sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$
"pmi"
point-wise mutual information score, computed as \(\log(n_{11}/m_{11})\)
"dice"
the Dice coefficient, computed as \(n_{11}/(n_{1.} + n_{.1})\)
"all"
returns all of the above
size
length of the collocation. Only bigram (size = 2) and trigram (size = 3) collocations are currently implemented. Can be c(2,3) (or 2:3) to return both bi- and tri-gram collocations.
n
the number of collocations to return, sorted in descending order of the requested statistic, or \(G^2\) if none is specified.
tolower
convert collocations to lower case if TRUE (default)
punctuation
how to handle tokens separated by punctuation characters. Options are:
dontspan
do not form collocations from tokens separated by punctuation characters (default)
ignore
ignore punctuation characters when forming collocations, meaning that collocations will include those separated by punctuation characters in the text. The punctuation characters themselves are not returned.
include
do not treat punctuation characters specially, meaning that collocations will include punctuation characters as tokens
...
additional parameters passed to tokens
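
The association measures listed under method can be made concrete with a small base-R sketch. It assumes a 2 x 2 table of observed counts for one candidate bigram (first word present or absent, second word present or absent), reads \(i\) and \(j\) as the rows and columns of that table, and takes the expected counts \(m_{ij}\) under independence. assoc_measures is a hypothetical helper, not a quanteda function, and the package's own computation may differ in detail.

```r
# Hypothetical helper: association measures from a 2 x 2 table of observed
# counts. n11 = both words together, n12 = first word without the second,
# n21 = second word without the first, n22 = neither.
assoc_measures <- function(n11, n12, n21, n22) {
  obs <- matrix(c(n11, n12, n21, n22), nrow = 2, byrow = TRUE)
  N <- sum(obs)
  # Expected counts under independence: row total * column total / N
  expected <- outer(rowSums(obs), colSums(obs)) / N
  # Cells with a zero observed count contribute 0 to G^2
  g2_terms <- ifelse(obs > 0, obs * log(obs / expected), 0)
  c(
    lr   = 2 * sum(g2_terms),
    chi2 = sum((obs - expected)^2 / expected),
    pmi  = log(obs[1, 1] / expected[1, 1]),
    # Follows the formula on this page with the denominator grouped;
    # the textbook Dice coefficient doubles the numerator.
    dice = obs[1, 1] / (sum(obs[1, ]) + sum(obs[, 1]))
  )
}

assoc_measures(n11 = 8, n12 = 2, n21 = 2, n22 = 8)
```

When the two words occur independently (for example, all four cells equal), the likelihood ratio, chi-squared, and PMI values in this sketch are all zero.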

Value

a collocations class object: a specially classed data.table consisting of collocations, their frequencies, and the computed association measure(s).

References

McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.

See Also

tokens_ngrams

Examples

## Not run: ------------------------------------
# txt <- c("This is software testing: looking for (word) pairs!  
#          This [is] a software testing again. For.",
#          "Here: this is more Software Testing, looking again for word pairs.")
# collocations(txt, punctuation = "dontspan") # default
# collocations(txt, punctuation = "dontspan", remove_punct = TRUE)  # includes "testing looking"
# collocations(txt, punctuation = "ignore", remove_punct = TRUE)    # same as previous 
# collocations(txt, punctuation = "include", remove_punct = FALSE)  # keep punctuation as tokens
# 
# collocations(txt, size = 2:3)
# removeFeatures(collocations(txt, size = 2:3), stopwords("english"))
# 
# collocations("@textasdata We really, really love the #quanteda package - thanks!!")
# collocations("@textasdata We really, really love the #quanteda package - thanks!!",
#               remove_twitter = TRUE)
# 
# collocations(data_corpus_inaugural[49:57], n = 10)
# collocations(data_corpus_inaugural[49:57], method = "all", n = 10)
# collocations(data_corpus_inaugural[49:57], method = "chi2", size = 3, n = 10)
# collocations(corpus_subset(data_corpus_inaugural, Year>1980), method = "pmi", size = 3, n = 10)
## ---------------------------------------------
