quanteda (version 0.9.9-65)

collocations: detect collocations from text

Description

Detects collocations in texts or a corpus, returning a data.frame of collocations and their scores, sorted in descending order of the association measure. By default (punctuation = "dontspan"), words separated by punctuation delimiters are not counted as adjacent and hence are not eligible to be collocations.

Usage

collocations(x, method = c("lr", "chi2", "pmi", "dice", "all"), size = 2,
  n = NULL, tolower = TRUE, punctuation = c("dontspan", "ignore",
  "include"), ...)

Arguments

x

a character, corpus, or tokens object

method

association measure for detecting collocations. Let \(i\) index documents and \(j\) index features; \(n_{ij}\) refers to the observed counts and \(m_{ij}\) to the expected counts in a collocations frequency table of dimensions \((J - size + 1)^2\). Available measures are computed as:

"lr"

The likelihood ratio statistic \(G^2\), computed as: $$2 \sum_i \sum_j n_{ij} \log \frac{n_{ij}}{m_{ij}}$$

"chi2"

Pearson's \(\chi^2\) statistic, computed as: $$\sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$

"pmi"

pointwise mutual information score, computed as \(\log (n_{11}/m_{11})\)

"dice"

the Dice coefficient, computed as \(n_{11}/(n_{1.} + n_{.1})\)

"all"

returns all of the above
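
To make the measures concrete, the following base R sketch computes them for a single candidate bigram from its 2 x 2 table of observed counts. It is an illustration only and not quanteda's implementation: the helper name `assoc_measures` and the counts are hypothetical, and the expected counts \(m_{ij}\) are assumed to be those implied by independence of the two words.

```r
# Hypothetical helper (not part of quanteda): association measures for one
# candidate bigram, given its observed 2 x 2 contingency table.
#   n11: word1 followed by word2       n12: word1 followed by another word
#   n21: another word before word2     n22: neither
assoc_measures <- function(n11, n12, n21, n22) {
  obs <- matrix(c(n11, n12, n21, n22), nrow = 2, byrow = TRUE)
  # Assumption: expected counts under independence, m_ij = n_i. * n_.j / N
  m <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  # Cells with a zero observed count contribute 0 to G^2
  g2_terms <- ifelse(obs > 0, obs * log(obs / m), 0)
  c(lr   = 2 * sum(g2_terms),
    chi2 = sum((obs - m)^2 / m),
    pmi  = log(obs[1, 1] / m[1, 1]),
    # Dice as n11 / (n1. + n.1); the conventional Dice coefficient
    # additionally has a factor of 2 in the numerator.
    dice = obs[1, 1] / (sum(obs[1, ]) + sum(obs[, 1])))
}

# Illustrative counts only, not taken from any real corpus
res <- assoc_measures(n11 = 8, n12 = 2, n21 = 4, n22 = 86)
print(round(res, 3))
```

Larger values of each measure indicate a stronger association between the two words, which is why results are sorted in descending order.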

size

length of the collocation. Only bigram (size = 2) and trigram (size = 3) collocations are currently implemented. Can be c(2, 3) (or 2:3) to return both bigram and trigram collocations.
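
As a rough illustration of what size controls, the base R sketch below lists the adjacent token sequences that would be candidates for size = 2:3. The helper `adjacent_ngrams` is hypothetical and is not how quanteda forms its candidates internally; it only shows that a sequence of J tokens yields J - size + 1 candidates of each size.

```r
# Hypothetical helper (not part of quanteda): all runs of `size` adjacent tokens.
adjacent_ngrams <- function(tokens, size) {
  if (length(tokens) < size) return(character(0))
  starts <- seq_len(length(tokens) - size + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + size - 1)], collapse = " "),
         character(1))
}

toks <- c("this", "is", "software", "testing")
# Bigrams first, then trigrams, mirroring size = 2:3
candidates <- unlist(lapply(2:3, function(s) adjacent_ngrams(toks, s)))
print(candidates)
```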

n

the number of collocations to return, sorted in descending order of the requested statistic, or \(G^2\) if none is specified.

tolower

convert collocations to lower case if TRUE (default)

punctuation

how to handle tokens separated by punctuation characters. Options are:

dontspan

do not form collocations from tokens separated by punctuation characters (default)

ignore

ignore punctuation characters when forming collocations, meaning that collocations will include those separated by punctuation characters in the text. The punctuation characters themselves are not returned.

include

do not treat punctuation characters specially, meaning that collocations will include punctuation characters as tokens
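
The difference between the three options can be sketched in base R with a deliberately simple tokenizer. The helper `candidate_bigrams` below is hypothetical and only approximates the behaviour described above; quanteda's own tokenizer and matching rules may differ in detail.

```r
# Hypothetical helper (not part of quanteda): candidate bigrams under each
# punctuation mode, using a simple regex tokenizer.
candidate_bigrams <- function(text,
                              punctuation = c("dontspan", "ignore", "include")) {
  punctuation <- match.arg(punctuation)
  # Split into runs of alphanumerics and single punctuation marks
  toks <- regmatches(text, gregexpr("[[:alnum:]]+|[[:punct:]]", text))[[1]]
  toks <- tolower(toks)
  is_punct <- grepl("^[[:punct:]]$", toks)
  if (punctuation == "ignore") {
    # Drop punctuation first, so words on either side become neighbours
    toks <- toks[!is_punct]
    is_punct <- rep(FALSE, length(toks))
  }
  if (length(toks) < 2) return(character(0))
  left <- seq_len(length(toks) - 1)
  keep <- rep(TRUE, length(left))
  if (punctuation == "dontspan") {
    # Keep a pair only if neither member is a punctuation mark
    keep <- !is_punct[left] & !is_punct[left + 1]
  }
  paste(toks[left], toks[left + 1])[keep]
}

txt <- "software testing: looking for pairs"
b_dontspan <- candidate_bigrams(txt, "dontspan")
b_ignore   <- candidate_bigrams(txt, "ignore")
b_include  <- candidate_bigrams(txt, "include")
print(b_dontspan)  # no pair crosses the colon
print(b_ignore)    # adds "testing looking"
print(b_include)   # adds "testing :" and ": looking"
```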

...

additional parameters passed to tokens

Value

a collocations class object: a specially classed data.table consisting of collocations, their frequencies, and the computed association measure(s).

References

McInnes, B T. 2004. "Extending the Log Likelihood Measure to Improve Collocation Identification." M.Sc. Thesis, University of Minnesota.

See Also

tokens_ngrams

Examples

# NOT RUN {
txt <- c("This is software testing: looking for (word) pairs!  
         This [is] a software testing again. For.",
         "Here: this is more Software Testing, looking again for word pairs.")
collocations(txt, punctuation = "dontspan") # default
collocations(txt, punctuation = "dontspan", remove_punct = TRUE)  # includes "testing looking"
collocations(txt, punctuation = "ignore", remove_punct = TRUE)    # same as previous 
collocations(txt, punctuation = "include", remove_punct = FALSE)  # keep punctuation as tokens

collocations(txt, size = 2:3)
removeFeatures(collocations(txt, size = 2:3), stopwords("english"))

collocations("@textasdata We really, really love the #quanteda package - thanks!!")
collocations("@textasdata We really, really love the #quanteda package - thanks!!",
              remove_twitter = TRUE)

collocations(data_corpus_inaugural[49:57], n = 10)
collocations(data_corpus_inaugural[49:57], method = "all", n = 10)
collocations(data_corpus_inaugural[49:57], method = "chi2", size = 3, n = 10)
collocations(corpus_subset(data_corpus_inaugural, Year>1980), method = "pmi", size = 3, n = 10)
# }