quanteda (version 0.9.9-50)

textstat_collocations: calculate collocation statistics

Description

Identify and score collocations from a corpus, character, or tokens object, with targeted selection.

Usage

textstat_collocations(x, method = c("lr", "chi2", "pmi", "dice", "bj"),
  max_size = 3, min_count = 2, ...)

is.collocations(x)

Arguments

x
a character, corpus, or tokens object to be mined for collocations
method
association measure for detecting collocations. Let \(i\) index documents and \(j\) index features; \(n_{ij}\) denotes the observed counts and \(m_{ij}\) the expected counts in a collocations frequency table of dimensions \((J - size + 1)^2\). The available measures are computed as follows:
"lr"
The likelihood ratio statistic \(G^2\), computed as: $$2 \sum_i \sum_j n_{ij} \log \frac{n_{ij}}{m_{ij}}$$
"chi2"
Pearson's \(\chi^2\) statistic, computed as: $$\sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$
"pmi"
pointwise mutual information score, computed as \(\log (n_{11}/m_{11})\)
"dice"
the Dice coefficient, computed as \(n_{11}/(n_{1.} + n_{.1})\)
"bj"
Blaheta and Johnson's method (called through sequences)
max_size
numeric argument representing the maximum length of the collocations to be scored. The maximum size is currently 3 for all methods except "bj", which has a maximum size of 5.
min_count
minimum frequency of collocations that will be scored
...
additional arguments passed to collocations2 for the first four methods, or to sequences for method = "bj"
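
To make the first four measures concrete, here is a minimal base-R sketch that evaluates the formulas above on a single 2 x 2 table for one bigram. It is not the package's implementation: the counts are invented for illustration, the expected counts are assumed to come from the usual independence model \(m_{ij} = n_{i.} n_{.j} / n_{..}\), and the Dice denominator is read as the sum of the two marginals.

```r
# Hypothetical observed counts for a bigram "w1 w2" (illustrative numbers only):
# rows = first token is w1 / not w1, columns = second token is w2 / not w2
n <- matrix(c(8,  2,
              4, 86),
            nrow = 2, byrow = TRUE,
            dimnames = list(first  = c("w1", "not_w1"),
                            second = c("w2", "not_w2")))

# Expected counts under independence (an assumption): m_ij = n_i. * n_.j / n_..
m <- outer(rowSums(n), colSums(n)) / sum(n)

# "lr": G^2 = 2 * sum(n_ij * log(n_ij / m_ij)); empty cells contribute 0
lr <- 2 * sum(ifelse(n > 0, n * log(n / m), 0))

# "chi2": Pearson's chi-squared statistic
chi2 <- sum((n - m)^2 / m)

# "pmi": log of observed over expected count in the joint cell
pmi <- log(n[1, 1] / m[1, 1])

# "dice": joint count over the sum of the two marginals, as documented above
# (the textbook Dice coefficient carries an extra factor of 2 in the numerator)
dice <- n[1, 1] / (sum(n[1, ]) + sum(n[, 1]))

round(c(lr = lr, chi2 = chi2, pmi = pmi, dice = dice), 3)
```

All four scores increase as the joint count `n[1, 1]` exceeds its expected value, so a bigram that co-occurs far more often than chance ranks highly under each measure.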

Value

is.collocations returns TRUE if the object is of class collocations, FALSE otherwise.

Details

is.collocations checks whether an object is a collocations object.

References

Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. http://web.science.mq.edu.au/~mjohnson/papers/2001/dpb-colloc01.pdf. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.

McInnes, B. T. (2004). Extending the Log Likelihood Measure to Improve Collocation Identification. M.Sc. thesis, University of Minnesota.

Examples

txts <- c("quanteda is a package for quantitative text analysis", 
          "quantitative text analysis is a rapidly growing field", 
          "The population is rapidly growing")
toks <- tokens(txts)
textstat_collocations(toks, method = "lr")
textstat_collocations(toks, method = "lr", min_count = 1)
textstat_collocations(toks, method = "lr", max_size = 3, min_count = 1)
(cols <- textstat_collocations(toks, method = "lr", max_size = 3, min_count = 2))
as.tokens(cols)

# extracting multi-part proper nouns (capitalized terms)
toks2 <- tokens(corpus_segment(data_corpus_inaugural, what = "sentence"))
toks2 <- tokens_select(toks2, stopwords("english"), "remove", padding = TRUE)
seqs <- textstat_collocations(toks2, method = "bj", 
                              features = "^([A-Z][a-z\\-]{2,})", 
                              valuetype = "regex", case_insensitive = FALSE)
head(seqs, 10)

# compounding tokens is more efficient when applied to the same tokens object 
toks_comp <- tokens_compound(toks2, seqs)
