textstat_collocations: calculate collocation statistics

Description

Identify and score collocations from a corpus, character, or tokens object, with targeted selection.


Usage

textstat_collocations(x, method = c("lr", "chi2", "pmi", "dice", "bj"),
  max_size = 3, min_count = 2, ...)
is.collocations(x)

Arguments

x

a character, corpus, or tokens object to be mined for collocations

method

association measure for detecting collocations. Let $i$ index documents and $j$ index features; $n_{ij}$ denotes the observed counts and $m_{ij}$ the expected counts in a collocations frequency table of dimensions $(J - size + 1)^2$. The available measures are computed as:

"lr": the likelihood ratio statistic $G^2$, computed as: $$2 \sum_i \sum_j n_{ij} \log \frac{n_{ij}}{m_{ij}}$$
"chi2": Pearson's $\chi^2$ statistic, computed as: $$\sum_i \sum_j \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$
"pmi": pointwise mutual information score, computed as $\log \frac{n_{11}}{m_{11}}$
"dice": the Dice coefficient, computed as $\frac{n_{11}}{n_{1.} + n_{.1}}$
"bj": Blaheta and Johnson's method (called through sequences)
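
As a rough illustration of the first four measures, the sketch below evaluates them in base R on a single 2 x 2 table of observed counts for one candidate bigram. The counts are hypothetical, the expected counts are taken under independence of the two positions, and the Dice line reads the denominator as $n_{1.} + n_{.1}$. This is not the package's internal code, only a hand computation of the formulas listed here.

```r
# Hypothetical observed counts for a candidate bigram (w1, w2):
# rows:    first token is w1 / is not w1
# columns: second token is w2 / is not w2
n <- matrix(c(8, 2,
              4, 86), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
m <- outer(rowSums(n), colSums(n)) / sum(n)

# "lr": likelihood ratio statistic G^2 (cells with zero count contribute zero)
g2 <- 2 * sum(ifelse(n > 0, n * log(n / m), 0))

# "chi2": Pearson's chi-squared statistic
chi2 <- sum((n - m)^2 / m)

# "pmi": log ratio of observed to expected count in the (1, 1) cell
pmi <- log(n[1, 1] / m[1, 1])

# "dice": (1, 1) cell over the sum of the first row and first column totals
dice <- n[1, 1] / (sum(n[1, ]) + sum(n[, 1]))

scores <- c(lr = g2, chi2 = chi2, pmi = pmi, dice = dice)
print(round(scores, 3))
```

All four scores grow as the observed (1, 1) count exceeds its expected value, which is why any of them can be used to rank candidate collocations.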

max_size

numeric argument representing the maximum length of the collocations to be scored. The maximum size is currently 3 for all methods except "bj", which has a maximum size of 5.

min_count

minimum frequency of collocations that will be scored
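
To make the effect of a frequency threshold concrete, here is a minimal base R sketch that counts adjacent token pairs in two toy texts and keeps only those seen at least min_count times. The whitespace tokenization and the variable names are assumptions made for illustration; the function's own counting may differ in detail.

```r
# Hypothetical toy input, tokenized on single spaces with base R only
txts <- c("quantitative text analysis is a rapidly growing field",
          "quanteda is a package for quantitative text analysis")
toks <- strsplit(tolower(txts), " ", fixed = TRUE)

# Adjacent token pairs within each document (no pairs across documents)
bigrams <- unlist(lapply(toks, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))

# Frequency of each distinct pair
counts <- table(bigrams)

# Keep only pairs that reach the threshold; only these would go on to scoring
min_count <- 2
candidates <- counts[counts >= min_count]
print(candidates)
```

Raising the threshold shrinks the candidate set, which trades recall for more stable association scores on the pairs that remain.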

...

additional arguments passed to collocations2 for the first four methods, or to sequences for method = "bj"

Value

is.collocations returns TRUE if the object is of class collocations, FALSE otherwise.

Details


is.collocations checks whether an object is a collocations object.

References

Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. http://web.science.mq.edu.au/~mjohnson/papers/2001/dpb-colloc01.pdf. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.

McInnes, B. T. (2004). Extending the Log Likelihood Measure to Improve Collocation Identification. M.Sc. Thesis, University of Minnesota.

Examples


txts <- c("quanteda is a package for quantitative text analysis", 
          "quantitative text analysis is a rapidly growing field", 
          "The population is rapidly growing")
toks <- tokens(txts)
textstat_collocations(toks, method = "lr")
textstat_collocations(toks, method = "lr", min_count = 1)
textstat_collocations(toks, method = "lr", max_size = 3, min_count = 1)
(cols <- textstat_collocations(toks, method = "lr", max_size = 3, min_count = 2))
as.tokens(cols)

# extracting multi-part proper nouns (capitalized terms)
toks2 <- tokens(corpus_segment(data_corpus_inaugural, what = "sentence"))
toks2 <- tokens_select(toks2, stopwords("english"), "remove", padding = TRUE)
seqs <- textstat_collocations(toks2, method = "bj", 
                              features = "^([A-Z][a-z\\-]{2,})", 
                              valuetype = "regex", case_insensitive = FALSE)
head(seqs, 10)

# compounding tokens is more efficient when applied to the same tokens object 
toks_comp <- tokens_compound(toks2, seqs)
