ll: Compute Log-likelihood Statistics.

Description

Apply the log-likelihood statistic to detect cooccurrences or keywords.

Usage

ll(.Object, ...)
# S4 method for features
ll(.Object)
# S4 method for context
ll(.Object)
# S4 method for cooccurrences
ll(.Object)
# S4 method for Cooccurrences
ll(.Object, verbose = TRUE)

Arguments

.Object: An object of class cooccurrence, context, or features.
...: Further arguments (such as verbose).
verbose: Logical, whether to output messages.

Details

The log-likelihood test to detect cooccurrences is a standard approach to find collocations (Dunning 1993, Evert 2005, 2009).

(a) The basis for computing for the log-likelihood statistic is a contingency table of observationes, which is prepared for every single token in the corpus. It reports counts for a token to inspect and all other tokens in a corpus of interest (coi) and a reference corpus (ref):

	coi	ref	TOTAL
count token	$o_{11}$	$o_{12}$	$r_{1}$
other tokens	$o_{21}$	$o_{22}$	$r_{2}$
TOTAL	$c_{1}$	$c_{2}$	N

(b) Based on the contingency table(s) with observed counts, expected values are calculated for each cell, as the product of the column and margin sums, divided by the overall number of tokens (see example).

(c) The standard formula for calculating the log-likelihood test is as follows. $$G^{2} = 2 \sum{O_{ij} log(\frac{O_{ij}}{E_{ij}})}$$ Note: Before polmineR v0.7.11, a simplification of the formula was used (Rayson/Garside 2000), which omits the third and fourth term of the previous formula: $$ll = 2(o_{11} log (\frac{o_{11}}{E_{11}}) + o_{12} log(\frac{o_{12}}{E_{12}}))$$ There is a (small) gain of computational efficiency using this simplified formula and the result is almost identical with the standard formula; see however the critical discussion of Ulrike Tabbert (2015: 84ff).

The implementation in the ll-method uses a vectorized approach of the computation, which is substantially faster than iterating the rows of a table, generating individual contingency tables etc. As using the standard formula is not significantly slower than relying on the simplified formula, polmineR has moved to the standard computation.

An inherent difficulty of the log likelihood statistic is that it is not possible to compute the statistical test value if the number of observed counts in the reference corpus is 0, i.e. if a term only occurrs exclusively in the neighborhood of a node word. When filtering out rare words from the result table, respective NA values will usually disappear.

References

Dunning, Ted (1993): Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, Vol. 19, No. 1, pp. 61-74.

Rayson, Paul; Garside, Roger (2000): Comparing Corpora using Frequency Profiling. The Workshop on Comparing Corpora. https://aclanthology.org/W00-0901/.

Evert, Stefan (2005): The Statistics of Word Cooccurrences. Word Pairs and Collocations. URN urn:nbn:de:bsz:93-opus-23714. https://elib.uni-stuttgart.de/bitstream/11682/2573/1/Evert2005phd.pdf

Evert, Stefan (2009). Corpora and Collocations. In: A. Ludeling and M. Kyto (eds.), Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin, pp. 1212-1248 (ch. 58).

Tabbert, Ulrike (2015): Crime and Corpus. The Linguistic Representation of Crime in the Press. Amsterdam: Benjamins.

Examples

Run this code

# use ll-method explicitly
oil <- cooccurrences("REUTERS", query = "oil", method = NULL)
oil <- ll(oil)
oil_min <- subset(oil, count_coi >= 3)
if (interactive()) View(format(oil_min))
summary(oil)

# use ll-method on 'Cooccurrences'-object
if (FALSE) {
R <- Cooccurrences("REUTERS", left = 5L, right = 5L, p_attribute = "word")
ll(R)
decode(R)
summary(R)
}

# use log likelihood test for feature extraction
x <- partition(
  "GERMAPARLMINI", speaker = "Merkel",
  interjection = "speech", regex = TRUE,
  p_attribute = "word"
)
f <- features(x, y = "GERMAPARLMINI", included = TRUE, method = "ll")
f <- features(x, y = "GERMAPARLMINI", included = TRUE, method = NULL)
f <- ll(f)
summary(f)

if (FALSE) {

# A sample do-it-yourself calculation for log-likelihood:
# Compute ll-value for query "oil", and "prices"

oil <- context("REUTERS", query = "oil", left = 5, right = 5)

# (a) prepare matrix with observed values
o <- matrix(data = rep(NA, 4), ncol = 2) 
o[1,1] <- as(oil, "data.table")[word == "prices"][["count_coi"]]
o[1,2] <- count("REUTERS", query = "prices")[["count"]] - o[1,1]
o[2,1] <- size(oil)[["coi"]] - o[1,1]
o[2,2] <- size(oil)[["ref"]] - o[1,2]


# (b) prepare matrix with expected values, calculate margin sums first
r <- rowSums(o)
c <- colSums(o)
N <- sum(o)

e <- matrix(data = rep(NA, 4), ncol = 2) # matrix with expected values
e[1,1] <- r[1] * (c[1] / N)
e[1,2] <- r[1] * (c[2] / N)
e[2,1] <- r[2] * (c[1] / N)
e[2,2] <- r[2] * (c[2] / N)


# (c) compute log-likelihood value
ll_value <- 2 * (
  o[1,1] * log(o[1,1] / e[1,1]) +
  o[1,2] * log(o[1,2] / e[1,2]) +
  o[2,1] * log(o[2,1] / e[2,1]) +
  o[2,2] * log(o[2,2] / e[2,2])
)

df <- as.data.frame(cooccurrences("REUTERS", query = "oil"))
subset(df, word == "prices")[["ll"]]
}

Run the code above in your browser using DataLab

	coi	ref	TOTAL
count token	\(o_{11}\)	\(o_{12}\)	\(r_{1}\)
other tokens	\(o_{21}\)	\(o_{22}\)	\(r_{2}\)
TOTAL	\(c_{1}\)	\(c_{2}\)	N