
polmineR (version 0.8.9)

t_test: Perform t-test.

Description

Compute t-scores to find collocations.

Usage

t_test(.Object)

# S4 method for context
t_test(.Object)

Arguments

.Object

A context or features object

Details

The calculation of the t-test is based on the formula $$t = \frac{\overline{x} - \mu}{\sqrt{\frac{s^2}{N}}}$$ where \(\mu\) is the mean of the distribution, \(\overline{x}\) the sample mean, \(s^2\) the sample variance, and \(N\) the sample size.

Following Manning and Schuetze (1999), to test whether two tokens (a and b) are a collocation, the sample mean \(\overline{x}\) is the number of observed co-occurrences of a and b divided by corpus size N: $$\overline{x} = \frac{o_{ab}}{N}$$

For the mean of the distribution \(\mu\), maximum likelihood estimates are used. Given the number of observations of token a, \(o_{a}\), the number of observations of token b, \(o_{b}\), and the size of the corpus N, the probabilities of the tokens a and b, and of the co-occurrence of a and b, are as follows if independence is assumed: $$P(a) = \frac{o_{a}}{N}$$ $$P(b) = \frac{o_{b}}{N}$$ $$\mu = P(ab) = P(a)P(b)$$
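
As a sketch of how these pieces combine: substitute the observed co-occurrence proportion \(o_{ab}/N\) for the sample mean and \(P(a)P(b)\) for the mean expected under independence. The variance term is not spelled out above. The version below assumes the Bernoulli approximation \(s^2 = p(1 - p) \approx p\) for small \(p\), with \(p = o_{ab}/N\), which is what the `sqrt(x / size("REUTERS"))` term in the examples corresponds to: $$t \approx \frac{\frac{o_{ab}}{N} - \frac{o_{a}}{N} \cdot \frac{o_{b}}{N}}{\sqrt{\frac{o_{ab}}{N^2}}} = \frac{o_{ab} - \frac{o_{a} o_{b}}{N}}{\sqrt{o_{ab}}}$$ The right-hand form shows that, under this approximation, the t-score is the difference between observed and expected co-occurrence counts, scaled by the square root of the observed count.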

See the examples for a sample calculation of the t-test, and Evert (2005: 83) for a critical discussion of the "highly questionable" assumptions when using the t-test for detecting co-occurrences.
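
The calculation can also be tried without a corpus. The following is a minimal sketch in base R: `t_score_from_counts()` is a hypothetical helper written for illustration, not a polmineR function, and it assumes the small-probability approximation of the variance (\(s^2 \approx o_{ab}/N\)). The counts are those of the "new companies" example in Manning and Schuetze (1999).

```r
# Hypothetical helper (not part of polmineR): t-score computed from raw counts.
# Assumes the approximation s^2 ~ observed proportion (valid for small p).
t_score_from_counts <- function(o_ab, o_a, o_b, n) {
  observed <- o_ab / n              # sample mean: observed co-occurrence proportion
  expected <- (o_a / n) * (o_b / n) # distribution mean under independence
  (observed - expected) / sqrt(observed / n)
}

# Counts for "new companies" as reported by Manning and Schuetze (1999)
t_new_companies <- t_score_from_counts(
  o_ab = 8, o_a = 15828, o_b = 4675, n = 14307668
)

# Critical value for alpha = 0.005, normal approximation (about 2.576)
critical_value <- qnorm(1 - 0.005)

t_new_companies                   # roughly 0.9999
t_new_companies > critical_value  # FALSE: independence is not rejected
```

With these counts the t-score stays well below the critical value, so the pair would not be treated as a collocation by this test.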

References

Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. MIT Press: Cambridge, Mass., pp. 163-166.

Church, Kenneth W. et al. (1991): Using Statistics in Lexical Analysis. In: Uri Zernik (ed.), Lexical Acquisition. Hillsdale, NJ: Lawrence Erlbaum, pp. 115-164. https://doi.org/10.4324/9781315785387-8

Evert, Stefan (2005): The Statistics of Word Cooccurrences. Word Pairs and Collocations. URN urn:nbn:de:bsz:93-opus-23714. https://elib.uni-stuttgart.de/bitstream/11682/2573/1/Evert2005phd.pdf

See Also

Other statistical methods: chisquare(), ll(), pmi()

Examples

use("polmineR")
y <- cooccurrences("REUTERS", query = "oil", left = 1L, right = 0L, method = "t_test")
# The critical value (for alpha = 0.005) is 2.576, so "crude" is a collocation
# of "oil" according to the t-test.

# A sample calculation
count_oil <- count("REUTERS", query = "oil")
count_crude <- count("REUTERS", query = "crude")
count_crude_oil <- count("REUTERS", query = '"crude" "oil"', cqp = TRUE)

p_crude <- count_crude$count / size("REUTERS")
p_oil <- count_oil$count / size("REUTERS")
p_crude_oil <- p_crude * p_oil

x <- count_crude_oil$count / size("REUTERS")

# the variance is approximated by the observed proportion x
t_value <- (x - p_crude_oil) / sqrt(x / size("REUTERS"))
t_value
# should be identical with the result computed by cooccurrences():
as.data.frame(subset(y, word == "crude"))$t_test
