textstat_keyness: calculate keyness statistics

Description

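Calculate keyness statistics for the features of a dfm by comparing their frequencies in a target document with their combined frequencies in all other documents.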

Usage

textstat_keyness(x, target = 1L, measure = c("chi2", "exact", "lr"),
  sort = TRUE)

Arguments

x

a dfm containing the features to be examined for keyness

target

the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference

measure

(signed) association measure to be used for computing keyness. Currently available: "chi2" (\(\chi^2\) with Yates correction); "exact" (Fisher's exact test); "lr" (the likelihood ratio \(G\) statistic with Yates correction).

sort

logical; if TRUE, sort the scored features in descending order of the measure; otherwise leave them in their original feature order
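
As an illustration of how target and measure = "chi2" fit together, the following base R sketch builds the target and reference counts by hand and computes a signed, Yates-corrected chi-squared value for one feature. The count matrix, the feature names and the variable names are hypothetical, and the code does not claim to reproduce the internals of textstat_keyness; it only mirrors the description in the arguments above.

```r
# Toy feature counts: rows are documents, columns are features (hypothetical data)
counts <- matrix(c(10, 2,
                    3, 4,
                    1, 6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"), c("war", "peace")))

target <- 1L                                   # index of the target document
target_counts    <- counts[target, ]           # feature frequencies in the target
# all other documents' feature frequencies are combined to form the reference
reference_counts <- colSums(counts[-target, , drop = FALSE])

# 2 x 2 table for one feature: this feature vs. all others, target vs. reference
feature <- "war"
tab <- matrix(c(target_counts[feature],
                sum(target_counts) - target_counts[feature],
                reference_counts[feature],
                sum(reference_counts) - reference_counts[feature]),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("target", "reference"), c(feature, "other")))

test <- chisq.test(tab, correct = TRUE)        # correct = TRUE applies the Yates correction

# Sign the statistic: positive if the observed target count exceeds its expected value
direction   <- ifelse(tab[1, 1] > test$expected[1, 1], 1, -1)
signed_chi2 <- unname(test$statistic) * direction
p_value     <- test$p.value
```

Here "war" occurs more often in the target than expected, so signed_chi2 is positive; a feature that is under-represented in the target would receive a negative value.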

Value

a data.frame of computed statistics and associated p-values, with one row per scored feature (the feature is the row name), together with the number of occurrences of that feature in the target and reference groups. The statistic reported depends on measure: for "chi2" it is the chi-squared value, signed positively if the observed value in the target exceeds its expected value; for "exact" it is the estimate of the odds ratio; for "lr" it is the likelihood ratio \(G\) statistic.
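
To make the "exact" and "lr" values more concrete, here is a small base R sketch for a single feature. The 2 x 2 table is hypothetical, the \(G\) statistic is computed without any continuity correction (a simplification relative to the corrected version described under measure), and signing \(G\) by the direction of the target count is an assumption carried over from the description of "chi2", not a documented detail of the function.

```r
# Hypothetical 2 x 2 table: one feature vs. all other features, target vs. reference
tab <- matrix(c(10,  2,
                 4, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("target", "reference"), c("feature", "other")))

# measure = "exact": fisher.test() reports an estimate of the odds ratio and a p-value
exact      <- fisher.test(tab)
odds_ratio <- unname(exact$estimate)
exact_p    <- exact$p.value

# measure = "lr": uncorrected G = 2 * sum(observed * log(observed / expected))
# (assumes no zero cells, which would need special handling)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
G        <- 2 * sum(tab * log(tab / expected))

# Assumed signing convention: positive if the target count exceeds its expected value
signed_G <- G * ifelse(tab[1, 1] > expected[1, 1], 1, -1)
lr_p     <- pchisq(G, df = 1, lower.tail = FALSE)   # reference distribution: chi-squared, 1 df
```

An odds ratio above 1 and a positive signed_G both indicate that the feature is over-represented in the target relative to the reference.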

References

Bondi, Marina, and Mike Scott, eds. 2010. Keyness in Texts. Amsterdam, Philadelphia: John Benjamins.

Stubbs, Michael. 2010. "Three Concepts of Keywords". In Keyness in Texts, Marina Bondi and Mike Scott, eds. pp. 21–42. Amsterdam, Philadelphia: John Benjamins.

Scott, M. & Tribble, C. 2006. Textual Patterns: keyword and corpus analysis in language education. Amsterdam: Benjamins, p. 55.

Dunning, Ted. 1993. "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, Vol 19, No. 1, pp. 61-74.

Examples

# compare pre- v. post-war terms using grouping
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
mydfm <- dfm(data_corpus_inaugural, groups = period)
head(mydfm) # make sure 'post-war' is in the first row
head(result <- textstat_keyness(mydfm), 10)
tail(result, 10)

# compare pre- v. post-war terms using logical vector
mydfm2 <- dfm(data_corpus_inaugural)
textstat_keyness(mydfm2, docvars(data_corpus_inaugural, "Year") >= 1945)

# compare Trump 2017 to other post-war presidents
pwdfm <- dfm(corpus_subset(data_corpus_inaugural, period == "post-war"))
head(textstat_keyness(pwdfm, target = "2017-Trump"), 10)
# using the likelihood ratio method
head(textstat_keyness(dfm_smooth(pwdfm), measure = "lr", target = "2017-Trump"), 10)
