Compute best-practice keyness measures (according to Evert 2022) for the frequency comparison of lexical items in two corpora. The function is fully vectorised and should be applied to a complete data set of candidate items, so that the statistical analysis can be adjusted to control the family-wise error rate.
keyness(f1, n1, f2, n2, measure=c("LRC", "PositiveLRC", "G2", "LogRatio", "SimpleMaths"),
conf.level=.95, alpha=NULL, p.adjust=TRUE, lambda=1)
A numeric vector of the same length as f1 and f2, containing keyness scores for all candidate lexical items.
For most measures, positive scores indicate positive keywords (i.e. higher frequency in the population underlying corpus A)
and negative scores indicate negative keywords (i.e. higher frequency in the population underlying corpus B).
If alpha is specified, non-significant candidates always have a score of 0.
a numeric vector specifying the frequencies of candidate items in corpus A (target corpus)
sample size of target corpus, i.e. the total number of tokens in corpus A (usually a scalar, but can also be a vector parallel to f1)
a numeric vector parallel to f1, specifying the frequencies of candidate items in corpus B (reference corpus)
sample size of reference corpus, i.e. the total number of tokens in corpus B (usually a scalar, but can also be a vector parallel to f2)
the keyness measure to be computed (see “Details” below)
the desired confidence level for the LRC and PositiveLRC measures (defaults to 95%)
if specified, filter out candidate items whose frequency difference between \(f_1\) and \(f_2\) is not significant at level \(\alpha\). This is achieved by setting the score of such candidates to 0.
if TRUE, apply a Bonferroni correction in order to control the family-wise error rate across all tests carried out in a single function call (i.e. the common length of f1 and f2).
Alternatively, the desired family size can be specified instead of TRUE (useful if a larger data set is processed in batches).
The adjustment applies both to the significance filter (alpha) and to the confidence intervals (conf.level) underlying the LRC and PositiveLRC measures.
parameter \(\lambda\) of the SimpleMaths measure.
Stephanie Evert (https://purl.org/stephanie.evert)
This function computes a range of best-practice keyness measures comparing the relative frequencies
\(\pi_1\) and \(\pi_2\) of lexical items in populations (i.e. sublanguages) A and B,
based on the observed sample frequencies \(f_1, f_2\) and the corresponding sample sizes \(n_1, n_2\).
The function is fully vectorised with respect to arguments f1, f2, n1 and n2,
but only a single keyness measure can be selected for each function call.
All implemented measures are robust for the corner cases \(f_1 = 0\) and \(f_2 = 0\), but \(f_1 = f_2 = 0\) is not allowed.
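For example, a single vectorised call scores several candidates at once, including the \(f_i = 0\) corner cases (the frequencies are hypothetical):

keyness(c(7, 12, 0), 1000, c(2, 1, 5), 1000, measure="LRC")  # returns three scores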
Most of the keyness measures are directional,
i.e. positive scores indicate positive keyness in A (\(\pi_1 > \pi_2\))
and negative scores indicate negative keyness in A (\(\pi_1 < \pi_2\)).
By contrast, the one-sided measures PositiveLRC and SimpleMaths only detect positive keyness in A, returning small (and possibly negative) scores otherwise, i.e. in case of insufficient evidence for \(\pi_1 > \pi_2\) and in case of strong evidence for \(\pi_1 < \pi_2\).
One-sided measures can be useful for ranking the entire data set as positive keyword candidates.
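For illustration, consider hypothetical frequencies with clear negative keyness in A: the directional LRC measure returns a negative score, whereas the one-sided PositiveLRC measure merely signals the absence of positive keyness with a score \(\leq 0\).

keyness(20, 10000, 70, 10000, measure="LRC")          # negative score
keyness(20, 10000, 70, 10000, measure="PositiveLRC")  # score <= 0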
Hardie (2014) and other authors recommend combining effect-size measures (in particular LogRatio) with a significance filter in order to weed out candidate items for which there is no significant evidence against the null hypothesis \(H_0: \pi_1 = \pi_2\). Such a filter is activated by specifying the desired significance level alpha, and can be combined with all keyness measures.
In this case, the scores of all non-significant candidate items are set to 0.
The decision is based on the likelihood-ratio test implemented by the G2 measure and its asymptotic \(\chi^2_1\) distribution under \(H_0\).
Note that the significance filter can also be applied to the G2 measure itself, setting all scores below the critical value for the significance test to 0.
For one-sided measures (PositiveLRC and SimpleMaths), candidates with significant evidence for negative keyness are also filtered out (i.e. their scores are set to 0) in order to ensure a consistent ranking.
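A hypothetical illustration of the filter (with p.adjust=FALSE so that the unadjusted level \(\alpha = .05\) applies to each single test): the first candidate should fall below the critical value and receive a score of 0, while the second should pass the filter and receive its LogRatio score.

keyness(7, 1000, 2, 1000, measure="LogRatio", alpha=.05, p.adjust=FALSE)      # n.s., score 0
keyness(70, 10000, 20, 10000, measure="LogRatio", alpha=.05, p.adjust=FALSE)  # significant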
By default, statistical inference corrects for multiple testing in order to control family-wise error rates.
This applies to the significance filter as well as to the confidence intervals underlying LRC and PositiveLRC.
Note that the G2 scores themselves are never adjusted (only the critical value for the significance filter is modified).
Family size \(m\) is automatically determined from the number of candidate items processed in a single function call.
Alternatively, the family size can be specified explicitly in the p.adjust argument, e.g. if a large data set is processed in multiple batches, or p.adjust=FALSE can be used to disable the correction.
For the adjustment, a highly conservative Bonferroni correction \(\alpha' = \alpha / m\) is applied to significance levels. This conservative approach is considered useful because the large candidate sets and sample sizes typical of corpus linguistics otherwise tend to produce large numbers of false positives.
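For example, if a data set is processed in two hypothetical batches batch1 and batch2 (data frames with frequency columns f1 and f2), the total family size can be passed explicitly so that both calls apply the same Bonferroni correction:

m <- nrow(batch1) + nrow(batch2)  # family size = total number of candidates
scores1 <- keyness(batch1$f1, n1, batch1$f2, n2, alpha=.05, p.adjust=m)
scores2 <- keyness(batch2$f1, n1, batch2$f2, n2, alpha=.05, p.adjust=m)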
See Evert (2022) and its supplementary materials for a more detailed discussion of the implemented best-practice measures and some alternatives.
G2
The log-likelihood measure (Rayson & Garside 2000: 3) computes the score \(G^2\) of a likelihood-ratio test for \(H_0: \pi_1 = \pi_2\). This test is two-sided and always returns positive values, so the sign of its score is inverted for \(f_1 / n_1 < f_2 / n_2\) in order to obtain a directional keyness measure. As a pure significance measure, it tends to prefer high-frequency candidates with large \(f_1\).
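The score can also be obtained by hand from the binomial log-likelihoods. The sketch below is not the package's internal implementation, but it should reproduce keyness(7, 1000, 2, 1000, measure="G2") up to the directional sign inversion:

G2.manual <- function (f1, n1, f2, n2) {
  p <- (f1 + f2) / (n1 + n2)   # pooled MLE under H0: pi1 = pi2
  ll <- function (f, n) {      # log-likelihood term, robust for f = 0 and f = n
    term1 <- if (f > 0) f * log(f / (n * p)) else 0
    term2 <- if (f < n) (n - f) * log((n - f) / (n * (1 - p))) else 0
    term1 + term2
  }
  2 * (ll(f1, n1) + ll(f2, n2))
}
G2.manual(7, 1000, 2, 1000)  # approx. 2.95, below the critical value of 3.84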
LogRatio
A point estimate of the log relative risk \(\log_2 (\pi_1 / \pi_2)\), which has been suggested as an intuitive keyness measure under the name LogRatio by Hardie (2014: 45). The implementation uses Walter's (1975) adjusted estimator $$ \log_2 \dfrac{f_1 + \frac12}{n_1 + \frac12} - \log_2 \dfrac{f_2 + \frac12}{n_2 + \frac12} $$ which is less biased and robust against \(f_i = 0\). As a pure effect-size measure, LogRatio tends to assign spuriously high scores to low-frequency candidates with small \(f_1\) and \(f_2\) (due to sampling variation). Combination with a significance filter is strongly recommended.
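For example, the adjusted estimate for \(f_1 = 7\), \(f_2 = 2\) and \(n_1 = n_2 = 1000\) can be verified directly:

log2((7 + 0.5) / (1000 + 0.5)) - log2((2 + 0.5) / (1000 + 0.5))  # = log2(3), approx. 1.585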
LRC
(default) A conservative estimate for LogRatio recommended by Evert (2022) in order to combine and balance the advantages of effect-size and significance measures.
A confidence interval (according to the specified conf.level) for the relative risk \(r = \pi_1 / \pi_2\) is obtained from an exact conditional Poisson test (Fay 2010: 55), adjusted for multiple testing by default.
If a candidate is not significant (i.e. the confidence interval includes \(H_0: r = 1\)), its score is set to 0.
Otherwise the boundary of the confidence interval closer to 1 is taken as a conservative directional estimate
of \(r\) and its \(\log_2\) is returned.
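The following sketch illustrates the idea under simplifying assumptions: it uses binom.test for the exact conditional interval instead of the package's prop.cint and omits the multiple-testing adjustment.

lrc.sketch <- function (f1, n1, f2, n2, conf.level=.95) {
  ## conditional on k = f1 + f2, f1 ~ Binomial(k, theta) with
  ## theta = r * n1 / (r * n1 + n2), hence r = theta / (1 - theta) * n2 / n1
  ci.theta <- binom.test(f1, f1 + f2, conf.level=conf.level)$conf.int
  ci.r <- ci.theta / (1 - ci.theta) * (n2 / n1)  # monotone transformation
  if (ci.r[1] > 1) log2(ci.r[1])       # significant positive keyness
  else if (ci.r[2] < 1) log2(ci.r[2])  # significant negative keyness
  else 0                               # interval includes r = 1: score 0
}
lrc.sketch(7, 1000, 2, 1000)  # n.s. at the 95% level, hence 0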
PositiveLRC
A one-sided variant of LRC, which returns the lower boundary of a one-sided confidence interval for \(\log_2 r\). Scores \(\leq 0\) indicate that there is no significant evidence for positive keyness. The directional version of LRC is recommended for general use, but PositiveLRC may be preferred if the hermeneutic interpretation should also consider non-significant candidates (especially with small data sets).
SimpleMaths
The simple maths keyness measure (Kilgarriff 2009) used by the commercial corpus analysis platform Sketch Engine: $$ \dfrac{10^6 \cdot \frac{f_1}{n_1} + \lambda}{10^6 \cdot \frac{f_2}{n_2} + \lambda} $$ Its frequency bias can be adjusted with the user parameter \(\lambda > 0\). The scaling factor \(10^6\) was chosen so that \(\lambda = 1\) is a practical default value.
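By way of illustration, the score for \(f_1 = 7\), \(f_2 = 2\), \(n_1 = n_2 = 1000\) and \(\lambda = 1\) can be computed directly:

(1e6 * 7/1000 + 1) / (1e6 * 2/1000 + 1)  # = 7001 / 2001, approx. 3.5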
There does not appear to be a convincing mathematical justification behind this measure. It is included here only because of the popularity of the Sketch Engine platform.
Evert, S. (2022). Measuring keyness. In Digital Humanities 2022: Conference Abstracts, pages 202-205, Tokyo, Japan / online. https://osf.io/cy6mw/
Fay, M. P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1), 53-58.
Hardie, A. (2014). A single statistical technique for keywords, lockwords, and collocations. Internal CASS working paper no. 1, unpublished.
Kilgarriff, A. (2009). Simple maths for keywords. In Proceedings of the Corpus Linguistics 2009 Conference, Liverpool, UK.
Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the ACL Workshop on Comparing Corpora, pages 1-6, Hong Kong.
Walter, S. D. (1975). The distribution of Levin’s measure of attributable risk. Biometrika, 62(2): 371-374.
prop.cint, which is used by the exact conditional Poisson test of the LRC measure
# compute all keyness measures for a single candidate item with f1=7, f2=2 and n1=n2=1000
keyness(7, 1000, 2, 1000, measure="G2") # log-likelihood
keyness(7, 1000, 2, 1000, measure="LogRatio")
keyness(7, 1000, 2, 1000, measure="LogRatio", alpha=0.05) # with significance filter
keyness(7, 1000, 2, 1000, measure="LRC") # the default measure
keyness(7, 1000, 2, 1000, measure="PositiveLRC")
keyness(7, 1000, 2, 1000, measure="SimpleMaths")
# a practical example: keywords of spoken British English (from BNC corpus)
n1 <- sum(BNCcomparison$spoken) # sample sizes
n2 <- sum(BNCcomparison$written)
kw <- transform(BNCcomparison,
G2 = keyness(spoken, n1, written, n2, measure="G2"),
LogRatio = keyness(spoken, n1, written, n2, measure="LogRatio"),
LRC = keyness(spoken, n1, written, n2))
kw <- kw[order(-kw$LogRatio), ]
head(kw, 20)
# collocations of "in charge of" with LRC as an association measure
colloc <- transform(BNCInChargeOf,
PosLRC = keyness(f.in, N.in, f.out, N.out, measure="PositiveLRC"))
colloc <- colloc[order(-colloc$PosLRC), ]
head(colloc, 30)