Documents are grouped for the purposes of scoring, but collocations will not span sentences.
If x
is a tokens object and some tokens have been removed, this should be done
using tokens_remove(x, pattern, padding = TRUE)
so that counts will still be
accurate, but the pads will prevent those collocations from being scored.
The lambda
computed for a size = \(K\)-word target multi-word
expression the coefficient for the \(K\)-way interaction parameter in the
saturated log-linear model fitted to the counts of the terms forming the set
of eligible multi-word expressions. This is the same as the "lambda" computed
in Blaheta and Johnson's (2001), where all multi-word expressions are
considered (rather than just verbs, as in that paper). The z
is the
Wald \(z\)-statistic computed as the quotient of lambda
and the Wald
statistic for lambda
as described below.
In detail:
Consider a \(K\)-word target expression \(x\), and let \(z\) be any
\(K\)-word expression. Define a comparison function \(c(x,z)=(j_{1},
\dots, j_{K})=c\) such that the \(k\)th element of \(c\) is 1 if the
\(k\)th word in \(z\) is equal to the \(k\)th word in \(x\), and 0
otherwise. Let \(c_{i}=(j_{i1}, \dots, j_{iK})\), \(i=1, \dots,
2^{K}=M\), be the possible values of \(c(x,z)\), with \(c_{M}=(1,1,
\dots, 1)\). Consider the set of \(c(x,z_{r})\) across all expressions
\(z_{r}\) in a corpus of text, and let \(n_{i}\), for \(i=1,\dots,M\),
denote the number of the \(c(x,z_{r})\) which equal \(c_{i}\), plus the
smoothing constant smoothing
. The \(n_{i}\) are the counts in a
\(2^{K}\) contingency table whose dimensions are defined by the
\(c_{i}\).
\(\lambda\): The \(K\)-way interaction parameter in the saturated
loglinear model fitted to the \(n_{i}\). It can be calculated as
$$\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}$$
where \(b_{i}\) is the number of the elements of \(c_{i}\) which are
equal to 1.
Wald test \(z\)-statistic \(z\) is calculated as:
$$z = \frac{\lambda}{[\sum_{i=1}^{M} n_{i}^{-1}]^{(1/2)}}$$