dsm.score: Weighting, Scaling and Normalisation of Co-occurrence Matrix (wordspace)

Description

Compute feature scores for a term-document or term-term co-occurrence matrix, using one of several standard association measures. Scores can optionally be rescaled with an isotonic transformation function and centered or standardized. In addition, row vectors can be normalized to unit length wrt. a given norm.

This function has been optimized for efficiency and low memory overhead.

Usage

dsm.score(model, score = "frequency",
          sparse = TRUE, negative.ok = NA,
          transform = c("none", "log", "root", "sigmoid"),
          scale = c("none", "standardize", "center", "scale"),
          normalize = FALSE, method = "euclidean", p = 2, tol = 1e-6,
          matrix.only = FALSE, update.nnzero = FALSE,
          batchsize = 1e6, gc.iter = Inf)

Value

Either an updated DSM model of class dsm (default) or the matrix of (scaled and normalised) association scores (matrix.only=TRUE).

Note that updating DSM models may require a substantial amount of temporary memory (because of the way memory management is implemented in R). This can be problematic when running a 32-bit build of R or when dealing with very large DSM models, so it may be better to return only the scored matrix in such cases.

Arguments

model: a DSM model, i.e. an object of class dsm
score: the association measure to be used for feature weighting; either a character string naming one of the built-in measures or a user-defined function (see “Details” below)
sparse: if TRUE (the default), compute sparse non-negative association scores (see “Details” below). Non-sparse association scores are only allowed if negative.ok=TRUE.
negative.ok: whether operations that introduce negative values into the score matrix (non-sparse association scores, standardization of columns, etc.) are allowed. The default (negative.ok=NA) is TRUE if the co-occurrence matrix $M$ is dense, and FALSE if it is sparse. See “Details” below for the special value negative.ok="nonzero".
transform: scale transformation to be applied to association scores (see “Details” below)
scale: if not "none", standardize columns of the scored matrix by z-transformation ("standardize"), center them without rescaling ("center"), or scale to unit RMS without centering ("scale")
normalize: if TRUE normalize row vectors of scored matrix to unit length, according to the norm indicated by method and p
method, p: norm to be used with normalize=TRUE. See rowNorms for admissible values and details on the corresponding norms
tol: if normalize=TRUE, row vectors with norm below tol are explicitly set to all zeroes instead of attempting to normalize them (see normalize.rows for more information)
matrix.only: whether to return updated DSM model (default) or only the matrix of scores (matrix.only=TRUE)
update.nnzero: if TRUE and a full DSM model is returned, update the counts of nonzero entries in rows and columns according to the matrix of scores (there may be fewer nonzero entries with sparse association scores, or more from dense association scores and/or column scaling)
batchsize: if score is a user-defined function, the co-occurrence matrix is divided into blocks of approx. batchsize elements each in order to reduce memory overhead
gc.iter: how often to run the garbage collector when computing user-defined association scores; gc() is called after every gc.iter batches in order to reclaim temporary data and keep memory overhead as low as possible. This option should only be specified if memory is very tight, since garbage collector runs can be expensive (e.g. when there are many distinct strings in the workspace). With the default value gc.iter=Inf, no calls to gc() will be made; gc.iter=4 seems to give a good trade-off between memory overhead and degraded performance.

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Details

Association measures

Association measures (AM) for feature scoring are defined in the notation of Evert (2008). The most important symbols are $O_{11} = O$ for the observed co-occurrence frequency, $E_{11} = E$ for the co-occurrence frequency expected under a null hypothesis of independence, $R_1$ for the marginal frequency of the target term, $C_1$ for the marginal frequency of the feature term or context, and $N$ for the sample size of the underlying corpus. Evert (2008) explains in detail how these values are computed for different types of co-occurrence; practical examples can be found in the distributional semantics tutorial at http://wordspace.collocations.de/.

Several commonly used AMs are implemented in optimized C++ code for efficiency and minimal memory overhead. They are selected by name, which is passed as a character string in the score argument. See below for a list of built-in measures and their full equations.

Other AMs can be applied by passing a user-defined function in the score argument. See “User-defined association measures” at the end of this section for details.

Built-in association measures

The names of the following measures can be abbreviated to a unique prefix. Equations are given in the notation of Evert (2008).

frequency (default)

Co-occurrence frequency: $$ O_{11} $$ Use this association measure to operate on raw, unweighted co-occurrence frequency data.

MI

(Pointwise) Mutual Information, a log-transformed version of the ratio between observed and expected co-occurrence frequency: $$ \log_2 \frac{O_{11}}{E_{11}} $$ Pointwise MI has a very strong bias towards pairs with low expected co-occurrence frequency (because of $E_{11}$ in the denominator). It should only be applied if low-frequency targets and features have been removed from the DSM.

The sparse version of MI (with negative scores cut off at 0) is sometimes referred to as "positive pointwise Mutual Information" (PPMI) in the literature.

log-likelihood

The $G^2$ statistic of a likelihood ratio test for independence of rows and columns in a contingency table, which is very popular in computational linguistics under the name log-likelihood: $$ \pm 2 \left( \sum_{ij} O_{ij}\cdot \log \frac{O_{ij}}{E_{ij}} \right) $$ This implementation computes signed association scores, which are negative iff $O_{11} < E_{11}$. Log-likelihood has a strong bias towards high co-occurrence frequency and often produces a highly skewed distribution of scores. It may therefore be advisable to combine it with an additional log transformation.

simple-ll

Simple log-likelihood (Evert 2008, p. 1225): $$ \pm 2 \left( O_{11}\cdot \log \frac{O_{11}}{E_{11}} - (O_{11} - E_{11}) \right) $$ This measure provides a good approximation to the full log-likelihood measure (Evert 2008, p. 1235), but can be computed much more efficiently. It is also very similar to the local-MI measure used by several popular DSMs.

Like log-likelihood, this measure computes signed association scores and has a strong bias towards high co-occurrence frequency.

t-score

The t-score association measure, which is popular for collocation identification in computational lexicography: $$ \frac{O_{11} - E_{11}}{\sqrt{O_{11}}} $$ T-score is known to filter out low-frequency data effectively. If used as a non-sparse measure, a “discounted” version with $\sqrt(O + 1)$ in the denominator is computed.

chi-squared

The $X^2$ statistic of Pearson's chi-squared test for independence of rows and columns in a contingency table, with Yates's correction applied: $$ \pm \frac{ N \bigl( | O_{11}O_{22} - O_{12} O_{21} | - N/2 \bigr)^2 }{ R_1 R_2 C_1 C_2 } $$ This implementation computes signed association scores, which are negative iff $O_{11} < E_{11}$.

The formula above gives a more compact form of Yates's correction than the familiar sum over the four cells of the contingency table.

z-score

The z-score association measure, based on a normal approximation to the binomial distribution of co-occurrence by chance: $$ \frac{O_{11} - E_{11}}{\sqrt{E_{11}}} $$ Z-score has a strong bias towards pairs with low expected co-occurrence frequency (because of $E_{11}$ in the denominator). It should only be applied if low-frequency targets and features have been removed from the DSM.

Dice

The Dice coefficient of association, which corresponds to the harmonic mean of the conditional probabilities $P(\mathrm{feature}|\mathrm{target})$ and $P(\mathrm{target}|\mathrm{feature})$: $$ \frac{2 O_{11}}{R_1 + C_1} $$ Note that Dice is inherently sparse: it preserves zeroes and does not produce negative scores.

The following additional scoring functions can be selected:

tf.idf: The tf-idf weighting scheme popular in Information Retrieval: $$ O_{11}\cdot \log \frac{1}{\mathit{df}} $$ where $\mathit{df}$ is the relative document frequency of the corresponding feature term and should be provided as a variable df in the model's column information. Otherwise, it is approximated by the feature's nonzero count $n_p$ (variable nnzero) divided by the number $K$ of rows in the co-occurrence matrix: $$ \mathit{df} = \frac{n_p + 1}{K + 1} $$ The discounting avoids division-by-zero errors when $n_p = 0$.
reweight: Apply scale transformation, column scaling and/or row normalization to previously computed feature scores (from model$S). This is the only score that can be used with a DSM that does not contain raw co-occurrence frequency data.

Sparse association scores

If sparse=TRUE, negative association scores are cut off at 0 in order to (i) ensure that the scored matrix is non-negative and (ii) preserve sparseness. The implementation assumes that association scores are always $\leq 0$ for $O_{11} = 0$ in this case and only computes scores for nonzero entries in a sparse matrix. All built-in association measures satisfy this criterion.

Other researchers sometimes refer to such sparse scores as "positive" measures, most notably positive point-wise Mutual Information (PPMI). Since sparse=TRUE is the default setting, score="MI" actually computes the PPMI measure.

Non-sparse association scores can only be computed if negative.ok=TRUE and will force a dense matrix representation. For this reason, the default is FALSE for a sparse co-occurrence matrix and TRUE for a dense one. A special setting negative.ok="nonzero" is provided for those who wish to abuse dsm.score for collocation analysis. In combination with sparse=FALSE, it will allow negative score values, but compute them only for the nonzero entries of a sparse co-occurrence matrix. For a dense co-occurrence matrix, this setting is fully equivalent to negative.ok=TRUE.

Scale transformations

Association scores can be re-scaled with an isotonic transformation function that preserves sign and ranking of the scores. This is often done in order to de-skew the distribution of scores or as an approximate binarization (presence vs. absence of features). The following built-in transformations are available:

none (default): A linear transformation leaves association scores unchanged. $$ f(x) = x $$
log: The logarithmic transformation has a strong de-skewing effect. In order to preserve sparseness and sign of association scores, a signed and discounted version has been implemented. $$ f(x) = \mathop{\mathrm{sgn}}(x) \cdot \log (|x| + 1) $$
root: The signed square root transformation has a mild de-skewing effect. $$ f(x) = \mathop{\mathrm{sgn}}(x) \cdot \sqrt{|x|} $$
sigmoid: The sigmoid transformation produces a smooth binarization where negative values saturate at $-1$, positive values saturate at $+1$ and zeroes remain unchanged. $$ f(x) = \tanh x $$

User-defined association measures

Instead of the name of a built-in AM, a function implementing a user-defined measure can be passed in the score argument. This function will be applied to the co-occurrence matrix in batches of approximately batchsize elements in order to limit the memory overhead incurred. A user-defined AM can be combined with any of the transformations above, and sparse=TRUE will cut off all negative scores.

The user function can use any of following arguments to access the contingency tables of observed and expected frequencies, following the notation of Evert (2008):

O, E: observed and expected co-occurrence frequency
R1, R2, C1, C2: the row and column marginals of the contingency table
N: sample size
f, f1, f2: the frequency signature of a target-feature pair, a different notation for $f = O$, $f_1 = R_1$ and $f_2 = C_1$
O11, O12, O21, O22: the contingency table of observed frequencies
E11, E12, E21, E22: the contingency table of expected frequencies
rows: a data frame containing information about the target items (from the rows element of model)
cols: a data frame containing information about the feature items (from the cols element of model)
...: must be specified to ignore unused arguments

Except for rows and cols, all these arguments will be numeric vectors of the same lengths or scalar values (N), and the function must return a numeric vector of the same length.

For example, the built-in Mutual Information measure could also be implemented with the user function

    my.MI <- function (O, E, ...) log2(O / E)

and tf.idf scoring could be implemented as follows, provided that the feature information table model$cols contains a column df with relative document frequencies:

    my.tfidf <- function (O11, cols, ...) O11 * log(1 / cols$df)
    dsm.score(model, score=my.tfidf)

Warning: User-defined AMs are much less efficient than the built-in measures and should only be used on large data sets if there is a good reason to do so. Increasing batchsize may speed up the computation to some degree at the expense of bigger memory overhead.

References

More information about assocation measures and the notation for contingency tables can be found at http://www.collocations.de/ and in

Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212--1248. Mouton de Gruyter, Berlin, New York.

Examples

Run this code

model <- DSM_TermTerm
model$M # raw co-occurrence matrix
  
model <- dsm.score(model, score="MI")
round(model$S, 3) # PPMI scores
  
model <- dsm.score(model, score="reweight", transform="sigmoid")
round(model$S, 3) # additional sigmoid transformation

## user-defined scoring functions can implement additional measures,
## e.g. the conditional probability Pr(feature | target) as a percentage
my.CP <- function (O11, R1, ...) 100 * O11 / R1  # "..." is mandatory
model <- dsm.score(model, score=my.CP)
round(model$S, 3)

## shifted PPMI (with k = 2) creates all-zero rows and columns
model <- dsm.score(model, score=function (O, E, ...) log2(O / E) - 2,
                   normalize=TRUE, update.nnzero=TRUE)
round(model$S, 3) # normalization preserves all-zero rows
## use subset to remove such rows and columns
m2 <- subset(model, nnzero > 0, nnzero > 0) # must have updated nnzero counts
round(m2$S, 3)

if (FALSE) {
# visualization of the scale transformations implemented by dsm.score
x <- seq(-2, 4, .025)
plot(x, x, type="l", lwd=2, xaxs="i", yaxs="i", xlab="x", ylab="f(x)")
abline(h=0, lwd=0.5); abline(v=0, lwd=0.5)
lines(x, sign(x) * log(abs(x) + 1), lwd=2, col=2)
lines(x, sign(x) * sqrt(abs(x)), lwd=2, col=3)
lines(x, tanh(x), lwd=2, col=4)
legend("topleft", inset=.05, bg="white", lwd=3, col=1:4,
       legend=c("none", "log", "root", "sigmoid"))
}

Run the code above in your browser using DataLab