dsm.score: Weighting, Scaling and Normalisation of Co-occurrence Matrix (wordspace)

Description

Compute feature scores for a term-document or term-term co-occurrence matrix, using one of several standard association measures. Scores can optionally be rescaled with an isotonic transformation function and centered or standardized. In addition, row vectors can be normalized to unit length with respect to a given norm.

This function has been optimized for efficiency and low memory overhead.

Usage

dsm.score(model, score = "frequency",
          sparse = TRUE, negative.ok = NA,
          transform = c("none", "log", "root", "sigmoid"),
          scale = c("none", "standardize", "center", "scale"),
          normalize = FALSE, method = "euclidean", p = 2,
          matrix.only = FALSE, update.nnzero = FALSE)

Arguments

model

a DSM model, i.e. an object of class dsm

score

the association measure to be used for feature weighting (see “Details” below)

sparse

if TRUE (the default), compute sparse non-negative association scores (see “Details” below). Non-sparse association scores are only allowed if negative.ok=TRUE.

negative.ok

whether operations that introduce negative values into the score matrix (non-sparse association scores, standardization of columns, etc.) are allowed. The default (negative.ok=NA) is TRUE if the co-occurrence matrix $M$ is dense, and FALSE if it is sparse.

transform

scale transformation to be applied to association scores (see “Details” below)

scale

if not "none", standardize columns of the scored matrix by z-transformation ("standardize"), center them without rescaling ("center"), or scale to unit RMS without centering ("scale")

normalize

if TRUE, normalize the row vectors of the scored matrix to unit length, according to the norm indicated by method and p

method, p

norm to be used with normalize=TRUE. See rowNorms for admissible values and details on the corresponding norms

matrix.only

whether to return an updated DSM model (default) or only the matrix of scores (matrix.only=TRUE)

update.nnzero

if TRUE and a full DSM model is returned, update the counts of nonzero entries in rows and columns according to the matrix of scores (there may be fewer nonzero entries with sparse association scores, or more from dense association scores and/or column scaling)
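
The effect of the scale and normalize arguments can be illustrated in base R without the package. The following is a minimal sketch on a made-up score matrix S, not the package's implementation: in particular, base R's scale() divides by the sample standard deviation, and it is an assumption that this matches the "standardize" option exactly.

```r
# Toy score matrix: 3 targets (rows) x 2 features (columns); values are made up
S <- matrix(c(1, 2, 3,
              4, 0, 8), nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("f1", "f2")))

# scale = "center": subtract the column means without rescaling
centered <- sweep(S, 2, colMeans(S), "-")

# scale = "standardize": z-transformation of each column
# (assumption: sample standard deviation, as used by base R's scale())
standardized <- scale(S, center = TRUE, scale = TRUE)

# normalize = TRUE with the Euclidean norm: divide each row by its length
unit_rows <- S / sqrt(rowSums(S^2))

colMeans(centered)     # column means are zero after centering
rowSums(unit_rows^2)   # squared row lengths are one after normalization
```

Centering and standardization introduce negative values even when all scores are non-negative, which is why these options interact with negative.ok.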

Value

Either an updated DSM model of class dsm (default) or the matrix of (scaled and normalised) association scores (matrix.only=TRUE).

Note that updating DSM models may require a substantial amount of temporary memory (because of the way memory management is implemented in R). This can be problematic when running a 32-bit build of R or when dealing with very large DSM models, so it may be better to return only the scored matrix in such cases.

Details

Association measures

The following association measures can be used for feature scoring. Equations are given in the notation of Evert (2008). The most important symbols are $O_{11}$ for the observed co-occurrence frequency, $E_{11}$ for the co-occurrence frequency expected under a null hypothesis of independence, $R_1$ for the marginal frequency of the target term, $C_1$ for the marginal frequency of the feature term or context, and $N$ for the sample size of the underlying corpus. Evert (2008) explains in detail how these values are computed for different types of co-occurrence.

frequency (default)

Co-occurrence frequency: $$ O_{11} $$ Use this association measure to operate on raw, unweighted co-occurrence frequency data.

MI

(Pointwise) Mutual Information, a log-transformed version of the ratio between observed and expected co-occurrence frequency: $$ \log_2 \frac{O_{11}}{E_{11}} $$ Pointwise MI has a very strong bias towards pairs with low expected co-occurrence frequency (because of $E_{11}$ in the denominator). It should only be applied if low-frequency targets and features have been removed from the DSM.

The sparse version of MI (with negative scores cut off at 0) is sometimes referred to as "positive pointwise Mutual Information" (PPMI) in the literature.
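
As an illustration, MI and its sparse (PPMI) variant can be computed by hand in base R. This sketch uses made-up counts and assumes the simplest setting, in which $R_1$ and $C_1$ are plain row and column sums, $N$ is the grand total and $E_{11} = R_1 C_1 / N$; Evert (2008) describes how the marginals are obtained for other types of co-occurrence.

```r
# Toy term-term co-occurrence counts (made-up numbers)
O <- matrix(c(10, 0, 2,
               1, 5, 0), nrow = 2, byrow = TRUE,
            dimnames = list(c("cat", "dog"), c("purr", "bark", "eat")))

# Assumption: marginals are row/column sums and N is the grand total
R1 <- rowSums(O)
C1 <- colSums(O)
N  <- sum(O)
E  <- outer(R1, C1) / N      # expected frequencies E11 under independence

MI   <- log2(O / E)          # -Inf wherever O11 = 0
PPMI <- pmax(MI, 0)          # sparse version: negative scores cut off at 0
round(PPMI, 3)
```

Cells with $O_{11} = 0$ have an MI of $-\infty$ and are therefore mapped to 0 by the cutoff, so the zero pattern of the count matrix is preserved.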

simple-ll

Simple log-likelihood (Evert 2008, p. 1225): $$ \pm 2 \left( O_{11}\cdot \log \frac{O_{11}}{E_{11}} - (O_{11} - E_{11}) \right) $$ This measure provides a good approximation to the full log-likelihood measure (Evert 2008, p. 1235), but can be computed much more efficiently. It is also very similar to the local-MI measure used by several popular DSMs. The implementation used here computes signed association scores, which are negative iff $O_{11} < E_{11}$.

Log-likelihood has a strong bias towards high co-occurrence frequency and often produces a highly skewed distribution of scores. It may therefore be advisable to combine it with an additional log transformation.
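
The signed simple-ll formula can be written as a small base R function. This is a sketch rather than the package's code; the helper name simple_ll is hypothetical, and treating $0 \cdot \log 0$ as 0 for $O_{11} = 0$ is an assumption based on the usual convention.

```r
# Signed simple log-likelihood for observed (O) and expected (E) frequencies
simple_ll <- function(O, E) {
  # O * log(O / E), with the convention 0 * log(0) = 0 (assumption)
  term <- ifelse(O > 0, O * log(O / E), 0)
  # The sign is negative iff O < E
  sign(O - E) * 2 * (term - (O - E))
}

simple_ll(10, 5)   # positive: more co-occurrences than expected
simple_ll(2, 5)    # negative: fewer co-occurrences than expected
simple_ll(0, 5)    # -2 * E when nothing was observed

# Combined with the signed log transformation to reduce skew
x <- simple_ll(c(200, 10, 2), c(100, 5, 5))
sign(x) * log(abs(x) + 1)
```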

t-score

The t-score association measure, which is popular for collocation identification in computational lexicography: $$ \frac{O_{11} - E_{11}}{\sqrt{O_{11}}} $$ T-score is known to filter out low-frequency data effectively.

z-score

The z-score association measure, based on a normal approximation to the binomial distribution of co-occurrence by chance: $$ \frac{O_{11} - E_{11}}{\sqrt{E_{11}}} $$ Z-score has a strong bias towards pairs with low expected co-occurrence frequency (because of $E_{11}$ in the denominator). It should only be applied if low-frequency targets and features have been removed from the DSM.
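
The different frequency biases of t-score and z-score can be seen by applying both formulas to a rare and a frequent pair. The numbers below are made up for illustration, and the helper names t_score and z_score are hypothetical.

```r
# Formulas as given above; O = observed, E = expected co-occurrence frequency
t_score <- function(O, E) (O - E) / sqrt(O)
z_score <- function(O, E) (O - E) / sqrt(E)

# Two made-up pairs: a rare one (E = 0.02) and a frequent one (E = 100)
O <- c(rare = 2,    frequent = 200)
E <- c(rare = 0.02, frequent = 100)

t_res <- t_score(O, E)   # ranks the frequent pair higher
z_res <- z_score(O, E)   # ranks the rare pair higher
round(rbind(t = t_res, z = z_res), 2)
```

With these values, the small $E_{11}$ in the denominator inflates the z-score of the rare pair, whereas the t-score, which divides by $\sqrt{O_{11}}$, keeps it low.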

Dice

The Dice coefficient of association, which corresponds to the harmonic mean of the conditional probabilities $P(\mathrm{feature}|\mathrm{target})$ and $P(\mathrm{target}|\mathrm{feature})$: $$ \frac{2 O_{11}}{R_1 + C_1} $$ Note that Dice is inherently sparse: it preserves zeroes and does not produce negative scores.
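
A short base R sketch on made-up counts shows the Dice formula and checks it against the harmonic mean of the two conditional probabilities. It assumes that $R_1$ and $C_1$ are plain row and column sums of the count matrix.

```r
# Toy co-occurrence counts (made-up numbers)
O <- matrix(c(10, 0, 2,
               1, 5, 0), nrow = 2, byrow = TRUE)

R1 <- rowSums(O)   # marginal frequencies of the targets (assumption: row sums)
C1 <- colSums(O)   # marginal frequencies of the features (assumption: column sums)

# Dice coefficient: 2 * O11 / (R1 + C1) for every cell
dice <- 2 * O / outer(R1, C1, "+")

# Harmonic mean of P(feature|target) and P(target|feature) for cell [1, 1]
p <- O[1, 1] / R1[1]
q <- O[1, 1] / C1[1]
c(dice = dice[1, 1], harmonic_mean = 2 * p * q / (p + q))

round(dice, 3)     # zeroes are preserved and no score is negative
```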

The following additional scoring functions can be selected:

tf.idf: The tf-idf weighting scheme popular in Information Retrieval: $$ O_{11}\cdot \log \frac{1}{\mathit{df}} $$ where $\mathit{df}$ is the relative document frequency of the corresponding feature term and should be provided as a variable df in the model's column information. Otherwise, it is approximated by the feature's nonzero count $n_p$ (variable nnzero) divided by the number $K$ of rows in the co-occurrence matrix: $$ \mathit{df} = \frac{n_p + 1}{K + 1} $$ The discounting avoids division-by-zero errors when $n_p = 0$.
reweight: Apply scale transformation, column scaling and/or row normalization to previously computed feature scores (from model$S). This is the only score that can be used with a DSM that does not contain raw co-occurrence frequency data.
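
The tf.idf weighting with the approximated document frequency can be sketched in base R as follows. The matrix M is made up, and the nonzero counts are computed directly from it here rather than being read from the nnzero variable in a model's column information.

```r
# Toy co-occurrence matrix with K = 4 rows and 3 feature columns (made-up counts)
M <- matrix(c(3, 0, 1,
              2, 0, 0,
              1, 4, 0,
              5, 0, 0), nrow = 4, byrow = TRUE)

K   <- nrow(M)
n_p <- colSums(M != 0)         # nonzero count of each feature
df  <- (n_p + 1) / (K + 1)     # discounted relative document frequency

# tf.idf score: O11 * log(1 / df), applied column by column
tfidf <- sweep(M, 2, log(1 / df), "*")
round(tfidf, 3)
```

In this toy example the first feature is nonzero in every row, so its approximated $\mathit{df}$ is 1 and all of its scores become 0.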

Sparse association scores

If sparse=TRUE, negative association scores are cut off at 0 in order to (i) ensure that the scored matrix is non-negative and (ii) preserve sparseness. The implementation assumes that association scores are always $\leq 0$ for $O_{11} = 0$ in this case and only computes scores for nonzero entries in a sparse matrix. All built-in association measures satisfy this criterion.

Other researchers sometimes refer to such sparse scores as "positive" measures, most notably positive point-wise Mutual Information (PPMI). Since sparse=TRUE is the default setting, score="MI" actually computes the PPMI measure.
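
The reason why the cutoff preserves sparseness can be checked on a small example. The sketch below uses the z-score formula on made-up counts, with $E_{11}$ computed from row sums, column sums and the grand total (an assumption that fits only the simplest type of co-occurrence data).

```r
# Toy sparse count matrix (made-up numbers, 4 of 9 cells are zero)
O <- matrix(c(4, 0, 0,
              0, 3, 1,
              1, 0, 6), nrow = 3, byrow = TRUE)
E <- outer(rowSums(O), colSums(O)) / sum(O)

z        <- (O - E) / sqrt(E)   # non-sparse scores: every cell gets a value
z_sparse <- pmax(z, 0)          # sparse scores: negative values cut off at 0

sum(z != 0)                     # dense result, no zero cells left
all(z[O == 0] <= 0)             # scores for O11 = 0 are never positive
all(z_sparse[O == 0] == 0)      # so the cutoff maps them back to zero
```

Because every cell with $O_{11} = 0$ receives a score of at most 0, the sparse scores only need to be computed for the nonzero entries of the matrix.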

Scale transformations

Association scores can be re-scaled with an isotonic transformation function that preserves sign and ranking of the scores. This is often done in order to de-skew the distribution of scores or as an approximate binarization (presence vs. absence of features). The following built-in transformations are available:

none (default): The identity transformation leaves association scores unchanged. $$ f(x) = x $$
log: The logarithmic transformation has a strong de-skewing effect. In order to preserve sparseness and sign of association scores, a signed and discounted version has been implemented. $$ f(x) = \mathop{\mathrm{sgn}}(x) \cdot \log (|x| + 1) $$
root: The signed square root transformation has a mild de-skewing effect. $$ f(x) = \mathop{\mathrm{sgn}}(x) \cdot \sqrt{|x|} $$
sigmoid: The sigmoid transformation produces a smooth binarization where negative values saturate at $-1$, positive values saturate at $+1$ and zeroes remain unchanged. $$ f(x) = \tanh x $$
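
The properties claimed for these transformations (zeroes stay zero, signs and ranking are preserved) can be verified numerically. The following base R sketch re-implements the four formulas above in a hypothetical list named transforms and applies them to a few made-up scores.

```r
# The four transformations, written out from the formulas above
transforms <- list(
  none    = function(x) x,
  log     = function(x) sign(x) * log(abs(x) + 1),
  root    = function(x) sign(x) * sqrt(abs(x)),
  sigmoid = function(x) tanh(x)
)

# Made-up association scores in increasing order
x <- c(-3, -0.5, 0, 0.5, 3, 50)

# One column per transformation
res <- sapply(transforms, function(f) f(x))
round(res, 3)

# Sign and ranking are preserved by every transformation
apply(res, 2, function(col) all(sign(col) == sign(x)) && all(diff(col) > 0))
```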

References

More information about association measures and the notation for contingency tables can be found at http://www.collocations.de/ and in

Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.

Examples

# NOT RUN {
library(wordspace) # provides dsm.score() and the DSM_TermTerm example model
model <- DSM_TermTerm
model$M # raw co-occurrence matrix
  
model <- dsm.score(model, score="MI")
round(model$S, 3) # PPMI scores
  
model <- dsm.score(model, score="reweight", transform="sigmoid")
round(model$S, 3) # additional sigmoid transformation
  
# }
# NOT RUN {
# visualization of the scale transformations implemented by dsm.score
x <- seq(-2, 4, .025)
plot(x, x, type="l", lwd=2, xaxs="i", yaxs="i", xlab="x", ylab="f(x)")
abline(h=0, lwd=0.5); abline(v=0, lwd=0.5)
lines(x, sign(x) * log(abs(x) + 1), lwd=2, col=2)
lines(x, sign(x) * sqrt(abs(x)), lwd=2, col=3)
lines(x, tanh(x), lwd=2, col=4)
legend("topleft", inset=.05, bg="white", lwd=3, col=1:4,
       legend=c("none", "log", "root", "sigmoid"))
# }