dist.matrix: Distances/Similarities between Row or Column Vectors (wordspace)

Description

Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix; or compute cross-distances between the rows or columns of two different matrices. This implementation is faster than dist and can operate on sparse matrices (in canonical DSM format).

Usage

dist.matrix(M, M2 = NULL, method = "cosine", p = 2, 
            normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE, 
            terms = NULL, terms2 = terms, skip.missing = FALSE)

Value

By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors. A similarity matrix is marked by an additional attribute similarity with value TRUE. If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.

If as.dist=TRUE, the matrix is compacted to an object of class dist.

Arguments

M: a dense or sparse matrix representing a scored DSM, or an object of class dsm
M2: an optional dense or sparse matrix representing a second scored DSM, or an object of class dsm. If present, cross-distances between the rows (or columns) of M and those of M2 will be computed.
method: distance or similarity measure to be used (see “Distance Measures” below for details)
p: exponent of the minkowski $L_p$-metric, a numeric value in the range $0 \le p < \infty$. The range $0 \le p < 1$ represents a generalization of the standard Minkowski distance, which cannot be derived from a proper mathematical norm (see details below).
normalized: if TRUE, assume that the row (or column) vectors of M and M2 have been appropriately normalized (depending on the selected distance measure) in order to speed up calculations. This option is most often used with the cosine measure, for which vectors must be normalized with respect to the Euclidean norm. Apart from the overlap measure (see “Distance Measures” below), it is currently ignored for other distance measures.
byrow: whether to calculate distances between row vectors (default) or between column vectors (byrow=FALSE)
convert: if TRUE, similarity measures are automatically converted to distances in an appropriate way (see “Distance Measures” below for details). Note that this is the default setting and convert=FALSE has to be specified explicitly in order to obtain a similarity matrix.
as.dist: convert the full symmetric distance matrix to a compact object of class dist. This option cannot be used if cross-distances are calculated (with argument M2) or if a similarity measure has been selected (with option convert=FALSE).
terms: a character vector specifying rows of M for which the distance matrix is to be computed (or columns if byrow=FALSE)
terms2: a character vector specifying rows of M2 for which the cross-distance matrix is to be computed (or columns if byrow=FALSE). If only the argument terms is specified, the same set of rows (or columns) will be selected from both M and M2; you can explicitly specify terms2=NULL in order to compute cross-distances for all rows (or columns) of M2.
skip.missing: if TRUE, silently ignore terms not found in M (or in M2). By default (skip.missing=FALSE) an error is raised in this case.
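To make the roles of M, M2, byrow and terms more concrete, the following base R sketch computes a small Euclidean cross-distance matrix by hand. It is only an illustration of the concept and not how dist.matrix is implemented; the helper cross_euclid and the toy matrices are hypothetical.

```r
# Toy stand-ins for two scored DSMs (hypothetical data, dense matrices only)
M <- matrix(c(1, 0, 2,
              0, 3, 1), nrow = 2, byrow = TRUE,
            dimnames = list(c("cat", "dog"), c("f1", "f2", "f3")))
M2 <- matrix(c(1, 1, 1,
               0, 0, 2,
               2, 0, 0), nrow = 3, byrow = TRUE,
             dimnames = list(c("animal", "time", "cause"), c("f1", "f2", "f3")))

# Hypothetical helper: Euclidean distance between every row of A and every row of B
cross_euclid <- function(A, B) {
  out <- outer(seq_len(nrow(A)), seq_len(nrow(B)),
               Vectorize(function(i, j) sqrt(sum((A[i, ] - B[j, ])^2))))
  dimnames(out) <- list(rownames(A), rownames(B))
  out
}

D <- cross_euclid(M, M2)        # rows of M against rows of M2 (like passing M2)
Dcol <- cross_euclid(t(M), t(M)) # column vectors instead (like byrow = FALSE)

# Restricting the computation to selected rows (like the terms argument)
Dsub <- cross_euclid(M["dog", , drop = FALSE], M2)
```

A cross-distance matrix such as D is rectangular in general, which is why it cannot be compacted into a dist object.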

Distance Measures

Given two DSM vectors $x$ and $y$, the following distance metrics can be computed:

euclidean

The Euclidean distance given by $$ d_2(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }$$

manhattan

The Manhattan (or “city block”) distance given by $$ d_1(x, y) = \sum_i |x_i - y_i|$$

maximum

The maximum distance given by $$ d_{\infty}(x, y) = \max_i |x_i - y_i|$$

minkowski

The Minkowski distance is a family of metrics determined by a parameter $0 \le p < \infty$, which encompasses the Euclidean, Manhattan and maximum distance as special cases. Also known as $L_p$-metric, it is defined by $$ d_p(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$$ for $p \ge 1$ and by $$ d_p(x, y) = \sum_i | x_i - y_i |^p$$ for $0 \le p < 1$. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).

Special cases include the Euclidean metric $d_2(x, y)$ for $p = 2$ and the Manhattan metric $d_1(x, y)$ for $p = 1$, but the dedicated methods above provide more efficient implementations. For $p \to \infty$, $d_p(x, y)$ converges to the maximum distance $d_{\infty}(x, y)$, which is also selected by setting p=Inf. For $p = 0$, $d_p(x, y)$ corresponds to the Hamming distance, i.e. the number of differences $$ d_0(x, y) = \#\{ i | x_i \ne y_i \}$$
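The case distinctions above can be sketched in a few lines of base R. The function minkowski_sketch is a hypothetical name and only mirrors the formulas given here for a single pair of vectors; it makes no claim about how the package computes them.

```r
# Hypothetical helper following the definitions of d_p above
minkowski_sketch <- function(x, y, p = 2) {
  stopifnot(length(x) == length(y), p >= 0)
  d <- abs(x - y)
  if (p == 0) return(sum(d != 0))   # Hamming distance; needed because 0^0 is 1 in R
  if (is.infinite(p)) return(max(d)) # maximum distance
  if (p < 1) return(sum(d^p))        # generalized case without the 1/p root
  sum(d^p)^(1 / p)                   # proper L_p metric for p >= 1
}

x <- c(1, 2, 3)
y <- c(4, 0, 3)

minkowski_sketch(x, y, p = 1)    # Manhattan: 3 + 2 + 0 = 5
minkowski_sketch(x, y, p = 2)    # Euclidean: sqrt(13)
minkowski_sketch(x, y, p = Inf)  # maximum: 3
minkowski_sketch(x, y, p = 99)   # already very close to the maximum distance
minkowski_sketch(x, y, p = 0.5)  # sqrt(3) + sqrt(2)
minkowski_sketch(x, y, p = 0)    # Hamming: 2 positions differ
```

The explicit branch for p = 0 matters in such a sketch: summing d^0 directly would also count positions where the vectors agree.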

canberra

The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by $$ \sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|}$$ (see https://en.wikipedia.org/wiki/Canberra_distance). Terms with $x_i = y_i = 0$ are silently dropped from the summation.

Note that dist used a different formula in R versions before 3.5.0, namely $$ \sum_i \frac{|x_i - y_i|}{|x_i + y_i|}$$ which is highly problematic unless $x$ and $y$ are guaranteed to be non-negative. In dist, terms with $x_i = y_i = 0$ are not dropped but imputed, i.e. set to the average value of all nonzero terms.
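The difference between the two denominators only shows up for vectors with mixed signs. The base R sketch below evaluates both formulas by hand on a toy example; canberra_abs and canberra_alt are hypothetical helper names, and neither function calls dist or dist.matrix.

```r
# Formula with |x_i| + |y_i| in the denominator; terms with x_i = y_i = 0 are dropped
canberra_abs <- function(x, y) {
  num <- abs(x - y)
  den <- abs(x) + abs(y)
  keep <- den > 0
  sum(num[keep] / den[keep])
}

# Alternative formula with |x_i + y_i| in the denominator (no zero handling here)
canberra_alt <- function(x, y) {
  sum(abs(x - y) / abs(x + y))
}

x <- c(1, -2, 4)
y <- c(-0.5, 2.5, 4)

canberra_abs(x, y)  # 1.5/1.5 + 4.5/4.5 + 0/8 = 2
canberra_alt(x, y)  # 1.5/0.5 + 4.5/0.5 + 0/8 = 12

canberra_abs(c(0, 1), c(0, 3))  # first term dropped, 2/4 = 0.5
```

With the first formula every term lies between 0 and 1, whereas the alternative can grow without bound when $x_i$ and $y_i$ nearly cancel.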

In addition, the following similarity measures can be computed and optionally converted to a distance metric (or dissimilarity):

cosine (default)

The cosine similarity given by $$ \cos \phi = \frac{x^T y}{||x||_2 \cdot ||y||_2} $$ If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to angular distance $\phi$, given in degrees ranging from 0 to 180.
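As a small worked illustration of the conversion from cosine similarity to angular distance, consider the following base R sketch. The helper names cosine_sim and angular_dist are hypothetical, and the clamping step is an assumption added here to guard against rounding errors; zero vectors are not handled.

```r
# Cosine similarity; with normalized = TRUE the denominator is omitted,
# i.e. both vectors are assumed to have Euclidean norm 1 already
cosine_sim <- function(x, y, normalized = FALSE) {
  s <- sum(x * y)
  if (!normalized) s <- s / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  s
}

# Angular distance in degrees (0 to 180)
angular_dist <- function(x, y, normalized = FALSE) {
  s <- cosine_sim(x, y, normalized = normalized)
  s <- min(1, max(-1, s))  # clamp values such as 1 + 1e-16 before acos()
  acos(s) * 180 / pi
}

angular_dist(c(1, 0), c(1, 1))   # 45 degrees
angular_dist(c(1, 0), c(0, 2))   # orthogonal vectors: 90 degrees
angular_dist(c(1, 0), c(-3, 0))  # opposite directions: 180 degrees

# Pre-normalized vectors give the same result without the denominator
u <- c(1, 1) / sqrt(2)
angular_dist(c(1, 0), u, normalized = TRUE)
```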

jaccard

The generalized Jaccard coefficient given by $$ J(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i \max(x_i, y_i) } $$ which is only defined for non-negative vectors $x$ and $y$. If convert=TRUE (the default), the Jaccard metric $1 - J(x,y)$ is returned (see Kosub 2016 for details). Note that $J(0, 0) = 1$.
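A minimal base R sketch of the generalized Jaccard coefficient for a single pair of vectors, following the formula and the $J(0, 0) = 1$ convention stated above; jaccard_sim is a hypothetical helper name.

```r
# Generalized Jaccard coefficient for non-negative vectors
jaccard_sim <- function(x, y) {
  stopifnot(all(x >= 0), all(y >= 0))
  den <- sum(pmax(x, y))
  if (den == 0) return(1)  # convention J(0, 0) = 1
  sum(pmin(x, y)) / den
}

x <- c(1, 2, 0)
y <- c(2, 1, 1)

jaccard_sim(x, y)      # (1 + 1 + 0) / (2 + 2 + 1) = 0.4
1 - jaccard_sim(x, y)  # corresponding Jaccard metric
jaccard_sim(c(0, 0, 0), c(0, 0, 0))
```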

overlap

An asymmetric measure of overlap given by $$ o(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i x_i } $$ for non-negative vectors $x$ and $y$. If convert=TRUE (the default), the result is converted into a dissimilarity measure $1 - o(x,y)$, which is not a metric, of course. Note that $o(0, y) = 1$ and in particular $o(0, 0) = 1$.

Overlap computes the proportion of the “mass” of $x$ that is shared with $y$; as a consequence, $o(x, y) = 1$ whenever $x \le y$. If both vectors are normalized as probability distributions ($||x||_1 = ||y||_1 = 1$) then overlap is symmetric ($o(x, y) = o(y, x)$) and can be thought of as the shared probability mass of the two distributions. In this case, normalized=TRUE can be passed in order to simplify the computation to $o(x, y) = \sum_i \min(x_i, y_i)$.
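The asymmetry of overlap and its symmetric special case for probability distributions can be checked with a short base R sketch. The helper overlap_sim is a hypothetical name that simply follows the formulas above for one pair of vectors.

```r
# Overlap o(x, y) for non-negative vectors; with normalized = TRUE the
# denominator is omitted, assuming both vectors already sum to 1
overlap_sim <- function(x, y, normalized = FALSE) {
  stopifnot(all(x >= 0), all(y >= 0))
  shared <- sum(pmin(x, y))
  if (normalized) return(shared)
  if (sum(x) == 0) return(1)  # convention o(0, y) = 1
  shared / sum(x)
}

x <- c(1, 2, 0)
y <- c(2, 1, 1)

overlap_sim(x, y)            # 2 / 3 of the mass of x is shared with y
overlap_sim(y, x)            # 2 / 4 of the mass of y is shared with x
overlap_sim(c(1, 1, 0), y)   # x <= y in every component, so the overlap is 1

# Normalized as probability distributions, overlap becomes symmetric
p <- x / sum(x)
q <- y / sum(y)
overlap_sim(p, q, normalized = TRUE)
overlap_sim(q, p, normalized = TRUE)
```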

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Examples

library(wordspace)  # provides dist.matrix and the example data

M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE)                     # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE) # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE) # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE)  # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE) # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE)         # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE) # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE)  # Hamming distance

round(dist.matrix(M, method="cosine", convert=FALSE), 3) # cosine similarity
