Sparse alternative to the base dist function. WARNING: the result is not a distance metric, see details! Also note that distances are calculated between columns (not between rows, as in the base dist function).
distSparse(M, method = "euclidean", diag = FALSE)
A symmetric matrix of type dsCMatrix, consisting of similarity(!) values instead of distances (viz. max(dist)-dist).
a sparse matrix in a format of the Matrix package, typically dMatrix. Any other matrices will be converted to such a sparse Matrix. The distances will be calculated between the columns of this matrix (different from the base dist function!)
method to calculate distances. Currently only "euclidean" is supported.
should the diagonal be included in the results?
Michael Cysouw <cysouw@mac.com>
A sparse distance matrix is a slightly awkward concept, because distances of zero are rare in most data. Further, it is mostly the small distances that are of interest, and not the large distances (which are mostly also less trustworthy). Note that for random data, this assumption is not necessarily true.
To obtain sparse results, the current implementation takes a special approach. First, distances are calculated only for those pairs of columns for which there is at least some non-zero data in both columns. All other distances are not calculated, on the assumption that they will be uninteresting (and relatively large anyway).
Second, to differentiate the non-calculated distances from real zero distances, the distances are converted into similarities by subtracting them from the maximum. In this way, all non-calculated distances are zero, and the real zeros have value max(dist).
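The two steps above can be illustrated with a minimal sketch in base R. This is not the package's implementation: it uses a small dense matrix for readability, the toy data are hypothetical, and two points are assumptions made for the illustration, namely that a pair of columns counts as calculated when the columns share at least one row with non-zero values, and that the maximum is taken over the calculated distances only.

```r
# Hypothetical toy data: 4 rows, 3 columns (the columns are compared).
# Columns 1 and 2 have no non-zero row in common.
M <- matrix(c(1, 0, 2, 0,
              0, 3, 0, 1,
              1, 1, 0, 0), nrow = 4)

# Step 1 (assumed reading): a pair is calculated only if the two columns
# share at least one row in which both are non-zero.
B <- (M != 0) * 1
shared <- crossprod(B) > 0

# Plain Euclidean distances between columns, used here only for reference;
# base dist() compares rows, hence the transpose.
D <- as.matrix(dist(t(M)))

# Step 2: turn distances into similarities by subtracting from the maximum
# (assumption: maximum over the calculated pairs), and leave every
# non-calculated pair at zero.
S <- (max(D[shared]) - D) * shared
S
```

In this sketch the pair of columns 1 and 2 ends up as zero because it was never calculated, whereas the diagonal (true distance zero) receives the maximum value. Note that, under the assumption made here, the largest calculated distance also maps to zero.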
Euclidean distances are calculated using the following trick: $$colSums(M^2) + rowSums(M^2) - 2 * M'M$$
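The identity behind this trick can be checked on a small dense matrix in base R. The sketch below is only an illustration with hypothetical toy data, not the package's code: it reads the formula as the squared norm of column i plus the squared norm of column j minus twice their inner product, which gives the squared Euclidean distance, and it takes the square root at the end to compare against base dist.

```r
# Hypothetical toy data: 4 rows, 3 columns (the columns are compared).
M <- matrix(c(1, 0, 2, 0,
              0, 3, 0, 1,
              1, 1, 0, 0), nrow = 4)

P  <- crossprod(M)                  # t(M) %*% M: all column inner products
n2 <- colSums(M^2)                  # squared norm of every column
D2 <- outer(n2, n2, "+") - 2 * P    # squared Euclidean distances

# pmax() guards against tiny negative values from rounding before sqrt().
D <- sqrt(pmax(D2, 0))

# Reference: base dist() works on rows, so transpose to compare columns.
ref <- as.matrix(dist(t(M)))
isTRUE(all.equal(D, ref, check.attributes = FALSE))
```

The appeal of this formulation for sparse data is that the only expensive term is the cross-product, which stays sparse when M is sparse.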
See also dist.