corSparse: Pearson correlation between columns (sparse matrices)

Description

This function computes the product-moment correlation coefficients between the columns of sparse matrices. Performance-wise, this improves over the approach taken in the cor function. However, because the resulting matrix is not-sparse, this function still cannot be used with very large matrices.

Usage

corSparse(X, Y = NULL, cov = FALSE)

Value

The result is a regular square (non-sparse!) Matrix with the Pearson product-moment correlation coefficients between the columns of X.

When Y is specified, the result is a rectangular (non-sparse!) Matrix of size nrow(X) by nrow(Y) with the correlation coefficients between the columns of X and Y.

When cov = T, the result is a covariance matrix (i.e. a non-normalized correlation).

Arguments

X: a sparse matrix in a format of the Matrix package, typically dgCMatrix . The correlations will be calculated between the columns of this matrix.
Y: a second matrix in a format of the Matrix package. When Y = NULL, then the correlations between the columns of X and itself will be taken. If Y is specified, the association between the columns of X and the columns of Y will be calculated.
cov: when TRUE the covariance matrix is returned, instead of the default correlation matrix.

Author

Michael Cysouw

Slightly extended and optimized, based on the code from a discussion at stackoverflow.

Details

To compute the covariance matrix, the code uses the principle that $$E[(X - \mu(X))' (Y - \mu(Y))] = E[X' Y] - \mu(X') \mu(Y)$$ With sample correction n/(n-1) this leads to the covariance between X and Y as $$( X' Y - n * \mu(X') \mu(Y) ) / (n-1)$$

The computation of the standard deviation (to turn covariance into correlation) is trivial in the case Y = NULL, as they are found on the diagonal of the covariance matrix. In the case Y != NULL uses the principle that $$E[X - \mu(X)]^2 = E[X^2] - \mu(X)^2$$ With sample correction n/(n-1) this leads to $$sd^2 = ( X^2 - n * \mu(X)^2 ) / (n-1)$$

Examples

Run this code


# reasonably fast (though not instantly!) with
# sparse matrices 1e4x1e4 up to a resulting matrix size of 1e8 cells.
# However, the calculations and the resulting matrix take up lots of memory

X <- rSparseMatrix(1e3, 1e3, 1e4)
system.time(M <- corSparse(X))
print(object.size(M), units = "auto") # more than 750 Mb

# Most values are low, so it often makes sense 
# to remove low values to keep results sparse

M <- drop0(M, tol = 0.4)
print(object.size(M), units = "auto") # normally reduces size to about a quarter
length(M@x) / prod(dim(M)) # down to less than 0.01% non-zero entries

# \donttest{
# comparison with other methods
# corSparse is much faster than cor from the stats package
# but cosSparse is even quicker than both!
# do not try the regular cor-method with larger matrices than 1e3x1e3
X <- rSparseMatrix(1e3, 1e3, 1e4)
X2 <- as.matrix(X)

# if there is a warning, try again with different random X
system.time(McorRegular <- cor(X2)) 
system.time(McorSparse <- corSparse(X))
system.time(McosSparse <- cosSparse(X))

# cor and corSparse give identical results
all.equal(McorSparse, McorRegular)

# corSparse and cosSparse are not identical, but close
McosSparse <- as.matrix(McosSparse)
dimnames(McosSparse) <- NULL
all.equal(McorSparse, McosSparse) 

# Actually, cosSparse and corSparse are *almost* identical!
cor(as.dist(McorSparse), as.dist(McosSparse))

# So: consider using cosSparse instead of cor or corSparse.
# With sparse matrices, this gives mostly the same results, 
# but much larger matrices are possible
# and the computations are quicker and more sparse
# }

Run the code above in your browser using DataLab