cor: Fast calculations of Pearson correlation.

Description

These functions implement a faster calculation of Pearson correlation.

The speedup over R's standard cor function is substantial, particularly when the input matrix contains only a small number of missing values. If there are no missing data, or if the missing data are numerous, the speedup is smaller but still present.

Usage

cor(x, y = NULL, 
    use = "all.obs", 
    method = c("pearson", "kendall", "spearman"),
    quick = 0, 
    cosine = FALSE, 
    cosineX = cosine,
    cosineY = cosine, 
    drop = FALSE,
    nThreads = 0, 
    verbose = 0, indent = 0)
corFast(x, y = NULL, 
    use = "all.obs", 
    quick = 0, nThreads = 0, 
    verbose = 0, indent = 0)
cor1(x, use = "all.obs", verbose = 0, indent = 0)

Arguments

x

a numeric vector or a matrix. If y is NULL, x must be a matrix.

y

a numeric vector or a matrix. If not given, correlations of columns of x will be calculated.

use

a character string specifying the handling of missing data. The fast calculations currently support "all.obs" and "pairwise.complete.obs"; for other options, see R's standard correlation function cor.

method

a character string specifying the method to be used. Fast calculations are currently available only for "pearson".

quick

real number between 0 and 1 that controls the precision of handling of missing data in the calculation of correlations. See details.

cosine

logical: calculate cosine correlation? Only valid for method="pearson". Cosine correlation is similar to Pearson correlation but the mean subtraction is not performed. The result is the cosine of the angle(s) between (the columns of) x and y.

cosineX

logical: use the cosine calculation for x? This setting does not affect y and can be used to give a hybrid cosine-standard correlation.

cosineY

logical: use the cosine calculation for y? This setting does not affect x and can be used to give a hybrid cosine-standard correlation.

drop

logical: should the result be turned into a vector if it is effectively one-dimensional?

nThreads

non-negative integer specifying the number of parallel threads to be used by certain parts of correlation calculations. This option only has an effect on systems on which a POSIX thread library is available (which currently includes Linux and Mac OSX, but

verbose

Controls the level of verbosity. Values above zero will cause a small amount of diagnostic messages to be printed.

indent

Indentation of printed diagnostic messages. Each unit above zero adds two spaces.
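The cosine, cosineX and cosineY arguments described above can be illustrated with a minimal base R sketch. The helper cor_cosine_sketch is hypothetical and is not the package's implementation: it handles no missing data and simply skips the mean subtraction for whichever input is flagged as cosine.

```r
# Hypothetical helper (not part of the package): correlation of the columns of
# x with the columns of y, optionally skipping mean subtraction per input.
# Assumes no missing data and no all-zero columns.
cor_cosine_sketch <- function(x, y = NULL, cosineX = TRUE, cosineY = TRUE) {
  x <- as.matrix(x)
  y <- if (is.null(y)) x else as.matrix(y)
  # Standard (non-cosine) treatment: subtract the column means first
  if (!cosineX) x <- sweep(x, 2, colMeans(x))
  if (!cosineY) y <- sweep(y, 2, colMeans(y))
  # Scale every column to unit Euclidean length
  xs <- sweep(x, 2, sqrt(colSums(x^2)), "/")
  ys <- sweep(y, 2, sqrt(colSums(y^2)), "/")
  # Inner products of unit vectors are the cosines of the angles between them
  crossprod(xs, ys)
}

set.seed(1)
m <- matrix(rnorm(20 * 3), 20, 3)

cosine_result  <- cor_cosine_sketch(m)                      # pure cosine
pearson_result <- cor_cosine_sketch(m, cosineX = FALSE,
                                    cosineY = FALSE)        # mean-centered
hybrid_result  <- cor_cosine_sketch(m, m, cosineX = TRUE,
                                    cosineY = FALSE)        # hybrid

# With mean subtraction on both sides the sketch reduces to Pearson correlation
all.equal(pearson_result, stats::cor(m), check.attributes = FALSE)
```

In this sketch the hybrid case leaves x uncentered and centers only y, which is one way to read the "hybrid cosine-standard correlation" mentioned for cosineX and cosineY.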

Value

The matrix of the Pearson correlations of the columns of x with columns of y if y is given, and the correlations of the columns of x if y is not given.

Details

The fast calculations are currently implemented only for method="pearson" with use set to either "all.obs" or "pairwise.complete.obs". The corFast function is a wrapper that calls the function cor. If the combination of method and use is implemented by the fast calculations, the fast code is executed; otherwise, R's standard cor function is executed.
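The rule in the preceding paragraph can be written down as a small predicate. This is only a sketch of the stated rule, not the package's dispatch code: the helper name uses_fast_path is hypothetical, and the assumption that abbreviated values of use (such as "p") are expanded by partial matching is modeled on how R's standard cor treats that argument.

```r
# Hypothetical helper: TRUE when a (method, use) combination is covered by the
# fast calculations as described above, FALSE when R's standard cor would be used.
uses_fast_path <- function(method = "pearson", use = "all.obs") {
  # The values accepted by the `use` argument of R's standard cor
  use_options <- c("all.obs", "complete.obs", "pairwise.complete.obs",
                   "everything", "na.or.complete")
  # Expand abbreviations such as "p" (assumption: partial matching is allowed)
  use_full <- use_options[pmatch(use, use_options)]
  identical(method, "pearson") &&
    !is.na(use_full) &&
    use_full %in% c("all.obs", "pairwise.complete.obs")
}

uses_fast_path("pearson", "p")             # fast path
uses_fast_path("spearman", "p")            # falls back to the standard cor
uses_fast_path("pearson", "complete.obs")  # falls back to the standard cor
```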

The argument quick specifies the precision of handling of missing data. Zero will cause all calculations to be executed precisely, which may be significantly slower than calculations without missing data. Progressively higher values will speed up the calculations but introduce progressively larger errors.

Without missing data, all column means and variances can be pre-calculated before the covariances are calculated. When missing data are present, exact calculations require the column means and variances to be calculated for each covariance. The approximate calculation uses the pre-calculated mean and variance and simply ignores missing data in the covariance calculation. If the number of missing data is high, the pre-calculated means and variances may be very different from the actual ones, thus potentially introducing large errors.

The quick value times the number of rows specifies the maximum difference in the number of missing entries for mean and variance calculations on the one hand and covariance on the other hand that will be tolerated before a recalculation is triggered. The hope is that if only a few missing data are treated approximately, the error introduced will be small but the potential speedup can be significant.
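The trade-off controlled by quick can be sketched in base R for a single pair of columns. The function pair_cor_sketch is hypothetical, and the exact form of the trigger rule (comparing the larger of the two per-column observation counts with the jointly observed count) is an assumption made for illustration; the package's internal code may differ.

```r
# Hypothetical sketch of the exact vs. approximate treatment of missing data
# for one pair of vectors a and b of equal length.
pair_cor_sketch <- function(a, b, quick = 0) {
  ok_a <- !is.na(a)
  ok_b <- !is.na(b)
  ok   <- ok_a & ok_b   # rows that can enter the covariance
  # Entries used for the per-column mean/variance but not for the covariance
  # (assumed form of the "difference in the number of missing entries")
  extra <- max(sum(ok_a), sum(ok_b)) - sum(ok)
  if (extra > quick * length(a)) {
    # Exact: recalculate means and variances on the jointly observed rows
    return(stats::cor(a[ok], b[ok]))
  }
  # Approximate: reuse means and sums of squares calculated per column
  za <- a - mean(a, na.rm = TRUE)
  zb <- b - mean(b, na.rm = TRUE)
  sum(za[ok] * zb[ok]) /
    sqrt(sum(za^2, na.rm = TRUE) * sum(zb^2, na.rm = TRUE))
}

set.seed(2)
a <- rnorm(100)
b <- a + rnorm(100)
a_na <- a
a_na[c(5, 50)] <- NA   # two missing entries in one column

exact  <- pair_cor_sketch(a_na, b, quick = 0)  # recalculation is triggered
approx <- pair_cor_sketch(a_na, b, quick = 1)  # recalculation is never triggered
c(exact = exact, approx = approx)
```

With quick = 0 the sketch agrees with stats::cor(a_na, b, use = "pairwise.complete.obs"); with quick = 1 it never recalculates, so the result depends on how far the per-column mean and variance of b are from their values on the jointly observed rows.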

References

Peter Langfelder, Steve Horvath (2012) Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software, 46(11), 1-17. http://www.jstatsoft.org/v46/i11/

Examples


## Test the speedup compared to standard function cor

library(WGCNA)  # provides the fast cor() compared against stats::cor below

# Generate a random matrix with 100 rows and 500 columns

set.seed(10)
nrow = 100;
ncol = 500;
data = matrix(rnorm(nrow*ncol), nrow, ncol);

## First test: no missing data

system.time( {corStd = stats::cor(data)} );

system.time( {corFast = cor(data)} );

all.equal(corStd, corFast)

# Here R's standard correlation performs very well.

# We now add a few missing entries.

data[sample(nrow, 10), 1] = NA;

# And test the correlations again...

system.time( {corStd = stats::cor(data, use ='p')} );

system.time( {corFast = cor(data, use = 'p')} );

all.equal(corStd, corFast)

# Here R's standard correlation slows down considerably
# while the fast cor still retains its speed. Choosing
# a higher ncol above will make the difference more pronounced.
