When analyzing many subjects (i.e. 100,000 or more) with many variables (i.e. 1000 or more), core R can take a long time and sometimes exceeds memory limits (e.g. with 600K subjects and 6K variables). bigCor runs (in parallel if multiple cores are available) by breaking the variables into subsets (of size=size), finding all subset correlations, and then stitching the resulting matrices into one large matrix. It gives noticeable improvements in speed compared to cor.
bigCor(x, size = NULL, use = "pairwise", cor = "pearson", correct = .5)
Returns the correlation matrix.
x: A data set of numeric variables
size: What should the size of the subsets be? Defaults to NCOL(x)/20
use: The standard correlation option. "pairwise" allows for missing data
cor: Defaults to Pearson correlations; alternatives are polychoric and spearman
correct: Correction for continuity for polychoric correlations (see polychoric)
William Revelle
The data are divided into subsets of size=size. Correlations are then found for each subset and pairs of subsets.
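The subset-and-stitch idea described above can be illustrated with a minimal base R sketch. This is not the psych source code: the function name block_cor and its arguments are hypothetical, the loop runs sequentially rather than in parallel, and how bigCor itself handles a final, smaller subset is not stated here, so the splitting rule below is an assumption.

```r
# Block-wise correlation sketch (base R only; not the psych implementation).
# block_cor and its arguments are hypothetical names used for illustration.
block_cor <- function(x, size = 10, use = "pairwise.complete.obs") {
  x <- as.matrix(x)
  nvar <- ncol(x)
  # Split the column indices into consecutive subsets of at most `size` columns.
  blocks <- split(seq_len(nvar), ceiling(seq_len(nvar) / size))
  R <- matrix(NA_real_, nvar, nvar,
              dimnames = list(colnames(x), colnames(x)))
  for (i in seq_along(blocks)) {
    for (j in seq_len(i)) {
      bi <- blocks[[i]]
      bj <- blocks[[j]]
      # Correlate one subset with another (or with itself when i == j).
      r_ij <- cor(x[, bi, drop = FALSE], x[, bj, drop = FALSE], use = use)
      # Stitch the block and its transpose into the full matrix.
      R[bi, bj] <- r_ij
      R[bj, bi] <- t(r_ij)
    }
  }
  R
}

# Simulated data with some missing values (not the SAPA data).
set.seed(42)
x <- matrix(rnorm(200 * 23), ncol = 23)
x[sample(length(x), 100)] <- NA

R_block <- block_cor(x, size = 5)
R_full  <- cor(x, use = "pairwise.complete.obs")
all.equal(R_block, R_full)
```

Because each cell depends only on the two variables involved, the stitched matrix should agree with the one-shot pairwise result; the blocks are independent of one another, which is what makes them candidates for parallel evaluation.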
Time is roughly linear with the number of cases and increases with the square of the number of variables. The benefit of more cores is noticeable. It seems that with 4 cores, we should choose a size that splits the variables into 8 or 12 sets; otherwise we don't actually use all cores efficiently.
There is some overhead in using multiple cores. So for smaller problems (e.g. the 4,000 cases of the 145 items of the psychTools::spi data set), the timings are roughly .14 seconds for bigCor (default size) and .10 for normal cor. For small problems, this actually gets worse as we use more cores. The crossover point seems to be at roughly 5K subjects. (These timings were updated for the M1 Max chip, an increase of 4x in speed; they had been .44 and .36.)
The basic loop runs over the subsets. When the size is an integer subset of the number of variables and is a multiple of the number of cores, the multiple cores will be used more. Notice the benefit of 660/80 versus 660/100. But this breaks down if we try 660/165. Further notice the benefit when using a smaller subset (55), which led to the 4 cores being used more.
The following timings are included to help users tinker with parameters:
Timings (in seconds) for various problems with 645K subjects on an 8 core MacBook Pro with a 2.4 GHz Intel Core i9.
options(mc.cores=4) (Because we have 8 cores, we can work at the same time as we test this.)
First test it with 644,495 subjects and 1/10 of the number of possible variables. Then test it for somewhat fewer variables.
Variables | size | 2 cores | 4 cores | compared to normal cor function |
660 | 100 | 430 | 434 | 430 |
660 | 80 | 600 | 348 | notice the improvement with 8ths |
660 | 165 | 666 | (Stitching seems to have been very slow) | |
660 | 55 | 303 | Even better if we break it into 12ths! | |
500 | 100 | 332 | | 322 secs |
480 | 120 | 408 | 365 | 315 Better to change the size |
480 | 60 | 358 | | |
We also test it with fewer subjects. Time is roughly linear with number of subjects.
Further comparisons with fewer subjects (100K):
Variables | size | 2 cores | 4 cores | compared to normal cor function |
480 | 60 | 57 | 31 | 47 with normal cor. Note the effect of n subjects! |
200 | 50 | 19.9 | 13.6 | 27.13 |
100 | 25 | 4.6 | 3.5 | |
One last comparison, 10,000 subjects, showing the effect of getting the proper size value. You can tune on these smaller sets of subjects before trying large problems.
Variables | size | 2 cores | 4 cores | compared to normal cor function |
480 | 120 | 5.2 | 5.1 | 4.51 |
480 | 60 | 2.9 | 2.88 | 4.51 |
480 | 30 | 2.65 | 2.691 | |
480 | 20 | 2.73 | 2.77 | |
480 | 10 | 2.82 | 2.97 | too many splits? |
200 | 50 | 2.18 | 1.39 | 2.47 for normal cor (1.44 with 8 cores, 2.99 with 1 core) |
200 | 25 | 1.2 | 1.17 | 2.47 for normal cor (1.16 with 8 cores, 1.17 with 1 core) |
100 | 25 | .64 | .52 | .56 |
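Tuning the size on a smaller problem, as suggested above, might look like the following sketch. It assumes the psych package is installed and uses simulated random data rather than the SAPA data behind the tables; the particular sizes, the number of subjects, and the choice of 2 cores are arbitrary illustration values, and the timings will differ by machine.

```r
# Sketch: try several `size` values on a modest data set before a large run.
# The data are random numbers; n_subjects, n_vars and sizes are arbitrary.
set.seed(1)
n_subjects <- 2000
n_vars <- 100
x <- as.data.frame(matrix(rnorm(n_subjects * n_vars), ncol = n_vars))

options(mc.cores = 2)   # number of cores to use, set as in the timings above

sizes <- c(50, 25, 10)
if (requireNamespace("psych", quietly = TRUE)) {
  # Elapsed seconds for bigCor at each candidate size.
  elapsed <- sapply(sizes, function(s) {
    system.time(psych::bigCor(x, size = s))[["elapsed"]]
  })
  print(data.frame(size = sizes, seconds = elapsed))
  # Baseline: the normal cor function on the same data.
  print(system.time(cor(x, use = "pairwise"))[["elapsed"]])
}
```

Repeating the loop with a different mc.cores setting gives the per-core columns of a table like the ones shown here.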
Timings updated in 2/23 using a MacBook Pro with an M1 Max chip. With 10,000 subjects and 953 variables, the results suggest that a very small size (e.g. 20) is probably optimal.
Variables | size | 2 cores | 4 cores | 8 cores | compared to normal cor function |
953 | 20 | 7.92 | 4.55 | 2.88 | 11.04 |
953 | 30 | 7.98 | 4.88 | 3.15 | 11.04 |
953 | 40 | 8.22 | 5.14 | 3.63 | 11.16 |
953 | 60 | 8.51 | 5.59 | 3.93 | 11.16 |
953 | 80 | 8.31 | 5.59 | 4.14 | 11.16 |
953 | 120 | 8.33 | 6.22 | 4.75 | 11.16 |
Examples of large data sets with massively missing data are taken from the SAPA project, e.g.,
William Revelle, Elizabeth M. Dworak, and David M. Condon (2021) Exploring the persome: The power of the item in understanding personality structure. Personality and Individual Differences, 169. doi:10.1016/j.paid.2020.109905
David Condon (2018) The SAPA Personality Inventory: an empirically-derived, hierarchically-organized self-report personality assessment model. PsyArXiv. doi:10.31234/osf.io/sc4p9
See also pairwiseCountBig, which will do the same, but find the count of observations per cell.
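To make "count of observations per cell" concrete, here is a small base R sketch. It is not how pairwiseCountBig is implemented and it does not call that function; it only shows the quantity being counted, using a tiny simulated matrix with hypothetical column names.

```r
# Count, for every pair of variables, the cases observed on both.
# Base R illustration only; the data and column names are made up.
set.seed(7)
x <- matrix(rnorm(50 * 4), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))
x[1:10, "a"] <- NA    # first 10 cases missing on a
x[6:15, "b"] <- NA    # cases 6 to 15 missing on b

observed <- (!is.na(x)) + 0      # 1 where a value is present, 0 where missing
n_pairs <- crossprod(observed)   # cell [i, j]: cases observed on both i and j
n_pairs
```

The diagonal holds the number of non-missing cases per variable, and each off-diagonal cell is the sample size behind the corresponding pairwise correlation.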
library(psych)  # bigCor and the bfi example data
R <- bigCor(bfi, 10)  # subsets of 10 variables
# compare the results with
r.bfi <- cor(bfi, use = "pairwise")
all.equal(R, r.bfi)