correlateNull(ncells, iters=1e6, design=NULL) ## S3 method for class 'matrix':
correlatePairs(x, null.dist=NULL, design=NULL, BPPARAM=bpparam(),
use.names=TRUE, tol=1e-8)
## S3 method for class 'SCESet':
correlatePairs(x, ..., assay="exprs", get.spikes=FALSE)
bplapply
for parallel processing.exprs
should be used in the output.
Alternatively, a character vector containing the names to use.correlatePairs,matrix-method
.correlateNull
, a numeric vector of length iters
is returned containing the sorted correlations under the null hypothesis of no correlations.For correlatePairs
, a dataframe is returned with one row per gene pair and the following fields:
[object Object],[object Object],[object Object]
Rows are sorted by increasing p.value
and, if tied, decreasing absolute size of rho
.
x
with some care.
Using a top set of 100-200 highly variable genes (HVGs) is recommended.
This will focus on genes contributing to cell-to-cell heterogeneity (and thus more likely to be involved in driving substructure).
There is no need to account for HVG pre-selection in multiple testing, because rank correlations are unaffected by the variance.
For more genes, set BPPARAM
to use more workers and reduce computational time.If the experiment has known (and uninteresting) factors of variation, e.g., batch effects, cell cycle phase, these can be included in the design
.
Correlations between genes will then be computed using the residual effects of a linear model fitted to the normalized expression values with design
.
Similarly, the null distribution of rho values will be constructed with ncells
set as the residual degrees of freedom in design
.
This procedure ensures that the uninteresting factors do not drive strong correlations between genes.
Note that tied counts may not lead to tied effects, especially for design matrices with real-valued covariates.
As a result, large correlations may be obtained for genes with few non-zero counts.
Some protection is provided by focusing on HVGs, because genes dominated by zeroes will usually have low variance.
For example, counts of zero will have the same normalized log-expression, even if the library sizes are different.
This is because the added prior count is scaled by the library size in cpm.default
, such that the effect of library size cancels out.
Thus, all zeroes will have tied ranks (with numerical imprecision handled by tol
) and will not inflate the correlations.
For non-zero counts, correlations may be driven by library size differences between cells.
This is, perhaps, less problematic, as two genes with the same counts in both small and large libraries provide evidence for association.
correlatePairs
function is to identify significant correlations between all pairs of genes in x
.
This allows prioritization of genes that are driving systematic substructure in the data set.
By definition, such genes should be correlated as they are behaving in the same manner across cells.
In contrast, genes driven by random noise should not exhibit any correlations with other genes.An approximation of Spearman's rho is used to quantify correlations robustly based on ranks. To identify correlated gene pairs, the significance of non-zero correlations is assessed using a permutation test. The null hypothesis is that the (ranking of) normalized expression across cells should be independent between genes. This allows us to construct a null distribution by randomizing (ranked) expression within each gene.
The correlateNull
function constructs an empirical null distribution for rho computed with ncells
cells.
This is done by shuffling the ranks, calculating the rho and repeating until iters
values are obtained.
The p-value for each gene pair is defined as the tail probability of this distribution at the observed correlation (with some adjustment to avoid zero p-values).
Correction for multiple testing is done using the BH method.
For correlatePairs
, a pre-computed empirical distribution can be supplied as null.dist
if available.
Otherwise, it will be automatically constructed via correlateNull
with ncells
set to the number of columns in exprs
.
For correlatePairs,SCESet-method
, correlations should be computed for normalized expression values in the specified assay
.
By default, rows corresponding to spike-in transcripts are removed with get.spikes=FALSE
.
This avoids picking up strong technical correlations between pairs of spike-in transcripts.
bpparam
,
cor
set.seed(0)
ncells <- 100
null.dist <- correlateNull(ncells, iters=100000)
exprs <- matrix(rpois(ncells*100, lambda=10), ncol=ncells)
out <- correlatePairs(exprs, null.dist=null.dist)
hist(out$p.value)
Run the code above in your browser using DataLab