trendVar: Get the biological variability

Description

Compute the biological and technical components of the gene-specific variance in single-cell RNA-seq data.

Usage

"trendVar"(x, trend=c("poly", "loess"), df=5, span=0.3, design=NULL)
"trendVar"(x, ..., assay="exprs", use.spikes=TRUE)

Arguments

A numeric matrix of normalized expression values, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SCESet object that contains such values.

trend

A string indicating whether the trend should be polynomial or loess-based.

An integer scalar specifying the degrees of freedom for polynomial fitting.

span

An numeric scalar specifying the span for loess fitting.

design

A numeric matrix describing the systematic factors contributing to expression in each cell.

...

Additional arguments to pass to trendVar,matrix,list-method.

assay

A string specifying which assay values to use, e.g., counts or exprs.

use.spikes

A logical scalar specifying whether the trend should be fitted to variances for spike-in or endogenous genes.

Value

A named list is returned, containing:

mean:: A numeric vector of mean log-CPMs for all spike-in genes.
var:: A numeric vector of the variances of log-CPMs for all spike-in genes.
trend:: A function that returns the fitted value of the trend at any mean log-CPM.
design:: A numeric matrix, containing the design matrix that was used.

Additional notes

By default, a polynomial trend with df degrees of freedom is fitted to the spike-in variances as it is more precise than the loess curve. Note that this method is rather dependent on the quality of the spike-ins -- the fit will obviously be poor if the coverage of all spike-ins is low. In some data sets, a loess curve may yield better results, though this may require some fiddling with the span. When spike-ins are not available, trendVar can also be applied directly to the counts for endogenous genes by setting use.spikes=FALSE (or by manually supplying a matrix of normalized expression for endogenous genes, for trendVar,matrix-method). This assumes that most genes exhibit technical variation and little biological variation, e.g., in a homogeneous population. A loess curve is recommended here as it is more robust to a subset of genes with very large or very small variances. If use.spikes=NA, every row will be used for trend fitting, regardless of whether it corresponds to a spike-in transcript or endogenous gene.

Details

The strategy is to fit an abundance-dependent trend to the variance of the log-normalized expression for the spike-in genes, using trendVar. For SCESet objects, these expression values can be computed by normalize after setting the size factors, e.g., with computeSpikeFactors. Log-transformed values are used as these tend to be more robust to genes with strong expression in only one or two outlier cells.

The mean and variance of the log-CPMs is calculated for each spike-in gene. A polynomial or loess trend is then fitted to the variance against the mean for all genes. This represents technical variability due to sequencing, drop-outs during capture, etc. Variance decomposition to biological and technical components for endogenous genes can then be performed

The design matrix can be set if there are factors that should be blocked, e.g., batch effects, known (and uninteresting) clusters. Otherwise, it will default to an all-ones matrix, effectively treating all cells as part of the same group.

Examples

Run this code

set.seed(100)

nspikes <- ncells <- 100
spike.means <- 2^runif(nspikes, 3, 8)
spike.disp <- 100/spike.means + 0.5
spike.data <- matrix(rnbinom(nspikes*ncells, mu=spike.means, size=1/spike.disp), ncol=ncells)

ngenes <- 10000
cell.means <- 2^runif(ngenes, 2, 10)
cell.disp <- 100/cell.means + 0.5
cell.data <- matrix(rnbinom(ngenes*ncells, mu=cell.means, size=1/cell.disp), ncol=ncells)

combined <- rbind(cell.data, spike.data)
colnames(combined) <- seq_len(ncells)
rownames(combined) <- seq_len(nrow(combined))
y <- newSCESet(countData=combined)
isSpike(y) <- rep(c(FALSE, TRUE), c(ngenes, nspikes))

# Normalizing.
y <- computeSpikeFactors(y) # or computeSumFactors.
y <- normalize(y)

# Fitting a trend to the spike-ins.
fit <- trendVar(y)
plot(fit$mean, fit$var)
x <- sort(fit$mean)
lines(x, fit$trend(x), col="red", lwd=2)

# Fitting a trend to the endogenous genes. 
fit <- trendVar(y, use.spikes=FALSE)
plot(fit$mean, fit$var)
x <- sort(fit$mean)
lines(x, fit$trend(x), col="red", lwd=2)

Run the code above in your browser using DataLab