autoEstCont: Automatically calculate the contamination fraction

Description

This method estimates the background contamination from genes that are both highly expressed in the soup and marker genes for some population. Marker genes are identified using the tf-idf method (see quickMarkers). The contamination fraction is then calculated at the cluster level for each of these genes, and clusters are aggressively pruned to remove those that give implausible estimates.

Usage

autoEstCont(
  sc,
  topMarkers = NULL,
  tfidfMin = 1,
  soupQuantile = 0.9,
  maxMarkers = 100,
  contaminationRange = c(0.01, 0.8),
  rhoMaxFDR = 0.2,
  priorRho = 0.05,
  priorRhoStdDev = 0.1,
  doPlot = TRUE,
  forceAccept = FALSE,
  verbose = TRUE
)

Value

A modified SoupChannel object in which the global contamination rate has been set. Information about the estimation is also stored in the slot fit.

Arguments

sc: The SoupChannel object.
topMarkers: A data.frame giving marker genes. Must be sorted by decreasing specificity of marker and include a column 'gene' that contains the gene name. If set to NULL, markers are estimated using quickMarkers.
tfidfMin: Minimum value of tfidf to accept for a marker gene.
soupQuantile: Only use genes that are at or above this expression quantile in the soup. This prevents inaccurate estimates due to using genes with poorly constrained contribution to the background.
maxMarkers: If we have heaps of good markers, keep only the best maxMarkers of them.
contaminationRange: Vector of length 2 that constrains the contamination fraction to lie within this range. Must be between 0 and 1. The high end of this range is passed to estimateNonExpressingCells as maximumContamination.
rhoMaxFDR: False discovery rate passed to estimateNonExpressingCells, to test if rho is less than maximumContamination.
priorRho: Mode of gamma distribution prior on contamination fraction.
priorRhoStdDev: Standard deviation of gamma distribution prior on contamination fraction.
doPlot: Create a plot showing the density of estimates?
forceAccept: Passed to setContaminationFraction. Should we allow very high contamination fractions to be used?
verbose: Be verbose?

Details

The set of marker genes is first filtered to include only those with a tf-idf value greater than tfidfMin. A higher tf-idf value implies a more specific marker. Specifically, a cut-off t implies that a marker gene has the property that geneFreqGlobal < exp(-t/geneFreqInClust). See quickMarkers. It may be necessary to decrease tfidfMin for data sets with few good markers.
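The relationship between the cut-off and the two frequencies can be checked numerically. The following is a minimal base R sketch, assuming tf-idf is geneFreqInClust * log(1/geneFreqGlobal), which is the form implied by the inequality above. The helper name tfidfScore and the frequencies are made up for illustration and are not part of the package.

```r
# Hypothetical helper: tf-idf under the assumed definition
# tf-idf = geneFreqInClust * log(1 / geneFreqGlobal)
tfidfScore <- function(geneFreqInClust, geneFreqGlobal) {
  geneFreqInClust * log(1 / geneFreqGlobal)
}

# Made-up frequencies for three candidate marker genes
geneFreqInClust <- c(0.9, 0.6, 0.3)   # fraction of cells in the cluster expressing the gene
geneFreqGlobal  <- c(0.05, 0.20, 0.25) # fraction of all cells expressing the gene
tfidfMin <- 1

score <- tfidfScore(geneFreqInClust, geneFreqGlobal)

# Passing the cut-off t is equivalent to geneFreqGlobal < exp(-t / geneFreqInClust)
passesCutoff <- score > tfidfMin
passesBound  <- geneFreqGlobal < exp(-tfidfMin / geneFreqInClust)

# Lowering the cut-off admits less specific markers
passesLooser <- score > 0.8
```

In this toy input only the first gene passes at tfidfMin = 1, while the second also passes once the cut-off is lowered to 0.8.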

The marker set is then filtered down to include only the genes that are highly expressed in the soup, as controlled by the soupQuantile parameter. Genes highly expressed in the soup provide a more precise estimate of the contamination fraction.
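The quantile filter can be illustrated with a small base R sketch. It assumes the soup profile is summarised as a named vector of expression fractions; the gene names, values, and variable names are hypothetical, and the package's internal representation may differ.

```r
# Made-up soup expression fractions for ten hypothetical genes
soupFrac <- c(g1 = 0.30, g2 = 0.20, g3 = 0.15, g4 = 0.10, g5 = 0.08,
              g6 = 0.07, g7 = 0.05, g8 = 0.03, g9 = 0.015, g10 = 0.005)

# Hypothetical candidate markers that survived the tf-idf filter
markers <- c("g1", "g3", "g9")

soupQuantile <- 0.9

# Expression level at the requested quantile of the soup profile
cutoff <- quantile(soupFrac, soupQuantile)

# Genes at or above the cut-off count as highly expressed in the soup
highSoup <- names(soupFrac)[soupFrac >= cutoff]

# Keep only the markers that are also highly expressed in the soup
keptMarkers <- intersect(markers, highSoup)
```

With the default soupQuantile of 0.9 only the most abundant soup gene survives in this toy input, which shows why a high quantile can leave few usable markers in small data sets.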

The pruning of implausible clusters is based on a call to estimateNonExpressingCells. The parameters maximumContamination=max(contaminationRange) and rhoMaxFDR are passed to this function. The defaults set here are calibrated to aggressively prune anything that has even the weakest of evidence that it is genuinely expressed.

For each cluster/gene pair the posterior distribution of the contamination fraction is calculated (based on a gamma prior, controlled by priorRho and priorRhoStdDev). These posterior distributions are aggregated to produce a final estimate of the contamination fraction. The logic behind this is that estimates from cluster/gene pairs that truly measure the contamination fraction will concentrate around the true value, while erroneous estimates will be spread out across the range (0,1) without a 'preferred value'. The most probable value of the aggregated distribution is then taken as the final global contamination fraction.
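The aggregation step can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: it assumes observed counts are Poisson with mean rho times the counts expected from the soup, so that the gamma prior updates conjugately, and it aggregates by averaging the posterior densities on a grid. The helper gammaFromModeSd, the variable names, and the counts are hypothetical.

```r
# Hypothetical helper: gamma shape/rate from a mode and standard deviation.
# Uses mode = (shape - 1) / rate and sd = sqrt(shape) / rate.
gammaFromModeSd <- function(mode, sd) {
  rate  <- (mode + sqrt(mode^2 + 4 * sd^2)) / (2 * sd^2)
  shape <- 1 + mode * rate
  list(shape = shape, rate = rate)
}

# Prior matching the defaults priorRho = 0.05, priorRhoStdDev = 0.1
prior <- gammaFromModeSd(mode = 0.05, sd = 0.1)

# Made-up cluster/gene pairs: observed counts, and counts expected if rho were 1.
# The last pair mimics a gene that is genuinely expressed in its cluster.
obsCnt <- c(10, 12, 9, 11, 60)
expCnt <- c(100, 110, 95, 105, 100)

rhoGrid <- seq(0.001, 0.999, by = 0.001)

# Conjugate gamma-Poisson update per pair, then average the densities (assumed rule)
post <- sapply(rhoGrid, function(rho) {
  mean(dgamma(rho, shape = prior$shape + obsCnt, rate = prior$rate + expCnt))
})

# Restrict to the allowed range and take the most probable value
contaminationRange <- c(0.01, 0.8)
inRange <- rhoGrid >= contaminationRange[1] & rhoGrid <= contaminationRange[2]
rhoEst <- rhoGrid[inRange][which.max(post[inRange])]
```

The four consistent pairs reinforce each other and set the peak of the averaged density, while the outlying pair contributes only a low, separate bump, which is the behaviour the paragraph above describes.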

Examples

library(SoupX)
# scToy is the toy SoupChannel data set shipped with SoupX
# Use less specific markers
scToy = autoEstCont(scToy, tfidfMin = 0.8)
# Allow large contamination fractions to be allocated
scToy = autoEstCont(scToy, forceAccept = TRUE)
# Be quiet
scToy = autoEstCont(scToy, verbose = FALSE, doPlot = FALSE)