calculateQCMetrics: Calculate QC metrics

Description

Calculate QC metrics

Usage

calculateQCMetrics(object, feature_controls = NULL, technical_feature_controls = NULL, biological_feature_controls = NULL, cell_controls = NULL, nmads = 5, pct_feature_controls_threshold = 80)

Arguments

object

an SCESet object containing expression values and experimental information. Must have been appropriately prepared.

feature_controls

a character vector of feature names, or a logical vector, or a numeric vector of indices used to identify feature controls (for example, ERCC spike-in genes, mitochondrial genes, etc). Treated as technical feature controls and overridden if an argument to technical_feature_controls is provided.

technical_feature_controls

a character vector of feature names, or a logical vector, or a numeric vector of indices used to identify technical feature controls (for example, ERCC spike-in genes). Overrides feature_controls if both arguments are provided.

biological_feature_controls

a character vector of feature names, or a logical vector, or a numeric vector of indices used to identify biological feature controls (for example, mitochondrial genes).

cell_controls

a character vector of cell (sample) names, or a logical vector, or a numeric vector of indices used to identify cell controls (for example, blank wells or bulk controls).

nmads

numeric scalar giving the number of median absolute deviations (MADs) used to flag potentially problematic cells based on total_counts (total number of counts for the cell, or library size) and total_features (number of features with non-zero expression). For total_features, cells are flagged for filtering only if total_features is more than nmads MADs below the median. Default value is 5.

pct_feature_controls_threshold

numeric scalar giving a threshold for the percentage of expression values accounted for by feature controls. Used to flag cells that may be filtered out because a high percentage of their expression comes from feature controls. Default value is 80.
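The control arguments accept three interchangeable forms: names, a logical vector, or numeric indices. The base R sketch below shows, for a hypothetical set of feature names, that all three forms can select the same features. The helper as_mask is illustrative only and is not part of the package.

```r
# Hypothetical feature names; the first two stand in for spike-in controls.
feature_names <- c("ERCC-001", "ERCC-002", "GeneA", "GeneB", "GeneC")

by_name    <- c("ERCC-001", "ERCC-002")        # character vector of names
by_logical <- grepl("^ERCC-", feature_names)   # logical vector
by_index   <- 1:2                              # numeric vector of indices

# Illustrative helper (an assumption, not a package function):
# convert any of the three forms to a logical mask over all features.
as_mask <- function(ctrl, all_names) {
  if (is.character(ctrl)) return(all_names %in% ctrl)
  if (is.logical(ctrl)) return(ctrl)
  seq_along(all_names) %in% ctrl
}

masks <- lapply(list(by_name, by_logical, by_index), as_mask,
                all_names = feature_names)

# TRUE when every form selects exactly the same features.
all_same <- all(vapply(masks, identical, logical(1), masks[[1]]))
```

Any of by_name, by_logical or by_index could then be supplied as feature_controls, or wrapped in a named list such as list(ERCC = by_index) for technical_feature_controls.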

Value

an SCESet object

Details

Calculate useful quality control metrics to help with pre-processing of data and identification of potentially problematic features and cells.

The following QC metrics are computed:

total_counts:

Total number of counts for the cell (aka "library size")

log10_total_counts:

Total counts on the log10-scale

total_features:

The number of endogenous features (i.e. not control features) for the cell that have expression above the detection limit (default detection limit is zero)

filter_on_depth:

Would this cell be filtered out based on its log10-depth being (by default) more than 5 median absolute deviations from the median log10-depth for the dataset?

filter_on_coverage:

Would this cell be filtered out based on its coverage being (by default) more than 5 median absolute deviations from the median coverage for the dataset?

filter_on_pct_counts_feature_controls:

Should the cell be filtered out on the basis of having a high percentage of counts assigned to control features? Default threshold is 80 percent (i.e. cells with more than 80 percent of counts assigned to feature controls are flagged).

counts_feature_controls:

Total number of counts for the cell that come from (one or more sets of user-defined) control features. Defaults to zero if no control features are indicated. If more than one set of feature controls is defined (for example, ERCC and MT genes are defined as controls), then this metric is produced for each set, plus the union of all sets (so here, we get columns counts_feature_controls_ERCC, counts_feature_controls_MT and counts_feature_controls).

log10_counts_feature_controls:

Just as above, the total number of counts from feature controls, but on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid negative infinite values) if no feature controls are indicated.

pct_counts_feature_controls:

Just as for the counts described above, but expressed as a percentage of the total counts. Defined for all control sets and their union, just like the raw counts. Defaults to zero if no feature controls are defined.

filter_on_pct_counts_feature_controls:

Would this cell be filtered out on the basis that the percentage of counts from feature controls is higher than a defined threshold (default is 80%)? Just as with counts_feature_controls, this is defined for all control sets and their union.

pct_counts_top_50_features:

What percentage of the total counts is accounted for by the 50 highest-count features? Also computed for the top 100 and top 200 features, with the obvious changes to the column names.

pct_dropout:

Percentage of features that are not "detectably expressed", i.e. have expression below the lowerDetectionLimit threshold.

counts_endogenous_features:

Total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to the total counts (depth) if no control features are indicated.

log10_counts_endogenous_features:

Total number of counts from endogenous features on the log10-scale. Defaults to all counts if no control features are indicated.

n_detected_feature_controls:

Number of defined feature controls that have expression greater than the threshold defined in the object (that is, they are "detectably expressed"; see object@lowerDetectionLimit to check the threshold). As with other metrics for feature controls, defined for all sets of feature controls (set names appended as above) and their union. So we might commonly get columns n_detected_feature_controls_ERCC, n_detected_feature_controls_MT and n_detected_feature_controls (ERCC and MT genes detected).

is_cell_control:

Has the cell been defined as a cell control? If more than one set of cell controls are defined (for example, blanks and bulk libraries are defined as cell controls), then this metric is produced for all sets, plus the union of all sets (so we could typically get columns is_cell_control_Blank, is_cell_control_Bulk, and is_cell_control, the latter including both blanks and bulks as cell controls).
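To make the cell-level definitions above concrete, here is a minimal base R sketch on a made-up count matrix. It is not the package's implementation: the variable names only mirror the metric names, the helper functions are hypothetical, and the use of stats::mad() (which scales by 1.4826 by default) is an assumption about how the MAD is computed.

```r
# Toy count matrix: rows are features, columns are cells (values are made up).
counts <- matrix(
  c(10, 0, 5,  85,
     0, 0, 0, 100,
    20, 5, 5,  70),
  nrow = 4,
  dimnames = list(c("GeneA", "GeneB", "GeneC", "ERCC-001"),
                  c("cell1", "cell2", "cell3"))
)
is_control <- grepl("^ERCC-", rownames(counts))

# Library size per cell.
total_counts <- colSums(counts)

# Endogenous features above a detection limit of zero.
total_features <- colSums(counts[!is_control, , drop = FALSE] > 0)

# Counts from control features, as a raw total and as a percentage.
counts_feature_controls <- colSums(counts[is_control, , drop = FALSE])
pct_counts_feature_controls <- 100 * counts_feature_controls / total_counts
filter_on_pct_counts_feature_controls <- pct_counts_feature_controls > 80

# Percentage of counts taken by the n highest-count features
# (n = 2 here because the toy matrix has only four features).
pct_top <- function(x, n) {
  top <- sort(x, decreasing = TRUE)[seq_len(min(n, length(x)))]
  100 * sum(top) / sum(x)
}
pct_counts_top_2_features <- apply(counts, 2, pct_top, n = 2)

# Hypothetical helper: flag values more than nmads MADs below the median.
flag_low <- function(x, nmads = 5) x < median(x) - nmads * mad(x)
flag_low(c(100, 102, 98, 101, 99, 10))  # only the last value is flagged
```

With real data the same calculations would run over thousands of features, and the top-50, top-100 and top-200 percentages would replace the top-2 value used here.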

These cell-level QC metrics are added as columns to the "phenoData" slot of the SCESet object so that they can be inspected and are readily available for other functions to use. Furthermore, wherever "counts" appears in the above metrics, the same metrics will also be computed for "exprs", "tpm" and "fpkm" values (if TPM and FPKM values are present in the SCESet object), with the appropriate term replacing "counts" in the name.

The following feature-level QC metrics are also computed:

mean_exprs:

The mean expression level of the gene/feature.

exprs_rank:

The rank of the feature's mean expression level in the cell.

n_cells_exprs:

The number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).

total_feature_counts:

The total number of counts assigned to that feature across all cells.

log10_total_feature_counts:

Total feature counts on the log10-scale.

pct_total_counts:

The percentage of all counts that are accounted for by the counts assigned to the feature.

pct_dropout:

The percentage of all cells that have no detectable expression (i.e. is_exprs(object) is FALSE) for the feature.

is_feature_control:

Is the feature a control feature? Default is `FALSE` unless control features are defined by the user. If more than one feature control set is defined (as above), then a column of this type is produced for each control set (e.g. here, is_feature_control_ERCC and is_feature_control_MT) as well as the column named is_feature_control, which indicates if the feature belongs to any of the control sets.

These feature-level QC metrics are added as columns to the "featureData" slot of the SCESet object so that they can be inspected and are readily available for other functions to use. As with the cell-level metrics, wherever "counts" appears in the above, the same metrics will also be computed for "exprs", "tpm" and "fpkm" values (if TPM and FPKM values are present in the SCESet object), with the appropriate term replacing "counts" in the name.
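The feature-level metrics can be sketched in the same way. The base R example below uses a made-up count matrix and computes everything from raw counts with a detection limit of zero. It is an illustration under those assumptions rather than the package's own code, and the column names only loosely follow the metric names above.

```r
# Toy count matrix: rows are features, columns are cells (values are made up).
counts <- matrix(
  c(10, 0, 5,  85,
     0, 0, 0, 100,
    20, 5, 5,  70),
  nrow = 4,
  dimnames = list(c("GeneA", "GeneB", "GeneC", "ERCC-001"),
                  c("cell1", "cell2", "cell3"))
)

# Mean level of each feature across cells (computed on counts here).
mean_counts <- rowMeans(counts)

# Number of cells in which each feature is above the detection limit.
n_cells_counts <- rowSums(counts > 0)

# Total counts per feature, and their share of all counts.
total_feature_counts <- rowSums(counts)
pct_total_counts <- 100 * total_feature_counts / sum(counts)

# Percentage of cells with no detectable expression of the feature.
pct_dropout <- 100 * (1 - n_cells_counts / ncol(counts))

feature_qc <- data.frame(
  mean_counts,
  rank = rank(mean_counts),
  n_cells_counts,
  total_feature_counts,
  pct_total_counts,
  pct_dropout,
  is_feature_control = grepl("^ERCC-", rownames(counts))
)
```

In this toy example the single control feature accounts for most of the counts and is detected in every cell, so its dropout percentage is zero.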

Examples

data("sc_example_counts")
data("sc_example_cell_info")
pd <- new("AnnotatedDataFrame", data=sc_example_cell_info)
rownames(pd) <- pd$Cell
example_sceset <- newSCESet(countData=sc_example_counts, phenoData=pd)
example_sceset <- calculateQCMetrics(example_sceset)

## with a set of feature controls defined
example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:40)

## with both technical and biological feature controls
example_sceset <- calculateQCMetrics(example_sceset,
    technical_feature_controls = list(ERCC = 1:40),
    biological_feature_controls = list(MT = 50:100))