scater: single-cell analysis toolkit for expression with R

This package contains useful tools for the analysis of single-cell gene expression data using the statistical software R. The package places an emphasis on tools for quality control, visualisation and pre-processing of data before further downstream analysis.

We hope that scater fills a useful niche between raw RNA-sequencing count or transcripts-per-million data and more focused downstream modelling tools such as monocle, scLVM, SCDE, edgeR, limma and so on.

Briefly, scater enables the following:

  1. Automated computation of QC metrics
  2. Transcript quantification from read data with pseudo-alignment
  3. Data format standardisation
  4. Rich visualisations for exploratory analysis
  5. Seamless integration into the Bioconductor universe
  6. Simple normalisation methods

See below for information about installation, getting started and highlights of the package.

Installation

Installation from Bioconductor (recommended)

The scater package has been accepted into Bioconductor! Thus, the most reliable way to install the package is to use the usual Bioconductor method:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("scater")

Currently, only the "devel" (i.e. development) version of scater is available through Bioconductor. This means that you will need to be using Bioconductor devel and the development version of R (R 3.3) in order to install scater from Bioconductor.

The scater package will become available as a "release" version in the next Bioconductor release in April 2016. At this point the release version of scater will work with the release version of R and Bioconductor, and development will continue in the devel version of the package.

Installation from GitHub

Alternatively, scater can be installed directly from GitHub as described below. In this case, the packages that scater uses ("depends on" in R parlance) will not be automatically installed, so you will have to install the required packages yourself using the commands given later in this section.

I recommend using Hadley Wickham's devtools package to install scater directly from GitHub. If you don't have devtools installed, then install that from CRAN (as shown below) and then run the call to install scater:

If you are using the development version of R, 3.3:

install.packages("devtools")
devtools::install_github("davismcc/scater", build_vignettes = TRUE)

If you are using the current release version of R, 3.2.3:

devtools::install_github("davismcc/scater", ref = "release-R-3.2", build_vignettes = TRUE)

If you find that the above will not install on Linux systems, please try with the option build_vignettes = FALSE. This is a known issue that we are working to resolve.

Using the most recent version of R is strongly recommended (R 3.2.3 at the time of writing). Effort has been made to ensure the package works with R >3.0, but the package has not been tested with R <3.1.1.

There are several other packages from CRAN and Bioconductor that scater uses, so you will need to have these packages installed as well. The CRAN packages should install automatically when scater is installed, but you will need to install the Bioconductor packages manually.

Not all of the following are strictly necessary, but they enhance the functionality of scater and are good packages in their own right. The commands below should help with package installations.

CRAN packages:

install.packages(c("data.table", "ggplot2", "knitr", "matrixStats", "MASS",
                "plyr", "reshape2", "rjson", "testthat", "viridis"))

Bioconductor packages:

source("http://bioconductor.org/biocLite.R")
biocLite(c("Biobase", "biomaRt", "edgeR", "limma", "rhdf5"))

Optional packages that are not strictly required but enhance the functionality of scater:

## "parallel" ships with R itself, so it does not need to be installed from CRAN
install.packages(c("cowplot", "cluster", "mvoutlier", "Rtsne"))
biocLite(c("destiny", "monocle"))

You might also like to install dplyr for convenient data manipulation:

install.packages("dplyr")

Getting started

The best place to start is the vignette. From inside an R session, load scater and then browse the vignettes:

library(scater)
browseVignettes("scater")

There is a detailed HTML document available that introduces the main features and functionality of scater.

scater workflow

The diagram below provides an overview of the pre-processing and QC workflow possible in scater, listing the functions that can be used at various stages.

Highlights

The scater package allows you to do some neat things relatively quickly. Some highlights are shown below with example code and screenshots.

  1. Automated computation of QC metrics
  2. Transcript quantification from read data with pseudo-alignment approaches
  3. Data format standardisation
  4. Rich visualisations for QC and exploratory analysis
  5. Seamless integration into the Bioconductor universe
  6. Simple normalisation methods

For details of how to use these functions, please consult the vignette and package documentation. The plots shown use the example data included with the package (for which there is no interesting structure) and as shown require only one or two lines of code to generate.

Automatic computation of QC metrics

Use the calculateQCMetrics function to compute many metrics useful for gene/transcript-level and cell-level QC. Metrics computed include number of genes expressed per cell, percentage of expression from control genes (e.g. ERCC spike-ins) and many more.
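To make the metrics named above concrete, here is a minimal base-R sketch that computes three of them by hand on a toy count matrix. It is not how calculateQCMetrics is implemented: the matrix values, the "ERCC" name-prefix rule for spotting controls, and the variable names are all assumptions made for illustration.

```r
# Toy count matrix: 6 features (rows) x 4 cells (columns).
# All values are made up for illustration.
counts <- matrix(
  c(10, 0, 3, 0,
     0, 0, 5, 1,
     7, 2, 0, 0,
     0, 1, 0, 0,
    20, 5, 2, 9,
     5, 0, 0, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB", "geneC", "geneD", "ERCC-1", "ERCC-2"),
    paste0("cell", 1:4)
  )
)

# Assumption: spike-in controls are identified by an "ERCC" name prefix.
is_control <- grepl("^ERCC", rownames(counts))

# Library size: total counts per cell.
total_counts <- colSums(counts)

# Number of features with a non-zero count in each cell.
n_expressed <- colSums(counts > 0)

# Percentage of each cell's counts that comes from control features.
pct_control <- 100 * colSums(counts[is_control, , drop = FALSE]) / total_counts

# One row per cell, one column per metric.
qc <- data.frame(total_counts, n_expressed, pct_control)
print(qc)
```

Cells with a very high percentage of counts from controls, or very few expressed features, are typical candidates for closer inspection during QC.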

Transcript quantification with kallisto

The runKallisto function provides a wrapper to the kallisto software for quantifying transcript abundance from FASTQ files using a pseudo-alignment approach. This new approach is extremely fast. With readKallisto, transcript quantities can be read into a data object in R.

Plotting functions

The default plot for an SCESet object gives the cumulative expression for the most-expressed features (genes or transcripts).

The plotTSNE function produces a t-distributed stochastic neighbour embedding plot for the cells.

The plotPCA function produces a principal components analysis plot for the cells.

The plotDiffusionMap function produces a diffusion map plot for the cells.

The plotExpression function plots the expression values for a selection of features.

The plotQC function produces a variety of QC plots useful for diagnostics and feature and cell filtering. It can be used to plot the most highly-expressed genes (or features) in the data set or create density plots to assess the relative importance of explanatory variables, as well as many other visualisations useful for QC.

The plotPhenoData function plots two phenotype metadata variables (such as QC metrics).

See also plotFeatureData to plot feature (gene) metadata variables, including QC metrics.

Plus many, many more possibilities. Please consult the vignette and documentation for details.
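As a rough illustration of what a cell-level PCA plot displays, the sketch below runs a principal components analysis on simulated expression values using only base R. It does not use scater or reproduce plotPCA's defaults. The simulated data, the group labels and the choice to centre and scale features are assumptions made for this example.

```r
set.seed(1)

# Simulated log-expression values: 50 features x 20 cells, in two groups of 10.
n_features <- 50
group <- rep(c("A", "B"), each = 10)
exprs_mat <- matrix(
  rnorm(n_features * 20), nrow = n_features,
  dimnames = list(paste0("gene", 1:n_features), paste0("cell", 1:20))
)

# Shift the first 10 features upwards in group B so that the groups differ.
exprs_mat[1:10, group == "B"] <- exprs_mat[1:10, group == "B"] + 3

# prcomp() expects observations in rows, so transpose to cells x features.
pca <- prcomp(t(exprs_mat), center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scatter plot of the first two components, coloured by group.
plot(pca$x[, 1], pca$x[, 2],
     col = ifelse(group == "A", "steelblue", "tomato"), pch = 19,
     xlab = sprintf("PC1 (%.0f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.0f%%)", 100 * var_explained[2]))
```

Each point is one cell. Cells with similar expression profiles sit close together, which is why such plots are useful for spotting batch effects or outlying cells.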

Acknowledgements and disclaimer

The package leans heavily on previously published work and packages, namely edgeR and limma. The SCESet class is inspired by the CellDataSet class from monocle, and SCESet objects in scater can be easily converted to and from monocle's CellDataSet objects.

The package is currently in a beta state. The major functionality of the package is settled, but it is still under development, so it may change from time to time. Please do try it and contact me with bug reports, feedback, feature requests, questions and suggestions to improve the package.

Davis McCarthy, February 2016

Version: 1.0.4

License: GPL (>= 2)

Last published: February 15th, 2017

Functions in scater (1.0.4)

arrange: Arrange rows of pData(object) by variables.
SCESet-subset: Subsetting SCESet Objects
is_exprs: Accessors for the 'is_exprs' element of an SCESet object.
filter: Return SCESet with cells matching conditions.
isOutlier: Identify if a cell is an outlier based on a metric
readSalmonResults: Read Salmon results from a batch of jobs
pData<-,SCESet,AnnotatedDataFrame-method: Replaces phenoData in an SCESet object
readKallistoResultsOneSample: Read kallisto results for a single sample into a list
plotMDS: Produce a multidimensional scaling plot for an SCESet object
readKallistoResults: Read kallisto results from a batch of jobs
toCellDataSet: Convert an SCESet to a CellDataSet
runKallisto: Run kallisto on FASTQ files to quantify feature abundance
set_exprs<-: Assignment method for the new elements of an SCESet object.
tpm: Accessors for the 'tpm' (transcripts per million) element of an SCESet object.
plotMetadata: Plot metadata for cells or features
cellPairwiseDistances: cellPairwiseDistances in an SCESet object
cellNames: Get cell names from an SCESet object
plotExplanatoryVariables: Plot explanatory variables ordered by percentage of phenotypic variance explained
reducedDimension: Reduced dimension representation for cells in an SCESet object
cpm: Accessors for the 'cpm' (counts per million) element of an SCESet object.
calculateFPKM: Calculate fragments per kilobase of exon per million reads mapped (FPKM)
summariseExprsAcrossFeatures: Summarise expression values across features
get_exprs: Generic accessor for expression data from an SCESet object.
plot: Plot an overview of expression for each cell
scater_gui: scater GUI function
plotQC: Produce QC diagnostic plots
readSalmonResultsOneSample: Read Salmon results for a single sample into a list
getBMFeatureAnnos: Get feature annotation information from Biomart
fpkm: Accessors for the 'fpkm' (fragments per kilobase of exon per million reads mapped) element of an SCESet object.
multiplot: Multiple plot function for ggplot2 plots
normaliseExprs: Normalise expression levels for an SCESet object
plotFeatureData: Plot feature (gene) data from an SCESet object
sizeFactors: Accessors for the size factors of an SCESet object.
getExprs: Retrieve a representation of gene expression
bootstraps: Accessor and replacement for bootstrap results in an SCESet object
fData<-,SCESet,AnnotatedDataFrame-method: Replaces featureData in an SCESet object
plotHighestExprs: Plot the features with the highest expression values
calcIsExprs: Calculate which features are expressed in which cells using a threshold on observed counts, transcripts-per-million, counts-per-million, FPKM, or defined expression levels.
norm_counts: Accessors for the 'norm_counts' element of an SCESet object.
plotDiffusionMap: Plot a diffusion map for an SCESet object
normalize: Normalise an SCESet object using pre-computed size factors
sc_example_counts: A small example single-cell counts dataset to demonstrate capabilities of scater
plotExprsFreqVsMean: Plot frequency of expression against mean expression level
stand_exprs: Accessors for the 'stand_exprs' (standardised expression) element of an SCESet object.
sc_example_cell_info: Cell information for the small example single-cell counts dataset to demonstrate capabilities of scater
plotPhenoData: Plot phenotype data from an SCESet object
SCESet: The "Single Cell Expression Set" (SCESet) class
counts: Accessors for the 'counts' element of an SCESet object.
calculateQCMetrics: Calculate QC metrics
calculateTPM: Calculate transcripts-per-million (TPM)
newSCESet: Create a new SCESet object.
norm_cpm: Accessors for the 'norm_cpm' (normalised counts per million) element of an SCESet object.
plotExpression: Plot expression values for a set of features (e.g. genes or transcripts)
plotReducedDim: Plot reduced dimension representation of cells
norm_tpm: Accessors for the 'norm_tpm' (normalised transcripts per million) element of an SCESet object.
rename: Rename variables of pData(object).
plotPCA: Plot PCA for an SCESet object
plotTSNE: Plot t-SNE for an SCESet object
findImportantPCs: Find most important principal components for a given variable
fromCellDataSet: Convert a CellDataSet to an SCESet
norm_exprs: Accessors for the 'norm_exprs' (normalised expression) element of an SCESet object.
norm_fpkm: Accessors for the 'norm_fpkm' (normalised fragments per kilobase of exon per million reads mapped) element of an SCESet object.
mutate: Add new variables to pData(object).
featurePairwiseDistances: featurePairwiseDistances in an SCESet object
readTxResults: Read transcript quantification data with tximport package
scater-package: Single-cell analysis toolkit for expression in R