stattest: statistical tests for differential expression in ballgown

Description

Test each transcript, gene, exon, or intron in a ballgown object for differential expression, using comparisons of linear models.

Usage

stattest(gown = NULL, gowntable = NULL, pData = NULL, mod = NULL, mod0 = NULL, feature = c("gene", "exon", "intron", "transcript"), meas = c("cov", "FPKM", "rcount", "ucount", "mrcount", "mcov"), timecourse = FALSE, covariate = NULL, adjustvars = NULL, gexpr = NULL, df = 4, getFC = FALSE, libadjust = NULL, log = TRUE)

Arguments

gown

name of an object of class ballgown

gowntable

matrix or matrix-like object with rownames representing feature IDs and columns representing samples, with expression estimates in the cells. Provide the feature name with feature. You must provide exactly one of gown or gowntable. NB: gowntable is log-transformed within stattest if log is TRUE, so provide un-logged expression values in gowntable.

pData

Required if gowntable is provided: data frame giving phenotype data for the samples in the columns of gowntable. (Rows of pData correspond to columns of gowntable). If gown is used instead, it must have a non-null, valid pData slot (and the pData argument to stattest should be left NULL).

mod

object of class model.matrix representing the design matrix for the linear regression model including covariates of interest

mod0

object of class model.matrix representing the design matrix for the linear regression model without the covariates of interest.

feature

the type of genomic feature to be tested for differential expression. If gown is used, must be one of "gene", "transcript", "exon", or "intron". If gowntable is used, this is just used for labeling and can be whatever the rows of gowntable represent.

meas

the expression measurement to use for statistical tests. Must be one of "cov", "FPKM", "rcount", "ucount", "mrcount", or "mcov". Not all expression measurements are available for all features. Leave as default if gowntable is provided.

timecourse

if TRUE, tests whether or not the expression profiles of genomic features vary over time (or another continuous covariate) in the study. Default FALSE. Natural splines are used to fit time profiles, so you must have more timepoints than degrees of freedom used to fit the splines. The default df is 4.

covariate

string representing the name of the covariate of interest for the differential expression tests. Must correspond to the name of a column of pData(gown). If timecourse=TRUE, this should be the study's time variable.

adjustvars

optional vector of strings representing the names of potential confounders. Must correspond to names of columns of pData(gown).

gexpr

optional data frame that is the result of calling gexpr(gown)). (You can speed this function up by pre-creating gexpr(gown).)

degrees of freedom used for modeling expression over time with natural cubic splines. Default 4. Only used if timecourse=TRUE.

getFC

if TRUE, also return estimated fold changes (adjusted for library size and confounders) between populations. Only available for 2-group comparisons at the moment. Default FALSE.

libadjust

library-size adjustment to use in linear models. By default, the adjustment is defined as the sum of the sample's log expression measurements below the 75th percentile of those measurements. To use a different library-size adjustment, provide a numeric vector of each sample's adjustment value. Entries of this vector correspond to samples in in rows of pData. If no library size adjustment is desired, set to FALSE.

log

if TRUE, outcome variable in linear models is log(expression+1), otherwise it's expression. Default TRUE.

Value

data frame containing the columns feature, id representing feature id, pval representing the p-value for testing whether this feature was differentially expressed according to covariate, and qval, the estimated false discovery rate using this feature's signal strength as a significance cutoff. An additional column, fc, is included if getFC is TRUE.

Details

At minimum, you need to provide a ballgown object or count table, the type of feature you want to test (gene, transcript, exon, or intron), the expression measurement you want to use (FPKM, cov, rcount, etc.), and the covariate of interest, which must be the name of one of the columns of the `pData` component of your ballgown object (or provided pData). This covariate is automatically converted to a factor during model fitting in non-timecourse experiments.

By default, models are fit using log2(meas + 1) as the outcome for each feature. To disable the log transformation, provide `log = FALSE` as an argument to `stattest`. You can use the gowntable option if you'd like to to use a different transformation.

Library size adjustment is performed by default by using the sum of the log nonzero expression measurements for each sample, up to the 75th percentile of those measurements. This adjustment can be disabled by setting libadjust=FALSE. You can use mod and mod0 to specify alternative library size adjustments.

mod and mod0 are optional arguments. If mod is specified, you must also specify mod0. If neither is specified, mod0 defaults to the design matrix for a model including only a library-size adjustment, and mod defaults to the design matrix for a model including a library-size adjustment and covariate. Note that if you supply mod and mod0, covariate, timecourse, adjustvars, and df are ignored, so make sure your covariate of interest and all appropriate confounder adjustments, including library size, are specified in mod and mod0. By default, the library-size adjustment is the sum of all counts below the 75th percentile of nonzero counts, on the log scale (log2 + 1).

Full model details are described in the supplement of http://biorxiv.org/content/early/2014/03/30/003665.

References

http://biorxiv.org/content/early/2014/03/30/003665

Examples

Run this code

data(bg)

# two-group comparison:
stat_results = stattest(bg, feature='transcript', meas='FPKM', 
  covariate='group')

# timecourse test:
pData(bg) = data.frame(pData(bg), time=rep(1:10, 2)) #dummy time covariate
timecourse_results = stattest(bg, feature='transcript', meas='FPKM', 
  covariate='time', timecourse=TRUE)

# timecourse test, adjusting for group:
group_adj_timecourse_results = stattest(bg, feature='transcript', 
  meas='FPKM', covariate='time', timecourse=TRUE, adjustvars='group')

# custom model matrices:
### create example data:
set.seed(43)
sex = sample(c('M','F'), size=nrow(pData(bg)), replace=TRUE)
age = sample(21:52, size=nrow(pData(bg)), replace=TRUE)

### create design matrices:
mod = model.matrix(~ sex + age + pData(bg)$group + pData(bg)$time)
mod0 = model.matrix(~ pData(bg)$group + pData(bg)$time)

### build model: 
adjusted_results = stattest(bg, feature='transcript', meas='FPKM', 
  mod0=mod0, mod=mod)

Run the code above in your browser using DataLab