GSNormalize: Aggregating and calculating expression statistics by Gene Set

Description

Provides an interface for producing aggregate gene-set statistics, for gene-set-enrichment analysis (GSEA). The function is best suited for mean or rescaled-mean GSEA approaches, but is hopefully generic enough to enable other approaches as well.

Usage

GSNormalize(dataset, incidence, gseaFun = crossprod, fun1 = "/", fun2 = sqrt, removeShift=FALSE, removeStat=mean, ...) 
identity(x)
one(x)

Arguments

dataset

a numeric matrix, typically of some gene-level statistics

incidence

0/1 incidence matrix indicating genes' membership in gene-sets

gseaFun

function name for the type of aggregation to take place, defaults to 'crossprod'. See 'Details'

fun1

function name for normalization, defaults to "/". See 'Details'

fun2

function name for scaling, defaults to 'sqrt'. See 'Details'

removeShift

logical: should normalization begin with a column-wise removal of the mean shift?

removeStat

(if above is TRUE) the column-wise statistic to be swept out of 'dataset'.

...

Additional arguments optionally passed on to 'gseaFun'.

any numerical value

Value

'GSNormalize' returns a matrix with the same number of rows as 'incidence' and the same number of columns as 'dataset' (if 'dataset' is a vector, the output will be a vector as well). The respective row and column names will carry through from 'dataset' and 'incidence' to the output.'identity' simply returns x. 'one' returns the number 1.

Details

In gene-set-enrichment analysis (GSEA), the core step is aggregating (or calculating) gene-set-level statistics from gene-set statistics. This utility achieves the feat. It is tailored specifically for rescaled-sums of the type suggested by Jiang and Gentleman (2007), but is designed as a generic template that should other GSEA approaches. In such cases, at this moment users should provide their own version of 'gseaFun'.

The default will generate sums of gene-level values divided by the square-root of the gene-set size (in other words, gene-set means multiplied by the square-root of gene-set size). The arithmetic works like this:

gene-set stat = gseaFun(t(incidence),dataset),...) 'fun1' fun2(gene-set size).

In case there is a known (or suspected) overall baseline shift (i.e., the mass of gene-level stats is not centered around zero) it may be scientifically more meaningful to look for gene-set deviating from this baseline rather than from zero. In this case, you can set 'removeShift=TRUE'.

Also provided are the 'identity' function (identity = function(x) x), so that leaving 'gseaFun' and 'fun1' at their default and setting 'fun2 = identity' will generate gene-set means -- and the 'one' function to neutralize the effect of both 'fun1' and 'fun2' (see note below).

References

Z. Jiang and R. Gentleman, "Extensions to Gene Set Enrichment Analysis",Bioinformatics (23),306-313, 2007.

Examples

Run this code

data(sample.ExpressionSet)
lm1 = lmPerGene(sample.ExpressionSet,~sex+type)

### Generating random pseudo-gene-sets
fauxGS=matrix(sample(c(0,1),size=50000,replace=TRUE,prob=c(.9,.1)),nrow=100)

### "tau-stats" for gene-SET-level type effect, adjusting for sex
fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS)

qqnorm(fauxEffects)
### diagonal line represents zero-shift null; note that it doesn't fit
abline(0,1,col=2)
### a better option may be to run a diagonal through the middle of the
### data (nonzero-shift null, i.e. type may have an effect but it is the
### same for all gene-sets); note that if any outlier shows, it is a purely random one!

abline(median(fauxEffects),1,col=4)

#### Now try with baseline-shift removal

fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS,removeShift=TRUE)

qqnorm(fauxEffects)
abline(0,1,col=2)

Run the code above in your browser using DataLab