gsea: GSEA (Gene Set Enrichment Analysis).

Description

Computes the enrichment scores and simulated enrichment scores for each variable and signature. An important parameter of the function is logScale. Its default value is TRUE, which means that by default the provided scores (i.e. fold changes, hazard ratios) will be log scaled. Remember to set this parameter to FALSE if your scores are already log scaled. The getEs, getEsSim, getFc, getHr and getFcHr methods can be used to access each subobject. For more information please see the man pages of each method.

It also computes the NES (normalized enrichment score), p values and fdr (false discovery rate) for all variables and signatures. For an overview of the output use the summary method.

When the provided gene sets have more than 10 distinct lengths, an approximation of the enrichment score simulations (ESM) is computed. The value of the ESM depends only on the length of the gene set. Therefore we compute the ESM over a grid of possible gene set lengths which are representative of the lengths of the provided gene sets. Then we fit a generalized additive model with cubic splines to predict the NES value based on the length of every gene set. This is a much faster approach that can be very useful when the software has to be run over a huge number of gene sets.
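The grid-based approximation described above can be sketched roughly as follows, using base R only. This is an illustration under stated assumptions, not the package's implementation: enrichment_score and simulate_es are hypothetical helpers, the scores are simulated, smooth.spline (a cubic smoothing spline) stands in for the generalized additive model, and the smoothed quantity (the mean absolute simulated ES) is chosen only for illustration.

```r
# Hypothetical helper: weighted running-sum enrichment score for one gene set.
enrichment_score <- function(scores, in_set) {
  ord <- order(scores, decreasing = TRUE)
  s <- scores[ord]
  hit <- in_set[ord]
  p_hit <- cumsum(ifelse(hit, abs(s), 0)) / sum(abs(s[hit]))
  p_miss <- cumsum(!hit) / sum(!hit)
  running <- p_hit - p_miss
  running[which.max(abs(running))]  # signed maximum deviation from zero
}

# Hypothetical helper: ES of B random gene sets of a given length.
simulate_es <- function(scores, set_size, B) {
  replicate(B, {
    in_set <- seq_along(scores) %in% sample(length(scores), set_size)
    enrichment_score(scores, in_set)
  })
}

set.seed(1)
scores <- rnorm(2000)  # simulated scores, assumed already on the log scale

# Lengths of the provided gene sets (more than 10 distinct values).
set_sizes <- c(12, 15, 18, 22, 27, 33, 40, 48, 57, 70, 85, 100, 150, 200)

# Simulate only over a small grid of representative lengths.
grid <- unique(round(seq(min(set_sizes), max(set_sizes), length.out = 8)))
mean_abs_es <- sapply(grid, function(k) mean(abs(simulate_es(scores, k, B = 100))))

# Smooth over the grid and predict for every gene set length.
fit <- smooth.spline(grid, mean_abs_es, df = 4)
predicted <- predict(fit, x = set_sizes)$y
```

The simulations run for 8 grid lengths instead of 14 distinct gene set lengths; the saving grows with the number of distinct lengths.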

Usage

gsea(x, gsets, logScale=TRUE, absVals=FALSE, averageRepeats=FALSE, B=1000, mc.cores=1, test="perm", p.adjust.method="none", pval.comp.method="original", pval.smooth.tail=TRUE, minGenes=10, maxGenes=500, center=FALSE)

Arguments

x

ePhenoTest, numeric or matrix object containing scores (hazard ratios or fold changes).

gsets

character or list object containing the names of the genes that belong to each signature.

logScale

if TRUE the provided scores will be log scaled. Set it to FALSE if the scores are already log scaled.

absVals

if TRUE fold changes and hazard ratios that are negative will be turned into positive before starting the process. This is useful when genes can go in both directions.

averageRepeats

if x is of class numeric and has repeated names (several measures for some individual names) we can average the measures that share a name.

B

number of simulations to perform.

mc.cores

number of processors to use.

test

the test that will be used. 'perm' stands for the permutation based method, 'wilcox' stands for the Wilcoxon test (this is the fastest one) and 'ttperm' stands for the permutation t-test.

p.adjust.method

p adjustment method to be used. Common options are 'BH', 'BY', 'bonferroni' or 'none'. All available options and their explanations can be found on the p.adjust function manual.

pval.comp.method

the p value computation method. Has to be one of 'signed' or 'original'. The default one is 'original'. See details for more information.

pval.smooth.tail

if we want to estimate the tail of the distribution where the p values will be generated.

minGenes

gene sets with fewer than minGenes genes will be removed from the analysis.

maxGenes

gene sets with more than maxGenes genes will be removed from the analysis.

center

if we want to center scores (fold changes or hazard ratios). The following will be done: x = x-mean(x).

Details

The following preprocessing is applied to the provided scores (i.e. fold changes, hazard ratios) to avoid errors during the enrichment score computation:

- When two scores have the same name, their average is used.
- Zeros are removed.
- Scores without names (which cannot be in any signature) are removed.
- Incomplete cases (i.e. NAs, NaNs) are removed.

The ES is calculated for each signature and variable (see references). If the test parameter is 'perm', the signature is permuted and the ES is recalculated (this happens B times for each variable, 1000 by default). If test is 'wilcox', a Wilcoxon test is performed to test whether the average value of the genes that belong to the signature differs from the average value of the genes that do not belong to it. If test is 'ttperm', a permutation t-test is used. Take into account that the final plot will be different when 'wilcox' is used.
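The preprocessing steps listed above can be sketched in base R as follows. This is a minimal illustration and not the package's code: preprocess_scores is a hypothetical helper, and the order in which the steps are applied here (removals first, averaging last) is an assumption.

```r
# Hypothetical helper illustrating the preprocessing of a named score vector.
preprocess_scores <- function(x) {
  nms <- names(x)
  if (is.null(nms)) return(x[0])      # no names at all: nothing can be used
  x <- x[!is.na(nms) & nms != ""]     # drop scores without names
  x <- x[!is.na(x)]                   # drop incomplete cases (NA, NaN)
  x <- x[x != 0]                      # drop zeros
  avg <- tapply(x, names(x), mean)    # average scores that share a name
  setNames(as.vector(avg), names(avg))
}

# Toy input: a repeated name, a zero, a missing value and an unnamed score.
x <- c(geneA = 2, geneB = 0, geneA = 4, geneC = NA, 1.5, geneD = -1)
clean <- preprocess_scores(x)
clean  # geneA = 3 (average of 2 and 4), geneD = -1
```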

The simulated enrichment scores and the observed one are used to find the p value. The p value calculation depends on the parameter pval.comp.method, whose default value is 'original'. With 'original' the p value is the proportion of absolute simulated ES which are larger than the absolute observed ES. With 'signed' the p value is the proportion of simulated ES which are larger than the observed ES (when the observed enrichment score is positive) or the proportion of simulated ES which are smaller than the observed ES (when the observed enrichment score is negative).
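The two p value definitions can be written down in a few lines of base R. The function es_pvalue below is a hypothetical helper for illustration only, and the simulated ES values are made up; the tail smoothing controlled by pval.smooth.tail is not reproduced here.

```r
# Hypothetical helper: p value of an observed ES given simulated ES values.
es_pvalue <- function(es_obs, es_sim, method = c("original", "signed")) {
  method <- match.arg(method)
  if (method == "original") {
    # proportion of absolute simulated ES larger than the absolute observed ES
    mean(abs(es_sim) > abs(es_obs))
  } else if (es_obs >= 0) {
    # positive observed ES: proportion of simulated ES larger than it
    mean(es_sim > es_obs)
  } else {
    # negative observed ES: proportion of simulated ES smaller than it
    mean(es_sim < es_obs)
  }
}

es_sim <- c(-0.6, -0.4, -0.2, 0.1, 0.3, 0.5, 0.7, 0.9)  # made-up simulated ES

es_pvalue(0.6, es_sim, "original")   # 2 of 8 values: 0.25
es_pvalue(-0.5, es_sim, "original")  # 3 of 8 values: 0.375
es_pvalue(-0.5, es_sim, "signed")    # 1 of 8 values: 0.125
```

With a negative observed ES the 'signed' method only looks at the lower tail, which is why it gives a smaller p value than 'original' in this toy example.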

References

Aravind Subramanian et al. (October 25, 2005). Gene Set Enrichment Analysis. www.pnas.org/cgi/doi/10.1073/pnas.0506580102

C.A. Tsai and J.J. Chen (2007). Kernel estimation for adjusted p-values in multiple testing. Computational Statistics & Data Analysis, 51(8), 3885-3897. http://econpapers.repec.org/article/eeecsdana/v_3a51_3ay_3a2007_3ai_3a8_3ap_3a3885-3897.htm

Examples

#load epheno object
data(epheno)
epheno

#we construct two signatures
sign1 <- sample(featureNames(epheno))[1:20]
sign2 <- sample(featureNames(epheno))[50:75]
mySignature <- list(sign1,sign2)
names(mySignature) <- c('My first signature','My preferred signature')

#run gsea functions
gseaData <- gsea(x=epheno,gsets=mySignature,B=100,mc.cores=1)
my.summary <- summary(gseaData)
my.summary 
#plot(gseaData)