runGSA: Gene set analysis

Description

Performs gene set analysis (GSA) based on a given number of gene-level statistics and a gene set collection, using a variety of available methods, returning the gene set statistics and p-values of different directionality classes.

Usage

runGSA(geneLevelStats, directions=NULL,  geneSetStat="mean",  signifMethod="geneSampling",  adjMethod="fdr", gsc, gsSizeLim=c(1,Inf), permStats=NULL, permDirections=NULL, nPerm=1e4, gseaParam=1, ncpus=1, verbose=TRUE)

Arguments

geneLevelStats

a vector or a one-column data.frame or matrix, containing the gene level statistics. Gene level statistics can be e.g. p-values, t-values or F-values.

directions

a vector or a one-column data.frame or matrix, containing fold-change like values for the related gene-level statistics. This is mainly used if statistics are p-values or F-values, but not required. The values should be positive or negative, but only the sign information will be used, so the actual value will not matter.

geneSetStat

the statistical GSA method to use. Can be one of "fisher", "stouffer", "reporter", "tailStrength", "wilcoxon", "mean", "median", "sum", "maxmean", "gsea" or "page". See below for details.

signifMethod

the method for significance assessment of gene sets, i.e. p-value calculation. Can be one of "geneSampling", "samplePermutation" or "nullDist"

adjMethod

the method for adjusting for multiple testing. Can be any of the methods supported by p.adjust, i.e. "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr" or "none". The exception is for geneSetStat="gsea", where only the options "fdr" and "none" can be used.

gsc

a gene set collection given as an object of class GSC as returned by the loadGSC function.

gsSizeLim

a vector of length two, giving the minimum and maximum gene set size (number of member genes) to be kept for the analysis. Defaults to c(1,Inf).

permStats

a matrix with permutated gene-level statistics (columns) for each gene (rows). This should be calculated by the user by randomizing the sample labels in the original data, and recalculating the gene level statistics for each comparison a large number of times, thus generating a vector (rows in the matrix) of background statistics for each gene. This argument is required and only used if signifMethod="samplePermutation".

permDirections

similar to permStats, but should instead contain fold-change like values for the related permutated statistics. This is mainly used if the statistics are p-values or F-values, but not required. The values should be positive or negative, but only the sign information will be used, so the actual value will not matter. This argument is only used if signifMethod="samplePermutation", but not required. Note however, that if directions is give then also permDirections is required, and vice versa.

nPerm

the number of permutations to use for gene sampling, i.e. if signifMethod="geneSampling". The original Reporter features algorithm (geneSetStat="reporter" and signifMethod="nullDist") also uses a permutation step which is controlled by nPerm.

gseaParam

the exponent parameter of the GSEA approach. This defaults to 1, as recommended by the GSEA authors.

ncpus

the number of cpus to use. If larger than 1, the gene permutation part will be run in parallel and thus decrease runtime. Requires R package snowfall to be installed. Should be set so that nPerm/ncpus is a positive integer.

verbose

a logical. Whether or not to display progress messages during the analysis.

Value

geneStatType: The interpretated type of gene-level statistics
geneSetStat: The method for gene set statistic calculation
signifMethod: The method for significance estimation
adjMethod: The method of adjustment for multiple testing
info: A list object with detailed info number of genes and gene sets
gsSizeLim: The selected gene set size limits
gsStatName: The name of the gene set statistic type
nPerm: The number of permutations
gseaParam: The GSEA parameter
geneLevelStats: The input gene-level statistics
directions: The input directions
gsc: The input gene set collection
nGenesTot: The total number of genes in each gene set
nGenesUp: The number of up-regulated genes in each gene set
nGenesDn: The number of down-regulated genes in each gene set
statDistinctDir: Gene set statistics of the distinct-directional class
statDistinctDirUp: Gene set statistics of the distinct-directional class
statDistinctDirDn: Gene set statistics of the distinct-directional class
statNonDirectional: Gene set statistics of the non-directional class
statMixedDirUp: Gene set statistics of the mixed-directional class
statMixedDirDn: Gene set statistics of the mixed-directional class
pDistinctDirUp: Gene set p-values of the distinct-directional class
pDistinctDirDn: Gene set p-values of the distinct-directional class
pNonDirectional: Gene set p-values of the non-directional class
pMixedDirUp: Gene set p-values of the mixed-directional class
pMixedDirDn: Gene set p-values of the mixed-directional class
pAdjDistinctDirUp: Adjusted gene set p-values of the distinct-directional class
pAdjDistinctDirDn: Adjusted gene set p-values of the distinct-directional class
pAdjNonDirectional: Adjusted gene set p-values of the non-directional class
pAdjMixedDirUp: Adjusted gene set p-values of the mixed-directional class
pAdjMixedDirDn: Adjusted gene set p-values of the mixed-directional class
runtime: The execution time in seconds

Details

The rownames of geneLevelStats and directions should be identical and match the names of the members of the gene sets in gsc. If geneSetStat is set to "fisher", "stouffer", "reporter" or "tailStrength" only p-values are allowed as geneLevelStats. If geneSetStat is set to "maxmean", "gsea" or "page" only t-like geneLevelStats are allowed (e.g. t-values, fold-changes). For geneSetStat set to "fisher", "stouffer", "reporter", "wilcoxon" or "page", the gene set p-values can be calculated from a theoretical null-distribution, in this case, set signifMethod="nullDist". For all methods signifMethod="geneSampling" or signifMethod="samplePermutation" can be used. If signifMethod="geneSampling" gene sampling is used, meaning that the gene labels are randomized nPerm times and the gene set statistics are recalculated so that a background distribution for each original gene set is acquired. The gene set p-values are calculated based on this background distribution. Similarly if signifMethod="samplePermutation" sample permutation is used. In this case the argument permStats (and optionally permDirections) has to be supplied. The runGSA function returns p-values for each gene set. Depending on the choice of methods and gene statistics up to three classes of p-values can be calculated, describing different aspects of regulation directionality. The three directionality classes are Distinct-directional, Mixed-directional and Non-directional. The non-directional p-values (pNonDirectional) are calculated based on absolute values of the gene statistics (or p-values without sign information), meaning that gene sets containing a high portion of significant genes, independent of direction, will turn up significant. That is, gene-sets with a low pNonDirectional should be interpreted to be significantly affected by gene regulation, but there can be a mix of both up and down regulation involved. The mixed-directional p-values (pMixedDirUp and pMixedDirDn) are calculated using the subset of the gene statistics that are up-regulated and down-regulated, respectively. This means that a gene set with a low pMixedDirUp will have a component of significantly up-regulated genes, disregardful of the extent of down-regulated genes, and the reverse for pMixedDirDn. This also means that one can get gene sets that are both significantly affected by down-regulation and significantly affected by up-regulation at the same time. Note that sample permutation cannot be used to calculate pMixedDirUp and pMixedDirDn since the subset sizes will differ. Finally, the distinct-directional p-values (pDistinctDirup and pDistinctDirDn) are calculated from statistics with sign information (e.g. t-statistics). In this case, if a gene set contains both up- and down-regulated genes, they will cancel out each other. A gene-set with a low pDistinctDirUp will be significantly affected by up-regulation, but not a mix of up- and down-regulation (as in the case of the mixed-directional and non-directional p-values). In order to be able to calculate distinct-directional gene set p-values while using p-values as gene-level statistics, the gene-level p-values are transformed as follows: The up-regulated portion of the p-values are divided by 2 (scaled to range between 0-0.5) and the down-regulated portion of p-values are set to 1-p/2 (scaled to range between 1-0.5). This means that a significantly down-regulated gene will get a p-value close to 1. These new p-values are used as input to the gene-set analysis procedure to get pDistinctDirUp. Similarly, the opposite is done, so that the up-regulated portion is scaled between 1-0.5 and the down-regulated between 0-0.5 to get the pDistinctDirDn.

References

Fisher, R. Statistical methods for research workers. Oliver and Boyd, Edinburgh, (1932).

Stouffer, S., Suchman, E., Devinney, L., Star, S., and Williams Jr, R. The American soldier: adjustment during army life. Princeton University Press, Oxford, England, (1949).

Patil, K. and Nielsen, J. Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proceedings of the National Academy of Sciences of the United States of America 102(8), 2685 (2005).

Oliveira, A., Patil, K., and Nielsen, J. Architecture of transcriptional regulatory circuits is knitted over the topology of bio-molecular interaction networks. BMC Systems Biology 2(1), 17 (2008).

Kim, S. and Volsky, D. Page: parametric analysis of gene set enrichment. BMC bioinformatics 6(1), 144 (2005).

Taylor, J. and Tibshirani, R. A tail strength measure for assessing the overall univariate significance in a dataset. Biostatistics 7(2), 167-181 (2006).

Mootha, V., Lindgren, C., Eriksson, K., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. Pgc-1-alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature genetics 34(3), 267-273 (2003).

Subramanian, A., Tamayo, P., Mootha, V., Mukherjee, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E., et al. Gene set enrichment analysis: a knowledgebased approach for interpreting genom-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102(43), 15545-15550 (2005).

Efron, B. and Tibshirani, R. On testing the significance of sets of genes. The Annals of Applied Statistics 1, 107-129 (2007).

Examples

Run this code

   # Load example input data to GSA:
   data("gsa_input")
   
   # Load gene set collection:
   gsc <- loadGSC(gsa_input$gsc)
      
   # Run gene set analysis:
   gsares <- runGSA(geneLevelStats=gsa_input$pvals , directions=gsa_input$directions, 
                    gsc=gsc, nPerm=500)

Run the code above in your browser using DataLab