GeneSelection: General method for variable selection with various methods

Description

For different learning data sets as defined by the argument learningsets, this method ranks the genes from the most relevant to the less relevant using one of various 'filter' criteria or provides a sparse collection of variables (Lasso, ElasticNet, Boosting). The results are typically used for variable selection for the classification procedure that follows. For S4 class information, s. GeneSelection-methods.

Usage

GeneSelection(X, y, f, learningsets, method = c("t.test", "welch.test", "wilcox.test", "f.test", "kruskal.test", "limma", "rfe", "rf", "lasso", "elasticnet", "boosting", "golub", "shrinkcat"), scheme, trace = TRUE, ...)

Arguments

Gene expression data. Can be one of the following:

A matrix. Rows correspond to observations, columns to variables.
A data.frame, when f is not missing (s. below).
An object of class ExpressionSet.

Class labels. Can be one of the following:

A numeric vector.
A factor.
A character if X is an ExpressionSet.
missing, if X is a data.frame and a proper formula f is provided.

A two-sided formula, if X is a data.frame. The left part correspond to class labels, the right to variables.

learningsets

An object of class learningsets. May be missing, then the complete datasets is used as learning set.

method

A character specifying the method to be used:

t.test: two-sample t.test (equal variances for both classes assumed).
welch.test: Welch modification of the t.test (unequal variances for both classes).
wilcox.test: Wilcoxon rank sum test.
f.test: F test belonging to the linear hypothesis that the mean is the same for all classes. Usually used for the multiclass scheme, is equivalent to method = t.test in the two-class case.
kruskal.test: Multi-class generalization of the Wilcoxon rank sum test and the nonparametric pendant to the F test, respectively.
limma: 'Moderated t' statistic for the two-class case and 'moderated F' statistic for the multiclass case, described in Smyth (2003). Requires the package limma.
rfe: One-step Recursive Feature Elimination, based on the Support Vector Machine. The method is decribed in Guyon et al. (2002). Requires the package e1071. Take care that appropriate hyperparameters are passed by the ... argument.
rf: Random Forest Variable Importance Measure. Requires the package randomForest
lasso: L1 penalized logistic regression leads to sparsity with respect to the variables used. Calls the function LassoCMA, which requires the package glmpath. warning: Take care that appropriate hyperparameters are passed by the ... argument.
elasticnet: Penalized logistic regression with both L1 and L2 penalty, claimed by Zhou and Hastie (2004) to select 'variable groups'. Calls the function ElasticNetCMA, which requires the package glmpath. warning: Take care that appropriate hyperparameters are passed by the ... argument.
boosting: Componentwise boosting (Buehlmann and Yu, 2003) has been shown to mimic the LASSO (Efron et al., 2004; Buehlmann and Yu, 2006). Calls the function compBoostCMA Take care that appropriate hyperparameters are passed by the ... argument.
golub: The (theoretically unfounded) variable selection criterion used by Golub et al. (1999), s. golub.
shrinkcat: The correlation-adjusted t-score from Zuber and Strimmer (2009)

scheme

The scheme to be used in the case of a non-binary response. Must be one of "pairwise","one-vs-all" or "multiclass". The last case only makes sense if method is one of f.test, limma, rf, boosting, which can directly be applied to the multi class case.

trace

Should the progress be traced ? Default is TRUE.

...

Further arguments passed to the function performing variable selection, s. method.

Value

genesel.

References

Smyth, G. K., Yang, Y.-H., Speed, T. P. (2003). Statistical issues in microarray data analysis. Methods in Molecular Biology 224, 111-136.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002). Gene Selection for Cancer Classification using support vector machines. Journal of Machine Learning Research, 46, 389-422

Zhou, H., Hastie, T. (2004). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2),301-320

Buelmann, P., Yu, B. (2003). Boosting with the L2 loss: Regression and Classification. Journal of the American Statistical Association, 98, 324-339 Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least Angle Regression. Annals of Statistics, 32:407-499

Buehlmann, P., Yu, B. (2006). Sparse Boosting. Journal of Machine Learning Research, 7- 1001:1024 Slawski, M. Daumer, M. Boulesteix, A.-L. (2008) CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439

Examples

Run this code

# load Golub AML/ALL data
data(golub)
### extract class labels
golubY <- golub[,1]
### extract gene expression from first 10 genes
golubX <- as.matrix(golub[,-1])
### Generate five different learningsets
set.seed(111)
five <- GenerateLearningsets(y=golubY, method = "CV", fold = 5, strat = TRUE)
### simple t-test:
selttest <- GeneSelection(golubX, golubY, learningsets = five, method = "t.test")
### show result:
show(selttest)
toplist(selttest, k = 10, iter = 1)
plot(selttest, iter = 1)

Run the code above in your browser using DataLab