get.biom: Get biomarkers discriminating between two classes

Description

Biomarkers can be identified in several ways: the classical way is to look at those variables with large model coefficients or large t statistics. One other is based on the higher criticism approach (HC), and the third possibility assesses the stability of these coefficients under subsampling of the data set.

Usage

get.biom(X, Y, fmethod = "all", type = c("stab", "HC", "coef"), ncomp = 2, biom.opt = biom.options(), scale.p = "auto", ...)
"coef"(object, ...)
"print"(x, ...)
"summary"(object, ...)

Arguments

Data matrix. Usually the number of columns (variables) is (much) larger than the number of rows (samples).

Class indication. For classification with two or more factors a factor; a numeric vector will be interpreted as a regression situation, which can only be tackled by fmethod = "lasso".

fmethod

Modelling method(s) employed. The default is to use "all", which will test all methods in the current biom.options$fmethods list. Note that from version 0.4.0, "plsda" and "pclda" are no longer in the list of methods - they have been replaced by "pls" and "pcr", respectively. For compatibility reasons, using the old terms will not lead to an error but only a warning.

type

Whether to use coefficient size as a criterion ("coef"), "stab" or "HC".

ncomp

Number of latent variables to use in PCR and PLS (VIP) modelling. In function get.biom this may be a vector; in all other functions it should be one number. Default: 2.

biom.opt

Options for the biomarker selection - a list with several named elements. See biom.options.

scale.p

Scaling. This is performed individually in every crossvalidation iteration, and can have a profound effect on the results. Default: "auto" (autoscaling). Other possible choices: "none" for no scaling, "pareto" for pareto scaling, "log" and "sqrt" for log and square root scaling, respectively.

object, x

A BMark object.

...

Further arguments for modelling functions. Often used to catch unused arguments.

Value

get.biom returns an object of class "BMark", a list containing an element for every fmethod that is selected, as well as an element info. The individual elements contain information depending on the type chosen: for type == "coef", the only element returned is a matrix containing coefficient sizes. For type == "HC" and type == "stab", a list is returned containing elements biom.indices, and either pvals (for type == "HC") or fraction.selected (for type == "stab"). Element biom.indices contains the indices of the selected variables, and can be extracted using function selection. Element pvals contains the p values used to perform HC thresholding; these are presented in the original order of the variables, and can be obtained directly from e.g. t statistics, or from permutation sampling. Element fraction.selected indicates in what fraction of the stability selection iterations a particular variable has been selected. The more often it has been selected, the more stable it is as a biomarker. Generic function coef.biom extracts model coefficients, p values or stability fractions for types "coef", "HC" and "stab", respectively.

Examples

Run this code

## Real apple data (small set)
data(spikedApples)
apple.coef <- get.biom(X = spikedApples$dataMatrix,
                       Y = factor(rep(1:2, each = 10)),
                       ncomp = 2:3, type = "coef")
coef.sizes <- coef(apple.coef) 
sapply(coef.sizes, range)

## stability-based selection
set.seed(17)
apple.stab <- get.biom(X = spikedApples$dataMatrix,
                       Y = factor(rep(1:2, each = 10)),
                       ncomp = 2:3, type = "stab")
selected.variables <- selection(apple.stab)
unlist(sapply(selected.variables, function(x) sapply(x, length)))
## Ranging from more than 70 for pcr, approx 40 for pls and student t,
## to 0-29 for the lasso
unlist(sapply(selected.variables,
              function(x) lapply(x, function(xx, y) sum(xx %in% y),
              spikedApples$biom)))
## TPs (stab): all find 5/5, except pcr.2 and the lasso with values for lambda
## larger than 0.0484

unlist(sapply(selected.variables,
              function(x) lapply(x, function(xx, y) sum(!(xx %in% y)),
              spikedApples$biom)))
## FPs (stab): PCR finds most FPs (approx. 60), other latent-variable
## methods approx 40, lasso allows for the optimal selection around 
## lambda = 0.0702

## regression example
data(gasoline) ## from the pls package
gasoline.stab <- get.biom(gasoline$NIR, gasoline$octane,
                          fmethod = c("pcr", "pls", "lasso"), type = "stab")


## Not run: 
# ## Same for HC-based selection
# ## Warning: takes a long time!
# apple.HC <- get.biom(X = spikedApples$dataMatrix,
#                      Y = factor(rep(1:2, each = 10)),
#                      ncomp = 2:3, type = "HC")
# sapply(apple.HC[names(apple.HC) != "info"],
#        function(x, y) sum(x$biom.indices %in% y),
#        spikedApples$biom)
# sapply(apple.HC[names(apple.HC) != "info"],
#        function(x, y) sum(!(x$biom.indices %in% y)),
#        spikedApples$biom)
# ## End(Not run)

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

See Also

Examples