Wrapper for all tuning functions.
tune(method,
     X,
     Y,
     multilevel,
     ncomp,
     study,                            # mint.splsda
     test.keepX = c(5, 10, 15),        # all but pca, rcc
     test.keepY = NULL,                # rcc, multilevel
     already.tested.X,                 # all but pca, rcc
     already.tested.Y,                 # multilevel
     mode = "regression",              # multilevel
     nrepeat = 1,                      # multilevel, splsda
     grid1 = seq(0.001, 1, length = 5),  # rcc
     grid2 = seq(0.001, 1, length = 5),  # rcc
     validation = "Mfold",             # all but pca
     folds = 10,                       # all but pca
     dist = "max.dist",                # all but pca, rcc
     measure = c("BER"),               # all but pca, rcc
     auc = FALSE,
     progressBar = TRUE,               # all but pca, rcc
     near.zero.var = FALSE,            # all but pca, rcc
     logratio = "none",                # all but pca, rcc
     center = TRUE,                    # pca
     scale = TRUE,                     # mint, splsda
     max.iter = 100,                   # pca
     tol = 1e-09,
     light.output = TRUE               # mint, splsda
)
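The default grids for grid1 and grid2 in the signature above can be inspected directly in base R. The sketch below only evaluates the default expression; the assumption that the two grids are fully crossed into (lambda1, lambda2) pairs is ours, and the name `pairs` is purely illustrative.

```r
# Default grids from the usage signature.
# Note: `length` partially matches the `length.out` argument of seq().
grid1 <- seq(0.001, 1, length = 5)
grid2 <- seq(0.001, 1, length = 5)

# Five equally spaced values between 0.001 and 1.
print(grid1)

# All (lambda1, lambda2) combinations, ASSUMING the grids are fully crossed.
pairs <- expand.grid(lambda1 = grid1, lambda2 = grid2)
print(nrow(pairs))
```

With the defaults this gives 5 values per grid, so a full crossing would evaluate 25 parameter pairs; finer grids increase the computation accordingly.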
method: selects the tuning function to which all other arguments are passed. Must be one of "spls", "splsda", "mint.splsda", "rcc" or "pca".
X: numeric matrix of predictors. NAs are allowed.
Y: either a factor or a class vector for the discrete outcome, or a numeric vector or matrix of continuous responses (for multi-response models).
multilevel: design matrix for multilevel analysis (for repeated measurements) that indicates the repeated measures on each individual, i.e. the individual IDs. See Details.
ncomp: the number of components to include in the model.
study: grouping factor indicating which samples are from the same study.
test.keepX: numeric vector of the different numbers of variables to test from the \(X\) data set.
test.keepY: if method = 'spls', numeric vector of the different numbers of variables to test from the \(Y\) data set.
already.tested.X: optional, if ncomp > 1. A numeric vector indicating the number of variables to select from the \(X\) data set on the first components.
already.tested.Y: if method = 'spls' and ncomp > 1, numeric vector indicating the number of variables to select from the \(Y\) data set on the first components.
mode: character string. The type of algorithm to use, (partially) matching one of "regression", "canonical", "invariant" or "classic". See Details.
nrepeat: number of times the cross-validation process is repeated.
grid1, grid2: numeric vectors defining the values of lambda1 and lambda2 at which the cross-validation score should be computed. Defaults to grid1 = grid2 = seq(0.001, 1, length = 5).
validation: character. The kind of (internal) validation to use, matching one of "Mfold" or "loo" (see below). Default is "Mfold".
folds: the folds in the Mfold cross-validation. See Details.
dist: distance metric to use for splsda to estimate the classification error rate; should be a subset of "centroids.dist", "mahalanobis.dist" or "max.dist" (see Details).
measure: two misclassification measures are available: the overall misclassification error ("overall") or the Balanced Error Rate ("BER").
auc: if TRUE, calculate the Area Under the Curve (AUC) performance of the model.
progressBar: set to TRUE (the default) to display the progress bar of the computation.
near.zero.var: boolean, see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Default value is FALSE.
logratio: one of ('none', 'CLR'). Defaults to 'none'.
center: a logical value indicating whether the variables should be shifted to be zero centered. Alternatively, a vector of length equal to the number of columns of X can be supplied. The value is passed to scale.
scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The usage above sets the default to TRUE; note that this differs from the prcomp function, whose default is FALSE. In general scaling is advisable. Alternatively, a vector of length equal to the number of columns of X can be supplied. The value is passed to scale.
max.iter: integer, the maximum number of iterations for the NIPALS algorithm.
tol: a positive real, the tolerance used for the NIPALS algorithm.
light.output: if set to FALSE, the prediction/classification of each sample for each value of test.keepX and each component is returned.
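To illustrate the difference between the two misclassification measures named above, the following base R sketch computes both on a small, made-up unbalanced example. It uses the usual definition of the Balanced Error Rate (the mean of the per-class error rates) and is not the package's internal code; the labels and variable names are hypothetical.

```r
# Hypothetical true and predicted labels for an unbalanced two-class problem:
# 8 samples of class A (all correct), 2 samples of class B (one misclassified).
truth <- factor(c(rep("A", 8), rep("B", 2)), levels = c("A", "B"))
pred  <- factor(c(rep("A", 8), "A", "B"),    levels = c("A", "B"))

# Confusion matrix: rows are true classes, columns are predicted classes.
conf <- table(truth, pred)

# Overall misclassification error: proportion of all samples misclassified.
overall <- 1 - sum(diag(conf)) / sum(conf)

# Balanced Error Rate: mean of the per-class error rates, so that each class
# contributes equally regardless of its size.
per.class <- 1 - diag(conf) / rowSums(conf)
BER <- mean(per.class)

print(c(overall = overall, BER = BER))
```

Here the overall error is 0.1 while the BER is 0.25, which shows why BER is often the more informative choice when class sizes differ.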
Depending on the type of analysis performed and the input arguments, a list that may contain:
error.rate: returns the prediction error for each test.keepX on each component, averaged across all repeats and subsampling folds. The standard deviation is also output. All error rates are also available as a list.
choice.keepX: returns the number of variables selected (optimal keepX) on each component.
choice.ncomp: for supervised models; returns the optimal number of components for the model for each prediction distance, using one-sided t-tests that test for a significant difference in the mean error rate (gain in prediction) when components are added to the model. See more details in Rohart et al. 2017, Suppl. For more than one block, an optimal ncomp is returned for each prediction framework.
error.rate.class: returns the error rate for each level of Y and for each component, computed with the optimal keepX.
predict: prediction values for each sample, each test.keepX, each comp and each repeat. Only if light.output = FALSE.
class: predicted class for each sample, each test.keepX, each comp and each repeat. Only if light.output = FALSE.
auc: AUC mean and standard deviation if the number of categories in Y is greater than 2, see details above. Only if auc = TRUE.
cor.value: only if multilevel analysis with 2 factors: correlation between latent variables.
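The idea of choosing the number of components with a one-sided t-test, as described above, can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's internal procedure: the error rates are made up, the significance threshold `alpha` is an assumed value, and the variable names are hypothetical.

```r
# Hypothetical error rates over 10 repeats (values invented for illustration).
err.comp1 <- c(0.30, 0.28, 0.32, 0.31, 0.29, 0.30, 0.33, 0.27, 0.31, 0.30)
err.comp2 <- c(0.20, 0.22, 0.19, 0.21, 0.23, 0.18, 0.20, 0.22, 0.21, 0.19)

# One-sided test: is the mean error with 1 component greater than the mean
# error with 2 components (i.e. does adding a component improve prediction)?
tt <- t.test(err.comp1, err.comp2, alternative = "greater")

# Keep the extra component only if the gain is significant.
alpha <- 0.01  # ASSUMED threshold, chosen for this sketch
ncomp.choice <- if (tt$p.value < alpha) 2 else 1

print(ncomp.choice)
```

In this toy data set the second component lowers the mean error clearly, so the test favours two components; with overlapping error rates the sketch would keep one.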
The tune function calls the predict function; most arguments are described in more detail in ?predict.
Also see the help file corresponding to your method, e.g. tune.splsda. Note that only the arguments used in the tune function corresponding to method are passed on.
Some details on the use of the nrepeat argument are provided in ?perf.
More details about the prediction distances are in ?predict and the supplemental material of the mixOmics article (Rohart et al. 2017). More details about the PLS modes are in ?pls.
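As a rough picture of what repeating an M-fold cross-validation means for the folds and nrepeat arguments, the base R sketch below builds the sample partitions only. It is an assumption-laden illustration rather than the package's code: the sample size, the seed and the name `partitions` are hypothetical, and no model is fitted.

```r
set.seed(42)   # fixed seed so the illustration is reproducible
n <- 20        # hypothetical number of samples
folds <- 5     # number of folds (M)
nrepeat <- 3   # number of times the whole cross-validation is repeated

# For each repeat, shuffle the sample indices and deal them into `folds` groups.
partitions <- lapply(seq_len(nrepeat), function(r) {
  split(sample(n), rep(seq_len(folds), length.out = n))
})

# Within a repeat, every sample appears in exactly one test fold.
print(sapply(partitions, function(p) length(unlist(p))))
```

Each repeat draws a different random partition, so averaging the error rates over repeats reduces the dependence of the result on any single split.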
DIABLO:
Singh A., Gautier B., Shannon C., Vacher M., Rohart F., Tebbutt S. and Lê Cao K.A. (2016). DIABLO - multi omics integration for biomarker discovery.
mixOmics article:
Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752.
MINT:
Rohart F, Eslami A, Matigian N, Bougeard S, Lê Cao K-A (2017). MINT: A multivariate integrative approach to identify a reproducible biomarker signature across multiple experiments and platforms. BMC Bioinformatics 18:128.
PLS and PLS criteria for PLS regression:
Tenenhaus, M. (1998). La régression PLS: théorie et pratique. Paris: Éditions Technip.
Chavent, Marie and Patouille, Brigitte (2003). Calcul des coefficients de régression et du PRESS en régression PLS1. Modulad 30, 1-11. (This is the formula used to calculate the Q2 in perf.pls and perf.spls.)
Mevik, B.-H., Cederkvist, H. R. (2004). Mean Squared Error of Prediction (MSEP) Estimates for Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). Journal of Chemometrics 18(9), 422-429.
sparse PLS regression mode:
Lê Cao, K. A., Rossouw D., Robert-Granie, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.
One-sided t-tests (suppl material):
Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K-A, Wells CA (2016). A Molecular Classification of Human Mesenchymal Stromal Cells. PeerJ 4:e1845.
See also tune.rcc, tune.mint.splsda, tune.pca, tune.splsda, tune.splslevel and http://www.mixOmics.org for more details.
# NOT RUN {
library(mixOmics)

## sPLS-DA
data(breast.tumors)
X <- breast.tumors$gene.exp
Y <- as.factor(breast.tumors$sample$treatment)
# store the result under a name that does not mask the tune() function
tune.res <- tune(method = "splsda", X, Y, ncomp = 1, nrepeat = 10, logratio = "none",
                 test.keepX = c(5, 10, 15), folds = 10, dist = "max.dist", progressBar = TRUE)
plot(tune.res)

## mint.splsda
data(stemcells)
# avoid names such as `data` and `exp` that mask base R functions
X <- stemcells$gene
type.id <- stemcells$celltype
study <- stemcells$study
out <- tune(method = "mint.splsda", X = X, Y = type.id, ncomp = 2, study = study,
            test.keepX = seq(1, 10, 1))
out$choice.keepX
plot(out)
# }