This function was implemented to help tune the variable selection parameters in multilevel analyses.
tune.multilevel(X, Y, multilevel, ncomp = 1, test.keepX = c(5, 10, 15),
                test.keepY = NULL, already.tested.X = NULL,
                already.tested.Y = NULL, method, mode = "regression",
                validation = "Mfold", folds = 10, dist = "max.dist",
                measure = "BER", auc = FALSE, progressBar = TRUE,
                near.zero.var = FALSE, logratio = "none", nrepeat = 1,
                light.output = TRUE)
X: numeric matrix of predictors. NAs are allowed.
Y: if method = 'spls', a numeric vector or matrix of continuous responses (for multi-response models). NAs are allowed.
multilevel: design matrix for multilevel analysis (for repeated measurements); a numeric matrix or data frame. For a one-factor decomposition, the input is a vector indicating the repeated measures on each individual, i.e. the individual IDs. For a two-factor decomposition with splsda models, the two factors are included in Y. Finally, for a two-factor decomposition with spls models, the 2nd and 3rd columns of the design indicate those factors (see examples in ?splsda and ?spls, and the sketch after the arguments below).
ncomp: the number of components to include in the model.
test.keepX: numeric vector of the different numbers of variables to test from the X data set.
test.keepY: if method = 'spls', numeric vector of the different numbers of variables to test from the Y data set.
already.tested.X: if ncomp > 1, a numeric vector indicating the number of variables to select from the X data set on the first component(s).
already.tested.Y: if method = 'spls' and ncomp > 1, a numeric vector indicating the number of variables to select from the Y data set on the first component(s).
method: character string. Which multivariate method and type of analysis to choose, matching one of 'splsda' (Discriminant Analysis) or 'spls' (unsupervised integrative analysis). See Details.
mode: character string. What type of algorithm to use, (partially) matching one of "regression", "canonical", "invariant" or "classic". See Details.
validation: character. What kind of (internal) validation to use, matching one of "Mfold" or "loo" (see below). Default is "Mfold".
folds: the folds in the Mfold cross-validation. See Details.
dist: distance metric to use for splsda to estimate the classification error rate; should be a subset of "centroids.dist", "mahalanobis.dist" or "max.dist" (see Details).
measure: two misclassification measures are available: the overall misclassification error ("overall") or the Balanced Error Rate ("BER").
auc: if TRUE, calculate the Area Under the Curve (AUC) performance of the model. Only used when method = 'splsda'.
progressBar: set to TRUE by default to output the progress bar of the computation.
near.zero.var: boolean; see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Default value is FALSE.
logratio: one of 'none' or 'CLR'. Default is 'none'.
nrepeat: number of times the cross-validation process is repeated.
light.output: if set to FALSE, the prediction/classification of each sample for each test.keepX and each comp is returned.
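As referenced in the multilevel argument above, a minimal sketch of how such a design can be constructed; the individual IDs and the time and dose factors below are made up for illustration:

# one-factor decomposition: a single column with the individual IDs
design.one.factor <- data.frame(sample = c(1, 1, 2, 2, 3, 3))
# two-factor decomposition for an spls model (hypothetical factors):
# IDs in the 1st column, the two repeated-measures factors in the 2nd and 3rd columns
design.two.factors <- data.frame(sample = c(1, 1, 2, 2, 3, 3),
                                 time = c("t1", "t2", "t1", "t2", "t1", "t2"),
                                 dose = c("low", "low", "high", "high", "low", "low"))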
Depending on the type of analysis performed, a list that may contain:
error.rate: the prediction error for each test.keepX on each component, averaged across all repeats and subsampling folds. The standard deviation is also output. All error rates are also available as a list.
choice.keepX: the number of variables selected (optimal keepX) on each component.
choice.ncomp: the optimal number of components in the sPLS-DA model.
error.rate.class: the error rate for each level of Y and for each component, computed with the optimal keepX.
predict: prediction values for each sample, each test.keepX, each comp and each repeat. Only if light.output = FALSE.
class: predicted class for each sample, each test.keepX, each comp and each repeat. Only if light.output = FALSE.
auc: AUC mean and standard deviation if the number of categories in Y is greater than 2 (see Details). Only if auc = TRUE.
cor.value: only for a multilevel analysis with two factors: the correlation between latent variables.
This tuning function should be used to tune the parameters when using a variance decomposition ('multilevel') with a repeated measurement design; see also the details in ?predict.splsda.
If method = 'splsda', a distance metric must be used; see help(predict.splsda) for details about the distances.
The function outputs the optimal number of components that achieve the best performance based on the overall error rate or BER. The assessment is data-driven and similar to the process detailed in (Rohart et al., 2016), where one-sided t-tests assess whether there is a gain in performance when adding a component to the model. Our experience has shown that in most cases the optimal number of components is the number of categories in Y minus 1.
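For instance, a brief sketch of such a tuning run; the objects X, Y and design are hypothetical placeholders for a predictor matrix, outcome factor and repeated-measures design:

# hypothetical tuning run over 3 components, repeating the cross-validation 3 times
tune.res <- tune.multilevel(X = X, Y = Y, multilevel = design,
                            ncomp = 3, test.keepX = c(5, 10, 15),
                            method = 'splsda', validation = 'Mfold',
                            folds = 5, nrepeat = 3)
tune.res$error.rate   # averaged error rate for each test.keepX on each component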
For a sPLS-DA one-factor analysis, M-fold cross-validation is performed; internally, the training data are decomposed into the within-subject variation.
For a sPLS-DA two-factor analysis, the correlation between components from the within-subject variation of X and a matrix including the two factors (design[,-1]) is computed on the whole data set. We cannot obtain a cross-validation error rate as for the sPLS-DA one-factor analysis because of the difficulty of decomposing and predicting the within matrices within each fold.
For a sPLS two-factor analysis, a sPLS canonical-mode model is run, and the correlation between components from the within-subject variation of X and Y is computed on the whole data set.
If validation = "Mfold"
, M-fold cross-validation is performed.
How many folds to generate is selected by specifying the number of folds in folds
.
The folds also can be supplied as a list of vectors containing the indexes defining each
fold as produced by split
.
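For example, a minimal sketch (assuming a hypothetical sample size of 40) of building such a list of folds with split:

# 40 samples randomly assigned to 5 folds of equal size
custom.folds <- split(sample(1:40), rep(1:5, length.out = 40))
# the resulting list of index vectors can be passed directly to 'folds', e.g.
# tune.multilevel(..., validation = "Mfold", folds = custom.folds)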
If validation = "loo"
, leave-one-out cross-validation is performed. By default folds
is set to the number of unique individuals.
If auc = TRUE and there are more than 2 categories in Y, the Area Under the Curve is averaged using a one-vs-all comparison. Note however that the AUC criterion may not be particularly insightful, as the prediction threshold we use in sPLS-DA differs from an AUC threshold (sPLS-DA relies on prediction distances for prediction; see details in ?predict.splsda).
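A small sketch of requesting the AUC on top of the error rates, again assuming hypothetical X, Y and design objects:

# same kind of call as above, now also recording the averaged one-vs-all AUC
tune.auc <- tune.multilevel(X = X, Y = Y, multilevel = design,
                            ncomp = 2, test.keepX = c(5, 10),
                            method = 'splsda', validation = 'Mfold',
                            folds = 5, auc = TRUE)
tune.auc$auc   # AUC mean and standard deviation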
More details about the prediction distances can be found in ?predict, and about the PLS modes in ?pls.
On multilevel analysis:
Liquet, B., Le Cao, K.-A., Hocini, H. and Thiebaut, R. (2012) A novel approach for biomarker selection and the integration of repeated measures experiments from two platforms. BMC Bioinformatics 13:325.
Westerhuis, J. A., van Velzen, E. J., Hoefsloot, H. C., and Smilde, A. K. (2010). Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics, 6(1), 119-128.
mixOmics manuscript:
Rohart F, Gautier B, Singh A, Le Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration.
See also tune.splsda and http://www.mixOmics.org for more details.
## First example: one-factor analysis with sPLS-DA
data(vac18.simulated) # simulated data
design <- data.frame(sample = vac18.simulated$sample)
result.ex1 = tune.multilevel(X = vac18.simulated$genes,
Y = vac18.simulated$stimulation,
multilevel = design,
ncomp = 2,
test.keepX = c(5, 10, 15),
already.tested.X = c(50),
method = 'splsda',
dist = 'mahalanobis.dist',
validation = 'loo')
# overall error rate
result.ex1$error.rate
# classification error rate per class after 2 components
result.ex1$error.rate.class
## Second example: two-factor analysis with sPLS-DA
data(liver.toxicity)
dose <- as.factor(liver.toxicity$treatment$Dose.Group)
time <- as.factor(liver.toxicity$treatment$Time.Group)
# note: we made up those data, pretending they are repeated measurements
repeat.indiv <- c(1, 2, 1, 2, 1, 2, 1, 2, 3, 3, 4, 3, 4, 3, 4, 4, 5, 6, 5, 5,
6, 5, 6, 7, 7, 8, 6, 7, 8, 7, 8, 8, 9, 10, 9, 10, 11, 9, 9,
10, 11, 12, 12, 10, 11, 12, 11, 12, 13, 14, 13, 14, 13, 14,
13, 14, 15, 16, 15, 16, 15, 16, 15, 16)
summary(as.factor(repeat.indiv)) # 16 rats, 4 measurements each
design <- data.frame(sample = repeat.indiv)
result.ex2 = tune.multilevel(liver.toxicity$gene,
Y = data.frame(dose, time),
multilevel = design,
ncomp = 2,
test.keepX = c(5, 10, 15),
already.tested.X = c(50),
method = 'splsda',
dist = 'mahalanobis.dist')
result.ex2
## Third example: one-factor integrative analysis with sPLS
data(liver.toxicity)
# note: we made up those data, pretending they are repeated measurements
repeat.indiv <- c(1, 2, 1, 2, 1, 2, 1, 2, 3, 3, 4, 3, 4, 3, 4, 4, 5, 6, 5, 5,
6, 5, 6, 7, 7, 8, 6, 7, 8, 7, 8, 8, 9, 10, 9, 10, 11, 9, 9,
10, 11, 12, 12, 10, 11, 12, 11, 12, 13, 14, 13, 14, 13, 14,
13, 14, 15, 16, 15, 16, 15, 16, 15, 16)
summary(as.factor(repeat.indiv)) # 16 rats, 4 measurements each
# here we are only interested in a one-level variation split since spls is an unsupervised method
design <- data.frame(sample = repeat.indiv)
result.ex3 = tune.multilevel(X = liver.toxicity$gene, Y = liver.toxicity$clinic,
multilevel = design,
mode = 'canonical',
ncomp = 2,
test.keepX = c(5, 10, 15),
test.keepY = c(2, 3),
already.tested.X = c(50), already.tested.Y = c(5),
method = 'spls')
result.ex3