perf: Compute evaluation criteria for PLS, sPLS, PLS-DA and sPLS-DA

Description

Function to evaluate the performance of the fitted PLS, sparse PLS, PLS-DA and sparse PLS-DA models using various criteria.

Usage

## S3 method for class 'pls':
perf(object, validation = c("Mfold", "loo"),
     folds = 10, progressBar = TRUE, ...)
## S3 method for class 'spls':
perf(object, validation = c("Mfold", "loo"),
     folds = 10, progressBar = TRUE, ...)
## S3 method for class 'plsda':
perf(object,
     method.predict = c("all", "max.dist", "centroids.dist", "mahalanobis.dist"),
     validation = c("Mfold", "loo"),
     folds = 10, progressBar = TRUE, near.zero.var = FALSE, ...)
## S3 method for class 'splsda':
perf(object,
     method.predict = c("all", "max.dist", "centroids.dist", "mahalanobis.dist"),
     validation = c("Mfold", "loo"),
     folds = 10, progressBar = TRUE, near.zero.var = FALSE, ...)

Arguments

object

object of class inheriting from "pls", "plsda", "spls" or "splsda". The function will retrieve some key parameters stored in that object.

method.predict

only applies to an object inheriting from "plsda" or "splsda", to evaluate the classification performance of the model. Should be "all" (the default) or a subset of "max.dist", "centroids.dist", "mahalanobis.dist".

validation

character. What kind of (internal) validation to use, matching one of "Mfold" or "loo" (see below). Default is "Mfold".

folds

the number of folds in the M-fold cross-validation, or a list of vectors of sample indexes defining each fold. See Details.

progressBar

by default set to TRUE to output the progress bar of the computation.

near.zero.var

logical, used in perf.plsda and perf.splsda (default is FALSE). Note that the nearZeroVar function is still applied by default to the whole data set at the start of the function. When set to TRUE, nearZeroVar is also applied within each cross-validation fold.

...

other arguments to pass to the s/PLS/DA functions.

Value

For PLS and sPLS models, perf produces a list with the following components:

MSEP

Mean Square Error of Prediction for each $Y$ variable. Only applies to objects inheriting from "pls" and "spls".

R2

a matrix of $R^2$ values of the $Y$-variables for models with $1, \ldots,$ ncomp components. Only applies to objects inheriting from "pls" and "spls".

Q2

if $Y$ contains one variable, a vector of $Q^2$ values; otherwise a list with a matrix of $Q^2$ values for each $Y$-variable. Note that in the specific case of an sPLS model, it is better to look at the Q2.total criterion. Only applies to objects inheriting from "pls" and "spls".

Q2.total

a vector of $Q^2$-total values for models with $1, \ldots,$ ncomp components. Only applies to objects inheriting from "pls" and "spls".

features

a list of features selected across the folds ($stable.X and $stable.Y) for the keepX and keepY parameters from the input object.

error.rate

For PLS-DA and sPLS-DA models, perf produces a matrix of classification error rate estimates. The rows correspond to the components in the model and the columns to the prediction method used. Note that the error rate reported for any component includes the performance of the model on earlier components for the specified keepX parameters (e.g. the error rate reported for component 3 for keepX = 20 already includes the fitted model on components 1 and 2 for keepX = 20). For more advanced usage of the perf function, see www.mixomics.org/methods/spls-da/ and consider using the predict function.

Details

For fitted PLS and sPLS regression models, perf estimates the mean squared error of prediction (MSEP), $R^2$, and $Q^2$ to assess the predictive performance of the model using M-fold or leave-one-out cross-validation. Note that only the classic, regression and invariant modes can be applied.

If validation = "Mfold", M-fold cross-validation is performed. The number of folds is set with folds. The folds can also be supplied as a list of vectors containing the indexes defining each fold, as produced by split. When using validation = "Mfold", repeat the process several times, as the results depend strongly on the random splits and on the sample size.
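
As an illustration of the "list of vectors ... as produced by split" form described above, here is a minimal base R sketch. The sample size n, the number of folds M and the seed are hypothetical values chosen for the example, not defaults of perf.

```r
# Hypothetical sizes: n samples split into M folds (illustration only)
n <- 64
M <- 5
set.seed(45)  # fix the random split so the example is reproducible

# Assign each sample index to one of M folds at random,
# keeping the fold sizes as equal as possible
fold.id <- sample(rep(seq_len(M), length.out = n))

# split() returns a list of M integer vectors, one per fold
folds <- split(seq_len(n), fold.id)

# Sanity check: every sample appears in exactly one fold
stopifnot(identical(sort(unlist(folds, use.names = FALSE)), seq_len(n)))
```

A list built this way could then be passed as the folds argument. Changing the seed and calling perf again is one simple way to repeat the M-fold process over different random splits.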

If validation = "loo", leave-one-out cross-validation is performed (in that case, there is no need to repeat the process).

For fitted PLS-DA and sPLS-DA models, perf estimates the classification error rate using cross-validation.

For the sparse approaches (sPLS and sPLS-DA), note that the perf function retrieves the keepX and keepY inputs from the previously run object. The sPLS or sPLS-DA functions are then run again on several different subsets of the data (the cross-validation folds), which will most likely lead to different subsets of selected features. Those are summarised in the output features$stable (see Value) to assess how often each variable is selected across all folds.
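
The idea of selection frequency across folds can be sketched in base R as follows. This is not the internal mixOmics implementation: the per-fold selections and the feature names (gene_A, etc.) are made up for the example, and the exact layout of features$stable may differ.

```r
# Hypothetical selections from 3 cross-validation folds (names are made up)
selected <- list(
  fold1 = c("gene_A", "gene_B", "gene_C"),
  fold2 = c("gene_A", "gene_C", "gene_D"),
  fold3 = c("gene_A", "gene_B", "gene_D")
)

# All features that were selected in at least one fold
all.features <- sort(unique(unlist(selected)))

# Proportion of folds in which each feature was selected
stability <- vapply(
  all.features,
  function(f) mean(vapply(selected, function(s) f %in% s, logical(1))),
  numeric(1)
)

# Most frequently selected features first
stability <- sort(stability, decreasing = TRUE)
```

Features with a frequency close to 1 are selected in almost every fold and can be regarded as more stable than those selected only occasionally.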

For sPLS, the MSEP, $R^2$, and $Q^2$ criteria are averaged across all folds. For sPLS-DA, the classification error rate is averaged across all folds.

References

Tenenhaus, M. (1998). La régression PLS: théorie et pratique. Paris: Editions Technip. (Warning: this definition is no longer used.)

Chavent, M. and Patouille, B. (2003). Calcul des coefficients de régression et du PRESS en régression PLS1. Modulad 30, 1-11. (This is the formula used to calculate the Q2 in perf.pls and perf.spls.)

Le Cao, K. A., Rossouw D., Robert-Granie, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.

Mevik, B.-H., Cederkvist, H. R. (2004). Mean Squared Error of Prediction (MSEP) Estimates for Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). Journal of Chemometrics 18(9), 422-429.

Examples

library(mixOmics)

## validation for objects of class 'pls' (regression)
# ----------------------------------------
  data(liver.toxicity)
  X <- liver.toxicity$gene
  Y <- liver.toxicity$clinic
  
  
  # tune the number of components to choose
  # ---------------------
  # first fit the full model
  liver.pls <- pls(X, Y, ncomp = 10)
  
  # with 5-fold cross validation: we use the same parameters as in model above
  # but we perform cross validation to compute the MSEP, Q2 and R2 criteria
  # ---------------------------
  liver.val <- perf(liver.pls, validation = "Mfold", folds = 5)
  
  # Q2 total is expected to decrease until it falls below a threshold
  liver.val$Q2.total
  
  # ncomp = 2 is enough
  plot(liver.val$Q2.total, type = 'l', col = 'red', ylim = c(-0.5, 0.5),
       xlab = 'PLS components', ylab = 'Q2 total')
  abline(h = 0.0975, col = 'darkgreen')
  legend('topright', col = c('red', 'darkgreen'),
         legend = c('Q2 total', 'threshold 0.0975'), lty = 1)
  title('Liver toxicity PLS 5-fold, Q2 total values')
  
  # have a look at the other criteria
  # ----------------------
  # R2
  liver.val$R2
  matplot(t(liver.val$R2), type = 'l', xlab = 'PLS components', ylab = 'R2 for each variable')
  title('Liver toxicity PLS 5-fold, R2 values')
  
  # MSEP
  liver.val$MSEP
  matplot(t(liver.val$MSEP), type = 'l', xlab = 'PLS components', ylab = 'MSEP for each variable')
  title('Liver toxicity PLS 5-fold, MSEP values')
  
  
  ## validation for objects of class 'spls' (regression)
  # ----------------------------------------
  ncomp = 7
  # first, learn the model on the whole data set
  model.spls = spls(X, Y, ncomp = ncomp, mode = 'regression',
                    keepX = c(rep(10, ncomp)), keepY = c(rep(4,ncomp)))
  
  
  # with 5-fold cross validation
  # (use validation = "loo" instead for leave-one-out cross validation)
  ## set.seed(45)
  model.spls.val <- perf(model.spls, validation = "Mfold", folds = 5)
  
  #Q2 total
  model.spls.val$Q2.total
  
  # R2: we can see how the performance degrades when ncomp increases
  model.spls.val$R2
  plot(model.spls.val, criterion="R2", type = 'l')
  plot(model.spls.val, criterion="Q2", type = 'l')
  
  
  ## validation for objects of class 'splsda' (classification)
  # ----------------------------------------
  data(srbct)
  X <- srbct$gene
  Y <- srbct$class  
  
  ncomp = 5
  
  srbct.splsda <- splsda(X, Y, ncomp = ncomp, keepX = rep(10, ncomp))  
  
  # with Mfold
  # ---------
  set.seed(45)
  error <- perf(srbct.splsda, validation = "Mfold", folds = 8, 
                method.predict = "all")
  
  plot(error, type = "l")
