opls: PCA, PLS(-DA), and OPLS(-DA)

Description

PCA, PLS, and OPLS regression, classification, and cross-validation with the NIPALS algorithm

Usage

"opls"(x, ...)
"opls"(x,
y = NULL,
predI = NA,
orthoI = 0,
algoC = c("default", "nipals", "svd")[1],
crossvalI = 7,
log10L = FALSE,
permI = 20,
scaleC = c("none", "center", "pareto", "standard")[4],
subset = NULL,
printL = TRUE,
plotL = TRUE,
.sinkC = NULL,
...)

Arguments

Numerical data frame or matrix (observations x variables); NAs are allowed

Response to be modelled: Either 1) 'NULL' for PCA (default) or 2) a numerical vector (same length as 'x' row number) for single response (O)PLS, or 3) a numerical matrix (same row number as 'x') for multiple response PLS or 4) a factor (same length as 'x' row number) for (O)PLS-DA. Note that, for convenience, character vectors are also accepted for (O)PLS-DA as well as single column numerical (resp. character) matrices for (O)PLS (respectively (O)PLS-DA). NAs are allowed in numeric responses.

predI

Integer: number of components (predictive componenents in case of PLS and OPLS) to extract; for OPLS, predI is (automatically) set to 1; if set to NA [default], autofit is performed: a maximum of 10 components are extracted until (i) PCA case: the variance is less than the mean variance of all components (note that this rule requires all components to be computed and can be quite time-consuming for large datasets) or (ii) PLS case: either R2Y of the component is < 0.01 (N4 rule) or Q2Y is < 0 (for more than 100 observations) or 0.05 otherwise (R1 rule)

orthoI

Integer: number of orthogonal components (for OPLS only); when set to 0 [default], PLS will be performed; otherwise OPLS will be peformed; when set to NA, OPLS is performed and the number of orthogonal components is automatically computed by using the cross-validation (with a maximum of 9 orthogonal components).

algoC

Default algorithm is 'svd' for PCA (in case of no missing values in 'x'; 'nipals' otherwise) and 'nipals' for PLS and OPLS; when asking to use 'svd' for PCA on an 'x' matrix containing missing values, NAs are set to half the minimum of non-missing values and a warning is generated

crossvalI

Integer: number of cross-validation segments (default is 7); The number of samples (rows of 'x') must be at least >= crossvalI

log10L

Should the 'x' matrix be log10 transformed? Zeros are set to 1 prior to transformation

permI

Integer: number of random permutations of response labels to estimate R2Y and Q2Y significance by permutation testing [default is 20 for single response models (without train/test partition), and 0 otherwise]

scaleC

Character: either no centering nor scaling ('none'), mean-centering only ('center'), mean-centering and pareto scaling ('pareto'), or mean-centering and unit variance scaling ('standard') [default]

subset

Integer vector: indices of the observations to be used for training (in a classification scheme); use NULL [default] for no partition of the dataset; use 'odd' for a partition of the dataset in two equal sizes (with respect to the classes proportions)

printL

Logical: Should informations regarding the data set and the model be printed? [default = TRUE]

plotL

Logical: Should the 'summary' plot be displayed? [default = TRUE]

.sinkC

Character: Name of the file for R output diversion [default = NULL: no diversion]; Diversion of messages is required for the integration into Galaxy

...

Currently not used.

Value

typeC: Character: model type (PCA, PLS, PLS-DA, OPLS, or OPLS-DA)
descriptionMC: Character matrix: Description of the data set (number of samples, variables, etc.)
modelDF: Data frame with the model overview (number of components, R2X, R2X(cum), R2Y, R2Y(cum), Q2, Q2(cum), significance, iterations)
summaryDF: Data frame with the model summary (cumulated R2X, R2Y and Q2); RMSEE is the square root of the mean error between the actual and the predicted responses
subsetVi: Integer vector: Indices of observations in the training data set
pcaVarVn: PCA: Numerical vector of variances of length: predI
vipVn: PLS(-DA): Numerical vector of Variable Importance in Projection; OPLS(-DA): Numerical vector of Variable Importance for Prediction (VIP4,p from Galindo-Prieto et al, 2014)
orthoVipVn: OPLS(-DA): Numerical vector of Variable Importance for Orthogonal Modeling (VIP4,o from Galindo-Prieto et al, 2014)
xMeanVn: Numerical vector: variable means of the 'x' matrix
xSdVn: Numerical vector: variable standard deviations of the 'x' matrix
yMeanVn: (O)PLS: Numerical vector: variable means of the 'y' response (transformed into a dummy matrix in case it is of 'character' mode initially)
ySdVn: (O)PLS: Numerical vector: variable standard deviations of the 'y' response (transformed into a dummy matrix in case it is of 'character' mode initially)
xZeroVarVi: Numerical vector: indices of variables with variance < 2.22e-16 which were excluded from 'x' before building the model
scoreMN: Numerical matrix of x scores (T; dimensions: nrow(x) x predI) X = TP' + E; Y = TC' + F
loadingMN: Numerical matrix of x loadings (P; dimensions: ncol(x) x predI) X = TP' + E
weightMN: (O)PLS: Numerical matrix of x weights (W; same dimensions as loadingMN)
orthoScoreMN: OPLS: Numerical matrix of orthogonal scores (Tortho; dimensions: nrow(x) x number of orthogonal components)
orthoLoadingMN: OPLS: Numerical matrix of orthogonal loadings (Portho; dimensions: ncol(x) x number of orthogonal components)
orthoWeightMN: OPLS: Numerical matrix of orthogonal weights (same dimensions as orthoLoadingMN)
cMN: (O)PLS: Numerical matrix of Y weights (C; dimensions: number of responses or number of classes in case of qualitative response) x number of predictive components; Y = TC' + F
coMN:: (O)PLS: Numerical matrix of Y orthogonal weights; dimensions: number of responses or number of classes in case of qualitative response with more than 2 classes x number of orthogonal components
uMN: (O)PLS: Numerical matrix of Y scores (U; same dimensions as scoreMN); Y = UC' + G
weightStarMN: Numerical matrix of projections (W*; same dimensions as loadingMN); whereas columns of weightMN are derived from successively deflated matrices, columns of weightStarMN relate to the original 'x' matrix: T = XW*; W*=W(P'W)inv
suppLs: List of additional objects to be used internally by the 'print', 'plot', and 'predict' methods

References

Eriksson et al. (2006). Multi- and Megarvariate Data Analysis. Umetrics Academy. Rosipal and Kramer (2006). Overview and recent advances in partial least squares Tenenhaus (1990). La regression PLS : theorie et pratique. Technip. Wehrens (2011). Chemometrics with R. Springer. Wold et al. (2001). PLS-regression: a basic tool of chemometrics

Examples

Run this code


#### PCA

data(foods) ## see Eriksson et al. (2001); presence of 3 missing values (NA)
head(foods)
foodMN <- as.matrix(foods[, colnames(foods) != "Country"])
rownames(foodMN) <- foods[, "Country"]
head(foodMN)
foo.pca <- opls(foodMN)

#### PLS with a single response

data(cornell) ## see Tenenhaus, 1998
head(cornell)
cornell.pls <- opls(as.matrix(cornell[, grep("x", colnames(cornell))]),
                    cornell[, "y"])

## Complementary graphics

plot(cornell.pls, typeVc = c("outlier", "predict-train", "xy-score", "xy-weight"))

#### PLS with multiple (quantitative) responses

data(lowarp) ## see Eriksson et al. (2001); presence of NAs
head(lowarp)
lowarp.pls <- opls(as.matrix(lowarp[, c("glas", "crtp", "mica", "amtp")]),
                   as.matrix(lowarp[, grepl("^wrp", colnames(lowarp)) |
                                      grepl("^st", colnames(lowarp))]))

#### PLS-DA

data(sacurine)
attach(sacurine)
sacurine.plsda <- opls(dataMatrix, sampleMetadata[, "gender"])

#### OPLS-DA

sacurine.oplsda <- opls(dataMatrix, sampleMetadata[, "gender"], predI = 1, orthoI = NA)

detach(sacurine)

Run the code above in your browser using DataLab