Learn R Programming

mdatools (version 0.14.2)

pca: Principal Component Analysis

Description

pca is used to build and explore a principal component analysis (PCA) model.

Usage

pca(
  x,
  ncomp = min(nrow(x) - 1, ncol(x), 20),
  center = TRUE,
  scale = FALSE,
  exclrows = NULL,
  exclcols = NULL,
  x.test = NULL,
  method = "svd",
  rand = NULL,
  lim.type = "ddmoments",
  alpha = 0.05,
  gamma = 0.01,
  info = ""
)

Value

Returns an object of pca class with following fields:

ncomp

number of components included to the model.

ncomp.selected

selected (optimal) number of components.

loadings

matrix with loading values (nvar x ncomp).

eigenvals

vector with eigenvalues for all existent components.

expvar

vector with explained variance for each component (in percent).

cumexpvar

vector with cumulative explained variance for each component (in percent).

T2lim

statistical limit for T2 distance.

Qlim

statistical limit for Q residuals.

info

information about the model, provided by user when build the model.

calres

an object of class pcares with PCA results for a calibration data.

testres

an object of class pcares with PCA results for a test data, if it was provided.

More details and examples can be found in the Bookdown tutorial.

Arguments

x

calibration data (matrix or data frame).

ncomp

maximum number of components to calculate.

center

logical, do mean centering of data or not.

scale

logical, do standardization of data or not.

exclrows

rows to be excluded from calculations (numbers, names or vector with logical values)

exclcols

columns to be excluded from calculations (numbers, names or vector with logical values)

x.test

test data (matrix or data frame).

method

method to compute principal components ("svd", "nipals").

rand

vector with parameters for randomized PCA methods (if NULL, conventional PCA is used instead)

lim.type

which method to use for calculation of critical limits for residual distances (see details)

alpha

significance level for extreme limits for T2 and Q disances.

gamma

significance level for outlier limits for T2 and Q distances.

info

a short text with model description.

Author

Sergey Kucheryavskiy (svkucheryavski@gmail.com)

Details

Note, that from v. 0.10.0 cross-validation is no more supported in PCA.

If number of components is not specified, a minimum of number of objects - 1 and number of variables in calibration set is used. One can also specified an optimal number of component, once model is calibrated (ncomp.selected). The optimal number of components is used to build a residuals distance plot, as well as for SIMCA classification.

If some of rows of calibration set should be excluded from calculations (e.g. because they are outliers) you can provide row numbers, names, or logical values as parameter exclrows. In this case they will be completely ignored we model is calibrated. However, score and residuls distances will be computed for these rows as well and then hidden. You can show them on corresponding plots by using parameter show.excluded = TRUE.

It is also possible to exclude selected columns from calculations by provideing parameter exclcols in form of column numbers, names or logical values. In this case loading matrix will have zeros for these columns. This allows to compute PCA models for selected variables without removing them physically from a dataset.

Take into account that if you see other packages to make plots (e.g. ggplot2) you will not be able to distinguish between hidden and normal objects.

By default loadings are computed for the original dataset using either SVD or NIPALS algorithm. However, for datasets with large number of rows (e.g. hyperspectral images), there is a possibility to run algorithms based on random permutations [1, 2]. In this case you have to define parameter rand as a vector with two values: p - oversampling parameter and k - number of iterations. Usually rand = c(15, 0) or rand = c(5, 1) are good options, which give quite almost precise solution but much faster.

There are several ways to calculate critical limits for orthogonal (Q, q) and score (T2, h) distances. In mdatools you can specify one of the following methods via parameter lim.type: "jm" Jackson-Mudholkar approach [3], "chisq" - method based on chi-square distribution [4], "ddmoments" and "ddrobust" - related to data driven method proposed in [5]. The "ddmoments" is based on method of moments for estimation of distribution parameters (also known as "classical" approach) while "ddrobust" is based in robust estimation.

If lim.type="chisq" or lim.type="jm" is used, only limits for Q-distances are computed based on corresponding approach, limits for T2-distances are computed using Hotelling's T-squared distribution. The methods utilizing the data driven approach calculate limits for combination of the distances bases on chi-square distribution and parameters estimated from the calibration data.

The critical limits are calculated for a significance level defined by parameter 'alpha'. You can also specify another parameter, 'gamma', which is used to calculate acceptance limit for outliers (shown as dashed line on residual distance plot).

You can also recalculate the limits for existent model by using different values for alpha and gamme, without recomputing the model itself. In this case use the following code (it is assumed that you current PCA/SIMCA model is stored in variable m): m = setDistanceLimits(m, lim.type, alpha, gamma).

In case of PCA the critical limits are just shown on residual plot as lines and can be used for detection of extreme objects (solid line) and outliers (dashed line). When PCA model is used for classification in SIMCA (see simca) the limits are also employed for classification of objects.

References

1. N. Halko, P.G. Martinsson, J.A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53 (2010) pp. 217-288.

2. S. Kucheryavskiy, Blessing of randomness against the curse of dimensionality, Journal of Chemometrics, 32 (2018).

3. J.E. Jackson, A User's Guide to Principal Components, John Wiley & Sons, New York, NY (1991).

4. A.L. Pomerantsev, Acceptance areas for multivariate classification derived by projection methods, Journal of Chemometrics, 22 (2008) pp. 601-609.

5. A.L. Pomerantsev, O.Ye. Rodionova, Concept and role of extreme objects in PCA/SIMCA, Journal of Chemometrics, 28 (2014) pp. 429-438.

See Also

Methods for pca objects:

plot.pcamakes an overview of PCA model with four plots.
summary.pcashows some statistics for the model.
categorize.pcacategorize data rows as "normal", "extreme" or "outliers".
selectCompNum.pcaset number of optimal components in the model
setDistanceLimits.pcaset critical limits for residuals
predict.pcaapplies PCA model to a new data.

Plotting methods for pca objects:

plotScores.pcashows scores plot.
plotLoadings.pcashows loadings plot.
plotVariance.pcashows explained variance plot.
plotCumVariance.pcashows cumulative explained variance plot.
plotResiduals.pcashows plot for residual distances (Q vs. T2).
plotBiplot.pcashows bi-plot.
plotExtreme.pcashows extreme plot.
plotT2DoFplot with degrees of freedom for score distance.
plotQDoFplot with degrees of freedom for orthogonal distance.
plotDistDoFplot with degrees of freedom for both distances.

Most of the methods for plotting data are also available for PCA results (pcares) objects. Also check pca.mvreplace, which replaces missing values in a data matrix with approximated using iterative PCA decomposition.

Examples

Run this code
library(mdatools)
### Examples for PCA class

## 1. Make PCA model for People data with autoscaling

data(people)
model = pca(people, scale = TRUE, info = "Simple PCA model")
model = selectCompNum(model, 4)
summary(model)
plot(model, show.labels = TRUE)

## 2. Show scores and loadings plots for the model

par(mfrow = c(2, 2))
plotScores(model, comp = c(1, 3), show.labels = TRUE)
plotScores(model, comp = 2, type = "h", show.labels = TRUE)
plotLoadings(model, comp = c(1, 3), show.labels = TRUE)
plotLoadings(model, comp = c(1, 2), type = "h", show.labels = TRUE)
par(mfrow = c(1, 1))

## 3. Show residual distance and variance plots for the model
par(mfrow = c(2, 2))
plotVariance(model, type = "h")
plotCumVariance(model, show.labels = TRUE, legend.position = "bottomright")
plotResiduals(model, show.labels = TRUE)
plotResiduals(model, ncomp = 2, show.labels = TRUE)
par(mfrow = c(1, 1))

Run the code above in your browser using DataLab