By repeatedly splitting the data into training and test sets, the number of principal components can be determined by inspecting the distribution of the explained variances.
pcaCV(X, amax, center = TRUE, scale = TRUE, repl = 50, segments = 4,
segment.type = c("random", "consecutive", "interleaved"), length.seg, trace = FALSE,
plot.opt = TRUE, ...)
matrix of explained variances (repl rows, amax columns)
matrix of MSEP values (repl rows, amax columns)
X: numeric data frame or matrix
amax: maximum number of components for evaluation
center: should the data be centered? TRUE or FALSE
scale: should the data be scaled? TRUE or FALSE
repl: number of replications of the CV procedure
segments: number of segments for CV
segment.type: "random", "consecutive", or "interleaved" splitting into training and test data
length.seg: number of parts for training and test data; overrides segments
trace: if TRUE, intermediate results are reported
plot.opt: if TRUE, the results are shown by boxplots
...: additional graphics parameters; see par
Peter Filzmoser <P.Filzmoser@tuwien.ac.at>
For cross validation the data are split into a number of segments. PCA is computed (using 1 to amax components) for all but one segment, and the scores of the segment left out are calculated. This is done in turn, omitting each segment once, so that a complete score matrix results for each desired number of components, and the error matrices of fit can be computed. A measure of fit is the explained variance, which is computed for each number of components. The whole procedure is then repeated repl times, resulting in repl values of the explained variance for 1 to amax components, i.e. a matrix. This matrix is presented by boxplots, where each boxplot summarizes the explained variance for a certain number of principal components.
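As an illustration only, a minimal sketch of this leave-one-segment-out scheme could look as follows. It is not the implementation used by pcaCV; random segments, PCA via prcomp, and the name pca_cv_sketch are assumptions made for this sketch.

## Sketch of the CV scheme described above (illustration only, not pcaCV itself)
pca_cv_sketch <- function(X, amax = 5, repl = 50, segments = 4) {
  X <- as.matrix(X)
  expl <- matrix(NA_real_, nrow = repl, ncol = amax)
  for (r in seq_len(repl)) {
    ## randomly assign the observations to segments
    fold <- sample(rep(seq_len(segments), length.out = nrow(X)))
    press <- numeric(amax)   # accumulated squared reconstruction error per no. of PCs
    sstot <- 0               # total sum of squares of the left-out observations
    for (s in seq_len(segments)) {
      train <- X[fold != s, , drop = FALSE]
      test  <- X[fold == s, , drop = FALSE]
      pc  <- prcomp(train, center = TRUE, scale. = TRUE)
      tst <- scale(test, center = pc$center, scale = pc$scale)
      sstot <- sstot + sum(tst^2)
      for (a in seq_len(amax)) {
        V   <- pc$rotation[, seq_len(a), drop = FALSE]
        fit <- tst %*% V %*% t(V)   # scores of the left-out segment, back-projected
        press[a] <- press[a] + sum((tst - fit)^2)
      }
    }
    expl[r, ] <- 1 - press / sstot  # explained variance for 1 to amax components
  }
  expl  # repl x amax matrix of explained variances
}

Boxplots of the columns of such a matrix (one boxplot per number of components) correspond to the plot produced when plot.opt is TRUE.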
K. Varmuza and P. Filzmoser: Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press, Boca Raton, FL, 2009.
data(glass)
x.sc <- scale(glass)
resv <- pcaCV(x.sc, amax = 5, segments = 4, repl = 50)
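The returned matrices of explained variances and MSEP values (repl rows, amax columns) can then be summarized per number of components; the component names ExplVar and MSEP used below are an assumption for illustration.

## ExplVar and MSEP are assumed component names of the result (illustration only)
apply(resv$ExplVar, 2, median)   # median explained variance per number of components
apply(resv$MSEP, 2, median)      # median MSEP per number of components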