SPC.cv: Perform cross-validation on sparse principal component analysis

Description

Selects the tuning parameter for the sparse principal component analysis method of Witten, Tibshirani, and Hastie (2008), which involves applying the penalized matrix decomposition (PMD) to a data matrix with a lasso ($L_1$) penalty on the columns and no penalty on the rows. The tuning parameter controls the sum of absolute values (the $L_1$ norm) of the elements of the sparse principal component.

Usage

SPC.cv(x, sumabsvs=seq(1.2, 5,len=10), nfolds=5, niter=5, v=NULL,
trace=TRUE, orth=FALSE, center=TRUE, vpos=FALSE, vneg=FALSE)

Arguments

x

Data matrix of dimension $n \times p$, which can contain NA for missing values. We are interested in finding sparse principal components of dimension $p$.

sumabsvs

Range of sumabsv values to be considered in cross-validation. sumabsv is the sum of absolute values of the elements of v. It must be between 1 and the square root of the number of columns of the data. The smaller it is, the sparser v will be.
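As a small illustration of the bound described above, a candidate grid can be built in base R so that it stays between 1 and the square root of the number of columns. The column count p below is an assumed value chosen only for the example, not something taken from the package.

```r
# Assumed number of columns, for illustration only.
p <- 300

# Upper bound on sumabsv: the square root of the number of columns.
upper <- sqrt(p)

# Six candidate values, from fairly sparse (1.2) to the upper bound.
sumabsvs <- seq(1.2, upper, length.out = 6)
range(sumabsvs)
```

A grid built this way can be passed as the sumabsvs argument.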

nfolds

Number of cross-validation folds performed.

niter

How many iterations should be performed. By default, perform only 5 for speed reasons.

v

The first right singular vector(s) of the data. (If missing data is present, then the missing values are imputed before the singular vectors are calculated.) v is used as the initial value for the iterative PMD($L_1$, $L_1$) algorithm. If x is

trace

Print out progress as iterations are performed? Default is TRUE.

orth

If TRUE, then use method of Section 3.2 of Witten, Tibshirani and Hastie (2008) to obtain multiple sparse principal components. Default is FALSE.

center

Subtract out mean of x? Default is TRUE.

vpos

Constrain elements of v to be positive? Default is FALSE.

vneg

Constrain elements of v to be negative? Default is FALSE.

Value

cv

Average sum of squared errors that results for each tuning parameter value.

cv.error

Standard error of the average sum of squared error that results for each tuning parameter value.

bestsumabsv

Value of sumabsv that resulted in lowest CV error.

nonzerovs

Average number of non-zero elements of v for each candidate value of sumabsvs.

v.init

Initial value of v that was passed in. Or, if that was NULL, then first right singular vector of X.

bestsumabsv1se

The smallest value of sumabsv that is within 1 standard error of smallest CV error.
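To make the relationship between cv, cv.error, bestsumabsv, and bestsumabsv1se concrete, here is a minimal base R sketch of a one-standard-error rule. The helper one_se_choice and the numbers fed to it are made up for illustration; the package's own computation may differ in details such as strict versus non-strict comparison.

```r
# Hypothetical helper: pick the best and the one-standard-error tuning values.
one_se_choice <- function(sumabsvs, cv, cv.error) {
  best <- which.min(cv)                      # index of the lowest CV error
  threshold <- cv[best] + cv.error[best]     # lowest error plus its standard error
  within <- cv <= threshold                  # candidates within one standard error
  list(bestsumabsv = sumabsvs[best],
       bestsumabsv1se = min(sumabsvs[within]))
}

# Made-up cross-validation results for four candidate values.
res <- one_se_choice(sumabsvs = c(1.2, 2, 3, 4),
                     cv       = c(10, 8, 7.5, 7.8),
                     cv.error = c(0.5, 0.6, 0.6, 0.5))
res
```

Choosing the smaller, one-standard-error value trades a slightly higher CV error for a sparser v.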

Details

This method only performs cross-validation for the first sparse principal component. It does so by performing the following steps nfolds times: (1) replace a fraction of the data with missing values, (2) perform SPC on this new data matrix using a range of tuning parameter values, each time getting a rank-1 approximation $udv'$ where $v$ is sparse, (3) measure the mean squared error of the rank-1 estimate of the missing values created in step 1.

The selected tuning parameter value is then the one that resulted in the lowest average mean squared error in step 3.
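The three steps above can be sketched in base R as follows. This is a simplified illustration, not the package's code: rank1_sparse is a hypothetical stand-in for SPC that soft-thresholds v at a fixed level lambda instead of enforcing a bound on its $L_1$ norm, it imputes held-out cells with the overall mean, and it skips centering. The function names, the lambda grid, and the simulated data are all assumptions.

```r
# Soft-thresholding operator used to make v sparse.
soft <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

# Hypothetical stand-in for SPC: rank-1 fit u d v' with a sparse v.
rank1_sparse <- function(x, lambda, niter = 5) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)       # crude imputation of missing cells
  v <- svd(x, nu = 0, nv = 1)$v[, 1]         # start from first right singular vector
  for (i in seq_len(niter)) {
    u <- x %*% v
    u <- u / sqrt(sum(u^2))                  # unit-norm u
    v <- soft(crossprod(x, u), lambda)       # sparse update of v
    if (all(v == 0)) return(matrix(0, nrow(x), ncol(x)))
    v <- v / sqrt(sum(v^2))                  # unit-norm v
  }
  d <- drop(crossprod(u, x %*% v))
  d * (u %*% t(v))                           # rank-1 approximation
}

# Cross-validation over a grid of (hypothetical) lambda values.
cv_rank1 <- function(x, lambdas, nfolds = 5) {
  fold <- matrix(sample(rep_len(seq_len(nfolds), length(x))), nrow(x), ncol(x))
  err <- matrix(NA_real_, nfolds, length(lambdas))
  for (f in seq_len(nfolds)) {
    held <- fold == f
    x_miss <- x
    x_miss[held] <- NA                               # step 1: hide a fraction of cells
    for (j in seq_along(lambdas)) {
      fit <- rank1_sparse(x_miss, lambdas[j])        # step 2: sparse rank-1 fit
      err[f, j] <- mean((x[held] - fit[held])^2)     # step 3: error on hidden cells
    }
  }
  colMeans(err)                                      # average error per tuning value
}

set.seed(1)
u <- c(rnorm(10), rep(0, 30))
v <- c(rnorm(15), rep(0, 45))
x <- 3 * outer(u, v) + matrix(rnorm(40 * 60), ncol = 60)
lambdas <- c(0, 1, 2)
cv_err <- cv_rank1(x, lambdas)
lambdas[which.min(cv_err)]                           # tuning value with lowest average error
```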

To perform cross-validation for the second sparse principal component, apply this function to $X-udv'$, where $u$, $d$, and $v$ are the output of running SPC on the raw data $X$.
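A possible way to carry out this deflation is sketched below. It assumes the PMA package is installed and that the object returned by SPC exposes u, d, and v, as the example code on this page suggests. The simulated data and the grid are illustrative, and centering is left at its default, so treat this as a sketch rather than a prescribed workflow.

```r
library(PMA)  # provides SPC and SPC.cv

# Small simulated data set, for illustration only.
set.seed(1)
u <- c(rnorm(15), rep(0, 35))
v <- c(rnorm(20), rep(0, 60))
x <- outer(u, v) + matrix(rnorm(50 * 80), ncol = 80)
grid <- seq(1.2, sqrt(ncol(x)), length.out = 4)

# First component: cross-validate, then fit on the raw data.
cv1 <- SPC.cv(x, sumabsvs = grid, trace = FALSE)
fit1 <- SPC(x, sumabsv = cv1$bestsumabsv, K = 1, trace = FALSE)

# Deflate: subtract the rank-1 fit u d v' from the data.
x_resid <- x - fit1$d[1] * fit1$u %*% t(fit1$v)

# Second component: cross-validate on the deflated matrix.
cv2 <- SPC.cv(x_resid, sumabsvs = grid, trace = FALSE)
cv2$bestsumabsv
```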

References

Witten, D. M., Tibshirani, R., and Hastie, T. (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted.

Examples

# A simple simulated example
set.seed(1)
u <- matrix(c(rnorm(50), rep(0,150)),ncol=1)
v <- matrix(c(rnorm(75),rep(0,225)), ncol=1)
x <- u%*%t(v)+matrix(rnorm(200*300),ncol=300)
# Perform Sparse PCA - that is, decompose a matrix w/o penalty on rows
# and w/ L1 penalty on columns
# First, we perform sparse PCA and get 4 components, but we do not
# require subsequent components to be orthogonal to previous components
cv.out <- SPC.cv(x, sumabsvs=seq(1.2, sqrt(ncol(x)), len=6))
print(cv.out)
plot(cv.out)
out <- SPC(x,sumabsv=cv.out$bestsumabsv, K=4) # could use
# cv.out$bestsumabsv1se instead
print(out,verbose=TRUE)
# Now, we do sparse PCA using method in Section 3.2 of WT&H(2008) for getting
# multiple components - that is, we require components to be orthogonal
cv.out <- SPC.cv(x, sumabsvs=seq(1.2, sqrt(ncol(x)), len=6), orth=TRUE)
print(cv.out)
plot(cv.out)
out.orth <- SPC(x,sumabsv=cv.out$bestsumabsv, K=4, orth=TRUE)
print(out.orth,verbose=TRUE)
par(mfrow=c(1,1))
plot(out$u[,1], out.orth$u[,1], xlab="", ylab="")
