Learn R Programming

bootSVD (version 1.1)

bootSVD: Calculates bootstrap distribution of PCA (i.e. SVD) results

Description

Applies fast bootstrap PCA, using the method from (Fisher et al., 2014). Dimension of the sample is denoted by \(p\), and sample size is denoted by \(n\), with \(p>n\).

Usage

bootSVD(Y = NULL, K, V = NULL, d = NULL, U = NULL, B = 50,
  output = "HD_moments", verbose = getOption("verbose"), bInds = NULL,
  percentiles = c(0.025, 0.975), centerSamples = TRUE, pattern_V = "V_",
  pattern_Vb = "Vb_")

Arguments

Y

initial data sample, which can be either a matrix or a ff matrix. Y can be either tall (\(p\) by \(n\)) or wide (\(n\) by \(p\)). If Y is entered and V, d and U (see definitions below) are not entered, then bootSVD will also compute the SVD of Y. In this case where the SVD is computed, bootSVD will assume that the larger dimension of Y is \(p\), and the smaller dimension of code Y is \(n\) (i.e. bootSVD assumes that (\(p>n\)). This assumption can be overriden by manually entering V, U and d. For cases where the entire data matrix can be easily stored in memory (e.g. \(p<50000\)), it is generally appropriate to enter Y as a standard matrix. When Y is large enough that matrix algebra on Y is too demanding for memory though, Y should be entered as a ff object, where the actual data is stored on disk. If Y has class ff, and V, d or U is not entered, then block matrix algebra will be used to calculate the PCs and bootstrap PCs. The results of these calculations will be returned as ff objects as well.

K

number of PCs to calculate the bootstrap distribution for.

V

(optional) the (\(p\) by \(n\)) full matrix of \(p\)-dimensional PCs for the sample data matrix. If Y is wide, these are the right singular vectors of Y (i.e. \(Y=UDV'\)). If Y is tall, these are the left singular vectors of Y (i.e. \(Y=VDU'\)). In general it is assumed that \(p>n\), however, this can be overridden by setting V and U appropriately. Like Y, the argument V can be either a standard matrix or a ff matrix. If V is a ff object, the bootstrap PCs, if requested, will be returned as ff objects as well.

d

(optional) \(n\)-length vector of the singular values of Y. For example, if Y is tall, then we have \(Y=VDU'\) with D=diag(d).

U

(optional) the (\(n\) by \(n\)) full set of \(n\)-dimensional singular vectors of Y. If Y is wide, these are the left singular vectors of Y (i.e. \(Y=UDV'\)). If Y is tall, these are the right singular vectors of Y (i.e. \(Y=VDU'\)).

B

number of bootstrap samples to compute.

output

a vector telling which descriptions of the bootstrap distribution should be calculated. Can include any of the following: 'initial_SVD', 'HD_moments', 'full_HD_PC_dist', and 'HD_percentiles'. See below for explanations of these outputs. For especially high dimensional cases, caution should be used if requesting 'full_HD_PC_dist' due to potential storage limitations.

verbose

if TRUE, the function will print progress during calculation procedure.

bInds

a (\(B\) by \(n\)) matrix of bootstrap indeces, where B is the number of bootstrap samples, and n is the sample size. The purpose of setting a specific bootstrap sampling index is to allow the results to be more precisely compared against standard bootstrap PCA calculations. If entered, the bInds argument will override the B argument.

percentiles

a vector containing percentiles to be used to calculate element-wise percentiles across the bootstrap distribution (both across the distribution of \(p\)-dimensional components and the distribution of \(n\)-dimensional components). For example, percentiles=c(.025,.975) will return the 2.5 and 97.5 percentiles, which can be used as \(95\) percent bootstrap percentile CIs. Alternatively, a longer vector of percentiles can be entered.

centerSamples

whether each bootstrap sample should be centered before calculating the SVD.

pattern_V

if Y is a class ff object, then the returned PCs of Y will also be a class ff object. pattern_V is passed to ff in creation of the initial_SVD output. Specifically, pattern_V is a filename prefix used for storing the high dimensional PCs of the original sample.

pattern_Vb

if Y or V is a class ff object, then the returned bootstrap PCs will also be class ff objects. pattern_Vb is passed to ff in creation of the full_HD_PC_dist output. Specifically, pattern_Vb is a filename prefix used for storing the high dimensional bootstrap PCs.

Value

bootSVD returns a list that can include any of the following elements, depending on what is specified in the output argument:

initial_SVD

The singular value decomposition of the centered, original data matrix. initial_SVD is a list containing V, the matrix of \(p\)-dimensional principal components, d, the vector of singular values of Y, and U, the matrix of \(n\)-dimensional singular vectors of Y.

HD_moments

A list containing the bootstrap expected value (EPCs), element-wise bootstrap variance (varPCs), and element-wise bootstrap standard deviation (sdPCs) for each of the \(p\)-dimensional PCs. Each of these three elements of HD_moments is also a list, which contains \(K\) vectors, one for each PC. HD_moments also contains momentCI, a \(K\)-length list of (\(p\) by 2) matrices containing element-wise moment based confidence intervals for the PCs.

full_HD_PC_dist

A \(B\)-length list of matrices (or ff matrices), with the \(b^{th}\) list element equal to the (\(p\) by \(K\)) matrix of high dimensional PCs for the \(b^{th}\) bootstrap sample. For especially high dimensional cases when the output is returned as ff matrices, caution should be used if requesting 'full_HD_PC_dist' due to potential storage limitations. To reindex these PCs by \(k\) (the PC index) as opposed to \(b\) (the bootstrap index), see reindexMatricesByK. Again though, caution shoulded be used when reindexing PCs stored as ff objects, as this will double the number of files stored.

HD_percentiles

A list of \(K\) matrices, each of dimension (\(p\) by \(q\)), where \(q\) is the number of percentiles requested (i.e. \(q\) = length(percentiles)). The \(k^{th}\) matrix in HD_percentiles contains element-wise percentiles for the \(k^{th}\), \(p\)-dimensional PC.

In addition, the following results are always included in the output, regardless of what is specified in the output argument:

full_LD_PC_dist

A \(B\)-length list of matrices, with the \(b^{th}\) list element equal to the (\(p\) by \(K\)) matrix of PCs of the scores in the \(b^{th}\) bootstrap sample. To reindex these vectors by \(k\) (the PC index), see reindexMatricesByK.

d_dist

A B-length list of vectors, with the \(b^{th}\) element of d_dist containing the \(n\)-length vector of singular values from the \(b^{th}\) bootstrap sample. To reindex these values by \(k\) (the PC index), see reindexVectorsByK.

U_dist

A B-length list of (\(n\) by \(K\)) matrices, with the columns of the \(b^{th}\) matrix containing the \(n\)-length singular vectors from the \(b^{th}\) bootstrap sample. To reindex these vectors by \(k\) (the PC index), see reindexMatricesByK.

LD_moments

A list that is comparable to HD_moments, but that instead describes the variability of the \(n\)-dimensional principal components of the resampled score matrices. LD_moments contains the bootstrap expected value (EPCs), element-wise bootstrap variances (varPCs), and element-wise bootstrap standard deviations (sdPCs) for each of the \(n\)-dimensional PCs. Each of these three elements of LD_moments is also a list, which contains \(K\) vectors, one for each PC. LD_moments also contains momentCI, a list of \(K\) (\(n\) by 2) matrices containing element-wise, moment-based confidence intervals for the PCs.

LD_percentiles

A list of \(K\) matrices, each of dimension (\(p\) by \(q\)), where \(q\) is the number of percentiles requested (i.e. \(q\) = length(percentiles)). The \(k^{th}\) matrix in LD_percentiles contains element-wise percentiles for the \(k^{th}\) \(n\)-dimensional PC.

Details

Users might also consider changing the global options ffbatchbytes, from the ff package, and mc.cores, from the parallel package. When ff objects are entered as arguments for bootSVD, the required matrix algebra is done using block matrix alebra. The ffbatchbytes option determines the size of the largest block matrix that will be held in memory at any one time. The mc.cores option (set to 1 by default) determines the level of parallelization to use when calculating the high dimensional distribution of the bootstrap PCs (see mclapply).

References

Aaron Fisher, Brian Caffo, and Vadim Zipunnikov. Fast, Exact Bootstrap Principal Component Analysis for p>1 million. 2014. http://arxiv.org/abs/1405.0922

Examples

Run this code
# NOT RUN {
#use small n, small B, for a quick illustration
set.seed(0)
Y<-simEEG(n=100, centered=TRUE, wide=TRUE) 
b<-bootSVD(Y, B=50, K=2, output= 
 	c('initial_SVD', 'HD_moments', 'full_HD_PC_dist',
 	'HD_percentiles'), verbose=interactive())
b

#explore results
matplot(b$initial_SVD$V[,1:4],type='l',main='Fitted PCs',lty=1)
legend('bottomright',paste0('PC',1:4),col=1:4,lty=1,lwd=2)

######################
# look specifically at 2nd PC
k<-2

######
#looking at HD variability

#plot several draws from bootstrap distribution
VsByK<-reindexMatricesByK(b$full_HD_PC_dist)
matplot(t(VsByK[[k]][1:20,]),type='l',lty=1,
		main=paste0('20 Draws from bootstrap\ndistribution of HD PC ',k))

#plot pointwise CIs
matplot(b$HD_moments$momentCI[[k]],type='l',col='blue',lty=1,
		main=paste0('CIs For HD PC ',k))
matlines(b$HD_percentiles[[k]],type='l',col='darkgreen',lty=1)
lines(b$initial_SVD$V[,k])
legend('topright',c('Fitted PC','Moment CIs','Percentile CIs'),
		lty=1,col=c('black','blue','darkgreen'))
abline(h=0,lty=2,col='darkgrey')

######
# looking at LD variability

# plot several draws from bootstrap distribution
AsByK<-reindexMatricesByK(b$full_LD_PC_dist)
matplot(t(AsByK[[k]][1:50,]),type='l',lty=1,
		main=paste0('50 Draws from bootstrap\ndistribution of LD PC ',k),
	xlim=c(1,10),xlab='PC index (truncated)')

# plot pointwise CIs
matplot(b$LD_moments$momentCI[[k]],type='o',col='blue',
		lty=1,main=paste0('CIs For LD PC ',k),xlim=c(1,10),
		xlab='PC index (truncated)',pch=1)
matlines(b$LD_percentiles[[k]],type='o',pch=1,col='darkgreen',lty=1)
abline(h=0,lty=2,col='darkgrey')
legend('topright',c('Moment CIs','Percentile CIs'),lty=1,
		pch=1,col=c('blue','darkgreen'))
#Note: variability is mostly due to rotations with the third and fourth PC.

# Bootstrap eigenvalue distribution
dsByK<-reindexVectorsByK(b$d_dist)
boxplot(dsByK[[k]]^2,main=paste0('Covariance Matrix Eigenvalue ',k),
		ylab='Bootstrap Distribution',
		ylim=range(c(dsByK[[k]]^2,b$initial_SVD$d[k]^2)))
points(b$initial_SVD$d[k]^2,pch=18,col='red')
legend('bottomright','Sample Value',pch=18,col='red')


##################
#Example with ff input
library(ff)
Yff<-as.ff(Y, pattern='Y_')
# If desired, change options in 'ff' package to
# adjust the size of matrix blocks held in RAM.
# For example:
# options('ffbatchbytes'=100000)
ff_dir<-tempdir()
pattern_V <- paste0(ff_dir,'/V_')
pattern_Vb <- paste0(ff_dir,'/Vb_')
bff <- bootSVD(Yff, B=50, K=2, output=c('initial_SVD', 'HD_moments',
 	'full_HD_PC_dist', 'HD_percentiles'), pattern_V= pattern_V,
 	pattern_Vb=pattern_Vb, verbose=interactive())


# Note that elements of full_HD_PC_dist and initial_SVD
# have class 'ff'
str(lapply(bff,function(x) class(x[[1]])))
#Show some results of bootstrap draws
plot(bff$full_HD_PC_dist[[1]][,k],type='l')
#Reindexing by K will create a new set of ff files.
VsByKff<-reindexMatricesByK(bff$full_HD_PC_dist,
 	pattern=paste0(ff_dir,'/Vk_'))
physical(bff$full_HD_PC_dist[[1]])$filename
physical(VsByKff[[1]])$filename
matplot(t(VsByKff[[k]][1:10,]),type='l',lty=1,
main=paste0('Bootstrap Distribution of PC',k))


# Saving and moving results:
saveRDS(bff,file=paste0(ff_dir,'/bff.rds'))
close(bff$initial_SVD$V)
physical(bff$initial_SVD$V)$filename
# If the 'ff' files on disk are moved or renamed,
# this filename attribute can be changed:
old_ff_path <- physical(bff$initial_SVD$V)$filename
new_ff_path <- paste0(tempdir(),'/new_V_file.ff')
file.rename(from= old_ff_path, to= new_ff_path)
physical(bff$initial_SVD$V)$filename <- new_ff_path
matplot(bff$initial_SVD$V[,1:4],type='l',lty=1)

# }

Run the code above in your browser using DataLab