rpca: Randomized principal component analysis (PCA).

Description

Performs an approximate principal component analysis using a randomized singular value decomposition.

Usage

rpca(A, k = NULL, center = TRUE, scale = TRUE, whiten = FALSE, retx = FALSE, svdalg = "auto", p = 5, q = 2, ...)

Arguments

A

array_like a numeric input matrix (or data frame), with dimensions $(m, n)$. If the data contain $NA$s, na.omit is applied.

k

int, optional determines the number of principal components to compute. It is required that $k$ is smaller than or equal to $n$, but it is recommended that $k << min(m,n)$.

center

bool ($TRUE$, $FALSE$), optional a logical value ($TRUE$ by default) indicating whether the variables should be shifted to be zero centered. Alternatively, a vector of length equal to the number of columns of $A$ can be supplied. The value is passed to scale.

scale

bool ($TRUE$, $FALSE$), optional a logical value ($TRUE$ by default) indicating whether the variables should be scaled to have unit variance. Alternatively, a vector of length equal to the number of columns of $A$ can be supplied. The value is passed to scale.

whiten

bool ($TRUE$, $FALSE$), optional a logical value ($FALSE$ by default). When $TRUE$, the eigenvectors are divided by the square root of the singular values, $W = W * diag(1/sqrt(s))$. Whitening can sometimes improve the predictive accuracy.

retx

bool ($TRUE$, $FALSE$), optional a logical value ($FALSE$ by default) indicating whether the rotated variables / scores should be returned.

svdalg

str c('auto', 'rsvd', 'svd'), optional determines which algorithm should be used for computing the singular value decomposition. By default 'auto' is used, which chooses between rsvd and svd depending on the number of principal components: if $k < min(n,m)/1.5$, randomized svd is used.

p

int, optional oversampling parameter for $rsvd$ (default $p=5$), see rsvd.

q

int, optional number of power iterations for $rsvd$ (default $q=2$), see rsvd.

...

arguments passed to or from other methods, see rsvd.
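
The 'auto' rule and the whitening formula above can be illustrated with a short base R sketch. Note that `pick_svdalg` is a hypothetical helper written only to mirror the documented rule $k < min(n,m)/1.5$; it is not part of rsvd. The whitening part uses prcomp as a stand-in and assumes that $s$ denotes the eigenvalues (component variances), which the text above does not state explicitly.

```r
# Hypothetical helper mirroring the documented 'auto' rule (not an rsvd function).
pick_svdalg <- function(m, n, k) {
  if (k < min(m, n) / 1.5) "rsvd" else "svd"
}
alg_small_k <- pick_svdalg(m = 150, n = 4, k = 2)  # 2 <  4 / 1.5, so "rsvd"
alg_full_k  <- pick_svdalg(m = 150, n = 4, k = 4)  # 4 >= 4 / 1.5, so "svd"

# Whitening, illustrated with prcomp (assumption: s holds the eigenvalues).
X <- scale(log(iris[, 1:4]))          # centered and scaled data
pc <- prcomp(X)
W <- pc$rotation                      # eigenvectors / loadings
s <- pc$sdev^2                        # eigenvalues
W_white <- W %*% diag(1 / sqrt(s))    # W = W * diag(1/sqrt(s))

# Scores computed with the whitened rotation have unit variance per component.
white_var <- apply(X %*% W_white, 2, var)
print(white_var)
```

Under these assumptions, whitening rescales every component to the same variance, which is why it can matter for downstream models that are sensitive to feature scale.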


Value

rotation: array_like the matrix containing the rotation (eigenvectors), i.e., the variable loadings; array with dimensions $(n, k)$.
eigvals: array_like the eigenvalues; 1-d array of length $k$.
sdev: array_like the standard deviations of the principal components.
x: array_like if $retx$ is true a matrix containing the scores / rotated data (centred and scaled if requested) is returned.
center, scale: array_like the centering and scaling used, or $FALSE$.
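
As a rough guide to how the returned quantities relate to each other, the following sketch uses prcomp from base R as a stand-in (an assumption made so the example runs without rsvd installed). It relies on the standard PCA identity that the eigenvalues are the squared standard deviations of the components.

```r
# prcomp is used as a stand-in for rpca here (assumption).
pc <- prcomp(log(iris[, 1:4]), center = TRUE, scale. = TRUE)

eigvals    <- pc$sdev^2                  # eigenvalues = squared standard deviations
explained  <- eigvals / sum(eigvals)     # proportion of variance per component
cumulative <- cumsum(explained)          # cumulative proportion of variance

print(dim(pc$rotation))                  # (n, k); here all k = n = 4 components
print(round(cumulative, 4))
```

With scaled input, the eigenvalues sum to the number of variables, so each eigenvalue can be read as "how many original variables' worth" of variance a component carries.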

Details

Principal component analysis is a linear dimensionality reduction technique, aiming to keep only the most significant principal components to allow a better interpretation of the data and to project the data to a lower dimensional space.

Traditionally, the computation is done by a (deterministic) singular value decomposition. Randomized PCA instead uses a fast randomized algorithm (rsvd) to compute an approximate low-rank SVD. The computational gain is high if the desired number of principal components is small, i.e., $k << n$.
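
To make the idea of a randomized low-rank SVD concrete, here is a minimal base R sketch of the general technique (random sketch, power iterations, then a small deterministic SVD). It is not the rsvd implementation; the variable names and the test matrix are illustrative assumptions, with $p$ and $q$ playing the roles of the oversampling and power-iteration parameters described above.

```r
set.seed(1)
m <- 200; n <- 50   # matrix dimensions
k <- 5              # target rank
p <- 5              # oversampling
q <- 2              # power iterations

# Illustrative test matrix of exact rank k.
A <- matrix(rnorm(m * k), m, k) %*% matrix(rnorm(k * n), k, n)

# 1. Sketch the column space of A with a random test matrix.
Omega <- matrix(rnorm(n * (k + p)), n, k + p)
Y <- A %*% Omega

# 2. Power iterations with re-orthonormalization for numerical stability.
for (i in seq_len(q)) {
  Y <- qr.Q(qr(Y))
  Z <- qr.Q(qr(crossprod(A, Y)))   # orthonormal basis for t(A) %*% Y
  Y <- A %*% Z
}

# 3. Orthonormal basis Q for the sketch, then project A onto it.
Q <- qr.Q(qr(Y))
B <- crossprod(Q, A)               # small (k + p) x n matrix

# 4. Deterministic SVD of the small matrix; map left vectors back.
sv <- svd(B)
U_approx <- Q %*% sv$u[, 1:k]
d_approx <- sv$d[1:k]

# Compare against the full deterministic SVD.
d_exact <- svd(A)$d[1:k]
print(cbind(d_approx, d_exact))
```

The expensive SVD is only applied to the $(k + p) \times n$ matrix `B`, which is where the savings come from when $k$ is much smaller than $n$.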

rsvd expects a numeric (real/complex) input matrix with dimensions $(m, n)$. Given a target rank $k$, rsvd factors the input matrix $A$ as $A = W * diag(s) * W'$. The columns of the real or complex unitary matrix $W$ contain the eigenvectors (i.e., the principal components), and the vector $s$ contains the corresponding eigenvalues. Following prcomp, we refer to $W$ as the rotation matrix (commonly also called the loadings).
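
One way to read the factorization $W * diag(s) * W'$ is as an eigendecomposition of the covariance (or, with scaling, correlation) matrix of the centered data. That reading is an assumption here, since the text above writes it in terms of $A$ itself. The sketch below checks the identity with prcomp from base R as a stand-in.

```r
X  <- log(iris[, 1:4])
pc <- prcomp(X, center = TRUE, scale. = TRUE)

W <- pc$rotation     # eigenvectors (rotation / loadings)
s <- pc$sdev^2       # eigenvalues

# Reconstruct the correlation matrix from the eigenvectors and eigenvalues.
C_rebuilt <- W %*% diag(s) %*% t(W)
C_direct  <- cor(X)

# Largest absolute difference between the two matrices.
recon_err <- max(abs(unname(C_rebuilt) - unname(C_direct)))
print(recon_err)
```

When only $k < n$ components are kept, the same product gives a rank-$k$ approximation rather than an exact reconstruction.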

The print and summary methods can be used to present the results in a readable format. A scree plot can be produced with the plot function or, as recommended, with ggscreeplot. A biplot can be produced with ggbiplot, and a correlation plot with ggcorplot.

The predict function can be used to compute the scores of new observations. The data will automatically be centred (and scaled if requested). This is not fully supported for complex input matrices.
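
The scoring step described above can be sketched by hand: new observations are centered and scaled with the values stored from the fit, then multiplied by the rotation matrix. The example uses prcomp and its predict method as a stand-in, on the assumption that the predict method for rpca objects follows the same convention; the train/new split is purely illustrative.

```r
train   <- log(iris[1:100, 1:4])     # data used to fit the PCA
newobs  <- log(iris[101:150, 1:4])   # "new" observations to score

pc <- prcomp(train, center = TRUE, scale. = TRUE)

# Manual scoring: reuse the training center/scale, then rotate.
scores_manual <- scale(as.matrix(newobs),
                       center = pc$center,
                       scale  = pc$scale) %*% pc$rotation

# Same result via the predict method.
scores_pred <- predict(pc, newdata = newobs)

score_diff <- max(abs(scores_manual - scores_pred))
print(score_diff)
```

The important detail is that the centering and scaling come from the training data, not from the new observations themselves.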

Examples

library(rsvd)
#
# Load Edgar Anderson's Iris Data
#
data(iris)

#
# log transform
#
log.iris <- log( iris[ , 1:4] )
iris.species <- iris[ , 5]

#
# Perform rPCA and compute all PCs, similar to prcomp
#
iris.rpca <- rpca(log.iris, retx=TRUE,  svdalg = 'rsvd')
summary(iris.rpca) # Summary
print(iris.rpca) # Prints the loadings/ rotations

# You can compare the results with prcomp
# iris.pca <- prcomp(log.iris, center = TRUE, scale. = TRUE)
# summary(iris.pca) # Summary
# print(iris.pca) # Prints the loadings/ rotations

#
# Plot functions
#
ggscreeplot(iris.rpca) # Screeplot
ggscreeplot(iris.rpca, 'cum') # Cumulative screeplot
ggscreeplot(iris.rpca, type='eigenvals') # Screeplot of the eigenvalues

ggcorplot(iris.rpca, pcs=c(1,2)) # The correlation of the original variable with the PCs

ggbiplot(iris.rpca, groups = iris.species, circle = FALSE) #Biplot

#
# Perform rPCA and compute only the first two PCs
#
iris.rpca <- rpca(log.iris, k=2,  svdalg = 'rsvd')
summary(iris.rpca) # Summary
print(iris.rpca) # Prints the loadings/ rotations

#
# Compute the scores of new observations
#
preds <- predict(iris.rpca, newdata=data.frame(log.iris))
