bpca(Matrix, nPcs = 2, maxSteps = 100, verbose = interactive(), threshold = 1e-04, ...)
Matrix: matrix -- Pre-processed matrix (centered, scaled) with variables in columns and observations in rows. The data may contain missing values, denoted as NA.
nPcs: numeric -- Number of components used for re-estimation. Choosing few components may decrease the estimation precision.
maxSteps: numeric -- Maximum number of estimation steps.
verbose: boolean -- BPCA prints the number of steps and the increase in precision if set to TRUE. Default is interactive().
threshold: numeric -- Convergence threshold; the iteration stops once the estimated increase in precision falls below this value. Default is 1e-04.
Value: pcaRes -- Standard PCA result object used by all PCA-based methods of this package; see pcaRes for details.
The authors also state that the difference between real and predicted eigenvalues becomes larger as the number of observations gets smaller, because this reflects the lack of information needed to accurately determine the true factor loadings from limited and noisy data. As a result, the factor weights used to predict missing values are not the same as with conventional PCA, but the missing value estimation is improved.
BPCA works iteratively; the complexity grows as $O(n^3)$ because several matrix inversions are required in each step. The size of the matrices to invert depends on the number of components used for re-estimation.
Finding the optimal number of components for estimation is not a trivial task; the best choice depends on the internal structure of the data. The method kEstimate is provided to estimate the optimal number of components via cross validation (a short usage sketch follows this paragraph). In general, few components are sufficient for reasonable estimation accuracy. See also the package documentation for further discussion of the kinds of data for which PCA-based missing value estimation makes sense.
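A minimal sketch of such a cross validation run, assuming kEstimate accepts a method argument and reports the selected component count in a bestNPcs field (return fields may vary across package versions; evalPcs and nruncv below are illustrative choices, not recommendations):
## Estimate a suitable number of components by cross validation
library(pcaMethods)
data(metaboliteData)
est <- kEstimate(t(metaboliteData), method = "bpca", evalPcs = 1:3, nruncv = 1)
## bestNPcs holds the component count with the lowest CV error
est$bestNPcs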
It is not recommended to use this function directly; use the pca() wrapper function instead.
There is a difference in the interpretation of rows (observations) and columns (variables) compared to the Matlab implementation. For estimating missing values in microarray data, the original bpca suggests interpreting genes as observations and samples as variables. In pcaMethods, however, genes are interpreted as variables and samples as observations, which arguably is also the more natural interpretation. For bpca behavior like the Matlab implementation, simply transpose your input matrix (see the sketch below).
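A minimal sketch of the two conventions, assuming a hypothetical matrix myData with genes in rows and samples in columns:
## pcaMethods convention: samples as observations (rows), genes as variables
pcSamplesAsObs <- pca(t(myData), method = "bpca", nPcs = 2)
## Matlab-style convention: genes as observations -- pass the matrix untransposed
pcGenesAsObs <- pca(myData, method = "bpca", nPcs = 2)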
Details about the probabilistic model underlying BPCA are found in Oba et al. (2003). The algorithm uses an expectation-maximization approach together with a Bayesian model to approximate the principal axes (the eigenvectors of the covariance matrix in PCA). The estimation is done iteratively; the algorithm terminates if either the maximum number of iterations is reached or the estimated increase in precision falls below the convergence threshold (by default $10^{-4}$).
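A minimal sketch of controlling the iteration through the pca() wrapper, assuming it forwards maxSteps, threshold, and verbose on to bpca():
## Raise the step limit, tighten the convergence threshold, print progress
pc <- pca(t(metaboliteData), method = "bpca", nPcs = 2,
          maxSteps = 200, threshold = 1e-05, verbose = TRUE)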
Complexity: The relatively high complexity of the method is a result of the several matrix inversions required in each step. In the case that the maximum number of iteration steps is needed, the approximate complexity is given by the term $$\mathrm{maxSteps} \cdot \mathrm{row}_{\mathrm{miss}} \cdot O(n^3)$$ where $\mathrm{row}_{\mathrm{miss}}$ is the number of rows containing missing values and $O(n^3)$ is the complexity of inverting a matrix of size $n = \mathrm{components}$, the number of components used for re-estimation.
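As a rough illustration of how runtime grows with the number of components, a timing sketch on the sample data (absolute numbers will vary by machine and package version):
## Time bpca for increasing component counts
library(pcaMethods)
data(metaboliteData)
for (k in c(2, 5, 10)) {
  elapsed <- system.time(pca(t(metaboliteData), method = "bpca", nPcs = k))["elapsed"]
  cat("nPcs =", k, "-- elapsed:", elapsed, "s\n")
}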
See also: ppca, svdImpute, prcomp, nipalsPca, pca, pcaRes, kEstimate.
## Load a sample metabolite dataset with 5% missing values (metaboliteData)
data(metaboliteData)
## Perform Bayesian PCA with 2 components
pc <- pca(t(metaboliteData), method="bpca", nPcs=2)
## Get the estimated principal axes (loadings)
loadings <- loadings(pc)
## Get the estimated scores
scores <- scores(pc)
## Get the estimated complete observations
cObs <- completeObs(pc)
## Now make a scores and loadings plot
slplot(pc)