Probabilistic PCA combines an EM approach for PCA with a
probabilistic model. The EM approach is based on the assumption
that the latent variables as well as the noise are normally
distributed.
In standard PCA, data that are far from the training set but close
to the principal subspace may have the same reconstruction error as
data near the training set.
PPCA defines a likelihood function such that the likelihood for
data far from the training set is much lower, even if they are
close to the principal subspace. This can improve the
estimation accuracy.
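The difference between reconstruction error and likelihood can be seen in a small worked example. The sketch below is illustrative only and does not use the package: it assumes a two-dimensional, zero-mean toy model whose principal axis is the x axis, and the names `LAM`, `SIGMA2`, `reconstruction_error` and `ppca_log_likelihood` are made up for this example.

```python
import math

# Assumed toy model (not taken from the package): 2-D data with zero mean.
# LAM is the total variance along the principal axis (the x axis),
# SIGMA2 is the isotropic noise variance orthogonal to it.
LAM = 4.0
SIGMA2 = 0.25


def reconstruction_error(point):
    """Squared distance from the point to the principal subspace (the x axis)."""
    _, y = point
    return y * y


def ppca_log_likelihood(point, lam=LAM, sigma2=SIGMA2):
    """Log density of N(0, diag(lam, sigma2)), the marginal in this toy setup."""
    x, y = point
    log_det = math.log(lam) + math.log(sigma2)
    mahalanobis = x * x / lam + y * y / sigma2
    return -0.5 * (2 * math.log(2 * math.pi) + log_det + mahalanobis)


near = (1.0, 0.1)   # close to the training data and to the subspace
far = (50.0, 0.1)   # far from the training data, equally close to the subspace

print(reconstruction_error(near), reconstruction_error(far))
print(ppca_log_likelihood(near), ppca_log_likelihood(far))
```

Both points have the same reconstruction error because they lie at the same distance from the x axis, while the log-likelihood of the far point is much lower because it is penalised for its distance along the principal axis as well.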
A method called kEstimate is provided to estimate the optimal
number of components via cross validation. In general, a few
components are sufficient for reasonable estimation accuracy. See
also the package documentation for further discussion of the kinds
of data for which PCA-based missing value estimation is advisable.
Complexity: Runtime is linear in the number of data points, the
number of data dimensions and the number of principal components.
Convergence: The threshold indicating convergence was
changed from 1e-3 in 1.2.x to 1e-5 in the current version, leading
to more stable results. For reproducibility you can set the seed
(parameter seed) of the random number generator. If used for
missing value estimation, results may be checked by simply running
the algorithm several times with a different seed each time. If the
estimated values show little variance, the algorithm converged well.
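The seed check described above can be sketched generically. This is a minimal illustration, not the package's own code: `seed_variance` accepts any imputation function of the form `impute(data, seed)`, and `toy_impute` is a hypothetical stand-in (column mean plus small seed-dependent noise) used only so the example runs on its own. In practice the PPCA-based estimator would take its place.

```python
import random
import statistics


def seed_variance(impute, data, seeds):
    """Variance of each imputed value across runs that differ only in the seed.

    `data` is a list of rows with None marking missing cells, and
    `impute(data, seed)` returns a completed copy of `data`.
    """
    missing = [(i, j) for i, row in enumerate(data)
               for j, value in enumerate(row) if value is None]
    runs = [impute(data, seed) for seed in seeds]
    return {(i, j): statistics.pvariance([run[i][j] for run in runs])
            for (i, j) in missing}


def toy_impute(data, seed):
    """Hypothetical stand-in imputer: column mean plus small seeded noise."""
    rng = random.Random(seed)
    completed = [list(row) for row in data]
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        mean = statistics.mean(observed)
        for row in completed:
            if row[j] is None:
                row[j] = mean + rng.gauss(0.0, 1e-3)
    return completed


data = [[1.0, 2.0], [None, 2.2], [1.2, None], [0.9, 1.9]]
variances = seed_variance(toy_impute, data, seeds=range(5))
for cell, var in sorted(variances.items()):
    print(cell, var)
```

Small variances for every missing cell suggest that the estimates do not depend much on the random initialisation. What counts as small depends on the scale of the data, so it is worth comparing against the variance of the observed values in the same column.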