robpca: ROBust PCA algorithm

Description

ROBPCA algorithm of Hubert et al. (2005) including reweighting (Engelen et al., 2005) and possible extension to skewed data (Hubert et al., 2009).

Usage

robpca(x, k = 0, kmax = 10, alpha = 0.75, h = NULL, mcd = FALSE,
       ndir = "all", skew = FALSE, ...)

Value

A list with components:

loadings: Loadings matrix containing the robust loadings (eigenvectors), a numeric matrix of size \(p\) by \(k\).
eigenvalues: Numeric vector of length \(k\) containing the robust eigenvalues.
scores: Scores matrix (computed as \((X - center) \cdot loadings\)), a numeric matrix of size \(n\) by \(k\).
center: Numeric vector of length \(p\) containing the robust centre of the data.
k: Number of (chosen) principal components.
H0: Logical vector of size \(n\) indicating if an observation is in the initial h-subset.
H1: Logical vector of size \(n\) indicating if an observation is kept in the reweighting step.
alpha: The robustness parameter \(\alpha\) used throughout the algorithm.
h: The \(h\)-parameter used throughout the algorithm.
sd: Numeric vector of size \(n\) containing the robust score distances within the robust PCA subspace.
od: Numeric vector of size \(n\) containing the orthogonal distances to the robust PCA subspace.
cutoff.sd: Cut-off value for the robust score distances.
cutoff.od: Cut-off value for the orthogonal distances.
flag.sd: Numeric vector of size \(n\) containing the SD-flags of the observations. The observations whose score distance is larger than cutoff.sd receive an SD-flag equal to zero. The other observations receive an SD-flag equal to 1.
flag.od: Numeric vector of size \(n\) containing the OD-flags of the observations. The observations whose orthogonal distance is larger than cutoff.od receive an OD-flag equal to zero. The other observations receive an OD-flag equal to 1.
flag.all: Numeric vector of size \(n\) containing the flags of the observations. The observations whose score distance is larger than cutoff.sd or whose orthogonal distance is larger than cutoff.od can be considered as outliers and receive a flag equal to zero. The regular observations receive flag 1.
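
The flag convention above (0 for flagged observations, 1 for regular ones) can be illustrated with a small base R sketch. The distances and cut-off values below are made-up numbers chosen for illustration; they are not output of robpca, and the variable names score_dist and orth_dist are hypothetical stand-ins for the sd and od components.

```r
# Hypothetical distances and cut-offs, for illustration only
# (not produced by robpca).
score_dist <- c(0.8, 1.5, 3.2, 1.1, 0.9)  # stands in for the sd component
orth_dist  <- c(0.2, 0.9, 0.3, 0.1, 1.4)  # stands in for the od component
cutoff.sd <- 2.7
cutoff.od <- 1.0

# A flag of 0 marks a distance above its cut-off; 1 marks a regular value.
flag.sd <- as.numeric(score_dist <= cutoff.sd)
flag.od <- as.numeric(orth_dist <= cutoff.od)

# An observation is regular overall only if both flags equal 1.
flag.all <- as.numeric(flag.sd == 1 & flag.od == 1)

# Indices of the observations that would be considered outliers.
outliers <- which(flag.all == 0)
print(outliers)
```

In this toy example the third observation exceeds the score distance cut-off and the fifth exceeds the orthogonal distance cut-off, so both receive an overall flag of zero.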

Arguments

x: An \(n\) by \(p\) data matrix with observations in the rows and variables in the columns.
k: Number of principal components that will be used. When k=0 (default), the number of components is selected using the criterion in Hubert et al. (2005).
kmax: Maximal number of principal components that will be computed, default is 10.
alpha: Robustness parameter, default is 0.75. It determines the subset size h when h is not specified.
h: Size of the subset used by the algorithm; the number of outliers the algorithm should resist is \(n-h\). Any value for h between \(n/2\) and \(n\) may be specified. The default, NULL, uses h=ceiling(alpha*n)+1. Do not specify alpha and h at the same time.
mcd: Logical indicating if the MCD adaptation of ROBPCA may be applied when the number of variables is sufficiently small (see Details). If mcd=FALSE (default), the full ROBPCA algorithm is always applied.
ndir: Number of directions used when computing the outlyingness (or the adjusted outlyingness when skew=TRUE), see outlyingness and adjOutl for more details.
skew: Logical indicating if the version for skewed data (Hubert et al., 2009) is applied, default is FALSE.
...: Other arguments to pass to methods.
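
As a small worked example of the relation between alpha and h stated above, the following base R sketch evaluates the default formula for an illustrative sample size. The value of n is an assumption chosen for the example, and the formula is the one given in the description of h, not taken from the package source.

```r
# Illustrative sample size (an assumption for this example).
n <- 100
alpha <- 0.75  # default robustness parameter

# Default subset size, using the formula from the description of h.
h <- ceiling(alpha * n) + 1

# Number of outliers the algorithm is meant to resist.
n_outliers <- n - h

print(c(h = h, n_outliers = n_outliers))
```

With these values, h is 76 and the algorithm should resist up to 24 outlying observations.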

Author

Tom Reynkens, based on R code from Valentin Todorov for PcaHubert in rrcov (released under GPL-3) and Matlab code from Katrien Van Driessen (for the univariate MCD).

Details

This function is based extensively on PcaHubert from rrcov, with two main differences:

For non-skewed data (skew=FALSE), the outlyingness measure is the Stahel-Donoho measure described in Hubert et al. (2005), which PcaHubert also uses. However, the implementation in mrfDepth (used here) is much faster than the one in PcaHubert, so more directions, or even all of them, can be considered when computing the outlyingness.

The extension to skewed data of Hubert et al. (2009) (skew=TRUE) is also implemented here; it is not included in PcaHubert.

For an extensive description of the ROBPCA algorithm we refer to Hubert et al. (2005) and to PcaHubert.

When mcd=TRUE and \(n \geq 5 \times p\) (that is, the number of variables is small relative to the number of observations), we do not apply the full ROBPCA algorithm. The loadings and eigenvalues are then computed as the eigenvectors and eigenvalues of the MCD estimator applied to the data set after the SVD step.

References

Hubert, M., Rousseeuw, P. J., and Vanden Branden, K. (2005), "ROBPCA: A New Approach to Robust Principal Component Analysis," Technometrics, 47, 64–79.

Engelen, S., Hubert, M., and Vanden Branden, K. (2005), "A Comparison of Three Procedures for Robust PCA in High Dimensions," Austrian Journal of Statistics, 34, 117–126.

Hubert, M., Rousseeuw, P. J., and Verdonck, T. (2009), "Robust PCA for Skewed Data and Its Outlier Map," Computational Statistics & Data Analysis, 53, 2264–2274.

Examples

X <- dataGen(m=1, n=100, p=10, eps=0.2, bLength=4)$data[[1]]

resR <- robpca(X, k=2)
diagPlot(resR)
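
The components listed under Value can be inspected directly on the returned list. The sketch below assumes the rospca package is installed and reuses the dataGen call from the example above; the seed is an arbitrary choice added only for reproducibility.

```r
library(rospca)

# Arbitrary seed so that the simulated data can be reproduced.
set.seed(123)
X <- dataGen(m = 1, n = 100, p = 10, eps = 0.2, bLength = 4)$data[[1]]

resR <- robpca(X, k = 2)

# Observations flagged as outliers (overall flag equal to zero).
outliers <- which(resR$flag.all == 0)
length(outliers)

# Compare the largest distances with their cut-off values.
c(max.sd = max(resR$sd), cutoff.sd = resR$cutoff.sd)
c(max.od = max(resR$od), cutoff.od = resR$cutoff.od)

# Dimensions of the robust loadings (p by k) and scores (n by k).
dim(resR$loadings)
dim(resR$scores)
```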
