Generate sparse data with outliers using the simulation scheme detailed in Hubert et al. (2016).
dataGen(m = 100, n = 100, p = 10, a = c(0.9,0.5,0), bLength = 4, SD = c(10,5,2),
eps = 0, seed = TRUE)
A list with components:
List of length \(m\) containing all data matrices.
List of length \(m\) containing the numeric vectors with the indices of the contaminated observations.
Correlation matrix of the data, a numeric matrix of size \(p\) by \(p\).
Covariance matrix of the data (\(\Sigma\)), a numeric matrix of size \(p\) by \(p\).
m: Number of datasets to generate, default is 100.
n: Number of observations, default is 100.
p: Number of dimensions, default is 10.
a: Numeric vector containing the inner group correlations for each block. The number of useful blocks is thus given by \(k=length(a)-1\), which should be at least 2. By default, the correlations are equal to 0.9, 0.5 and 0, respectively.
bLength: Length of the blocks of useful variables, default is 4.
SD: Numeric vector containing the standard deviations of the blocks of variables, default is c(10,5,2). Note that SD and a should have the same length.
eps: Proportion of contamination, should be between 0 and 0.5. Default is 0 (no contamination).
seed: Logical indicating if a seed is used when generating the datasets, default is TRUE.
Tom Reynkens
Firstly, we generate a correlation matrix such that it has sparse eigenvectors.
We design the correlation matrix to have \(length(a)=k+1\) groups of variables with no correlation between variables from different groups. The first \(k\) groups consist of bLength variables each. The correlation between the different variables of group \(i\) is equal to a[i], for \(i=1,...,k\). The (k+1)th group contains the remaining \(p-k \times bLength\) variables, which we specify to have correlation a[k+1].
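The block structure described above can be sketched in base R. This is a minimal illustration, not the package's own code; `blockCor` is a hypothetical helper name introduced here.

```r
# Hypothetical helper (not part of rospca): block-structured correlation matrix.
blockCor <- function(p = 10, a = c(0.9, 0.5, 0), bLength = 4) {
  k <- length(a) - 1                       # number of useful blocks
  stopifnot(k >= 1, p >= k * bLength)
  R <- diag(p)
  for (i in seq_len(k + 1)) {
    # Blocks 1..k hold bLength variables each; block k+1 holds the remainder.
    first <- bLength * (i - 1) + 1
    last  <- if (i <= k) bLength * i else p
    if (last >= first) {
      idx <- first:last
      R[idx, idx] <- a[i]                  # within-block correlation
    }
  }
  diag(R) <- 1                             # unit diagonal
  R
}

R <- blockCor()
R[1:5, 1:5]   # first block plus the first variable of the second block
```

Variables in different blocks keep a correlation of zero because only the diagonal blocks are overwritten.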
Secondly, the correlation matrix R is transformed into the covariance matrix \(\Sigma= V^{0.5} \cdot R \cdot V^{0.5}\), where \(V=diag(SD^2)\).
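As a rough sketch of this transformation in base R, the snippet below rebuilds the default correlation matrix and scales it. It assumes that every variable in block \(i\) receives the standard deviation SD[i]; the names `sizes`, `block` and `sdVec` are introduced here for illustration only.

```r
p <- 10; bLength <- 4
a  <- c(0.9, 0.5, 0)
SD <- c(10, 5, 2)
k  <- length(a) - 1

# Block sizes: k blocks of bLength variables plus the remaining variables.
sizes <- c(rep(bLength, k), p - k * bLength)
block <- rep(seq_along(sizes), times = sizes)    # block label of each variable

# Correlation a[i] within block i, zero across blocks, ones on the diagonal.
R <- outer(block, block, function(i, j) ifelse(i == j, a[i], 0))
diag(R) <- 1

# Assumption: each variable in block i has standard deviation SD[i].
sdVec <- SD[block]
Sigma <- diag(sdVec) %*% R %*% diag(sdVec)       # V^{0.5} R V^{0.5}

# Rescaling Sigma back to a correlation matrix should recover R.
all.equal(cov2cor(Sigma), R)
```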
Thirdly, the n observations are generated from a \(p\)-variate normal distribution with mean the \(p\)-variate zero-vector and covariance matrix \(\Sigma\). Standard normally distributed noise terms are also added to each of the p variables to make the sparse structure of the data harder to detect.
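The sampling step can be illustrated with a Cholesky factor in base R. Whether dataGen draws its samples this way is an assumption; the sketch only shows one standard way to obtain \(N(0, \Sigma)\) rows and add the noise terms.

```r
set.seed(1)                                # fixed seed, for reproducibility only
n <- 100; p <- 10; bLength <- 4
a  <- c(0.9, 0.5, 0)
SD <- c(10, 5, 2)
k  <- length(a) - 1

# Rebuild Sigma (assumption: each variable in block i has standard deviation SD[i]).
block <- rep(seq_len(k + 1), times = c(rep(bLength, k), p - k * bLength))
R <- outer(block, block, function(i, j) ifelse(i == j, a[i], 0))
diag(R) <- 1
Sigma <- diag(SD[block]) %*% R %*% diag(SD[block])

# chol() returns upper-triangular U with t(U) %*% U = Sigma,
# so the rows of Z %*% U follow N(0, Sigma) when Z is standard normal.
Z <- matrix(rnorm(n * p), n, p)
X <- Z %*% chol(Sigma)

# Add standard normal noise to every variable.
X <- X + matrix(rnorm(n * p), n, p)

dim(X)
```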
Finally, \((100 \times eps)\%\) of the data points are randomly replaced by outliers.
These outliers are generated from a \(p\)-variate normal distribution as in Croux et al. (2013).
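A minimal sketch of the replacement step follows. The outlier mean and standard deviation used here (`outMean`, `outSD`) are hypothetical placeholders, not the parameters of Croux et al. (2013), and rounding the number of outliers down is also an assumption.

```r
set.seed(1)
n <- 100; p <- 10; eps <- 0.2
X <- matrix(rnorm(n * p), n, p)            # stand-in for the clean data

nOut <- floor(eps * n)                     # assumption: round down
ind  <- sort(sample.int(n, nOut))          # indices of the contaminated rows

# Hypothetical outlier distribution: shifted mean and inflated spread.
outMean <- 25
outSD   <- 5
X[ind, ] <- matrix(rnorm(nOut * p, mean = outMean, sd = outSD), nOut, p)

length(ind)
```

The vector `ind` plays the role of the indices of the contaminated observations that the function returns for each dataset.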
The \(i\)th eigenvector of \(R\), for \(i=1,...,k\), is given by a (sparse) vector with the \((bLength \times (i-1)+1)\)th till the \((bLength \times i)\)th elements equal to \(1/\sqrt{bLength}\) and all other elements equal to zero.
See Hubert et al. (2016) for more details.
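The eigenvector statement can be checked numerically. The sketch below builds the default correlation matrix and verifies that the sparse vector for block \(i\) satisfies \(R v = \lambda v\), using the standard result that an equicorrelated block of size bLength with correlation a[i] has leading eigenvalue \(1 + (bLength-1) \cdot a[i]\); this eigenvalue formula is not stated in the text above.

```r
p <- 10; bLength <- 4
a <- c(0.9, 0.5, 0)
k <- length(a) - 1

# Block-structured correlation matrix.
block <- rep(seq_len(k + 1), times = c(rep(bLength, k), p - k * bLength))
R <- outer(block, block, function(i, j) ifelse(i == j, a[i], 0))
diag(R) <- 1

isSparseEigenvector <- function(i) {
  v <- numeric(p)
  v[(bLength * (i - 1) + 1):(bLength * i)] <- 1 / sqrt(bLength)
  lambda <- 1 + (bLength - 1) * a[i]       # eigenvalue of an equicorrelated block
  isTRUE(all.equal(drop(R %*% v), lambda * v))
}

sapply(seq_len(k), isSparseEigenvector)
```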
Hubert, M., Reynkens, T., Schmitt, E., and Verdonck, T. (2016), ``Sparse PCA for High-Dimensional Data with Outliers,'' Technometrics, 58, 424--434.
Croux, C., Filzmoser, P., and Fritz, H. (2013), ``Robust Sparse Principal Component Analysis,'' Technometrics, 55, 202--214.
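library(rospca)
# Generate a single dataset with 20% contamination and fit ROBPCA with 2 components
X <- dataGen(m=1, n=100, p=10, eps=0.2, bLength=4)$data[[1]]
resR <- robpca(X, k=2, skew=FALSE)
diagPlot(resR)  # diagnostic plot of the fit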