This function generates multivariate normal datasets with several possible types of outliers. It is used in several simulation studies. For a detailed description, see the referenced papers.
generateData(n, d, mu, Sigma, perout, gamma,
             outlierType = "casewise", seed = NULL)
A list with components:
X
The generated data matrix of size \(n \times d\).
indcells
A vector with the indices of the contaminated cells.
indrows
A vector with the indices of the rowwise outliers.
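The cell indices can be mapped back to row and column positions. The sketch below assumes that indcells holds linear (column-major) indices into X, which is how base R indexes matrices; the values of n, d and indcells are hypothetical stand-ins, not output of generateData.

```r
# Hypothetical stand-in for the indcells component of the result;
# assumed to contain linear (column-major) indices into the n x d matrix X.
n <- 4
d <- 3
indcells <- c(2, 7, 12)

# Convert the linear indices to (row, column) pairs.
pos <- arrayInd(indcells, .dim = c(n, d))
colnames(pos) <- c("row", "col")
pos   # rows 2, 3, 4 paired with columns 1, 2, 3
```

If indcells turns out to be stored differently, only the conversion step needs to change.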
n
The number of observations.
d
The dimension of the data.
mu
The center of the clean data.
Sigma
The covariance matrix of the clean data. It can be obtained from generateCorMat.
outlierType
The type of contamination to be generated. Should be one of:
"casewise": Generates point contamination in the direction of the last eigenvector of Sigma.
"cellwisePlain": Generates cellwise contamination by randomly replacing a number of cells by gamma.
"cellwiseStructured": Generates cellwise contamination by first randomly sampling the contaminated cells, after which, for each row, they are replaced by a multiple of the eigenvector with the smallest eigenvalue of Sigma restricted to the dimensions of the contaminated cells.
"both": Combines "casewise" and "cellwiseStructured".
perout
The fraction of generated outliers. For outlierType = "casewise" this is a fraction of rows. For outlierType = "cellwisePlain" or outlierType = "cellwiseStructured", a fraction perout of the cells is replaced by contaminated cells. For outlierType = "both", a fraction \(0.5 \times\) perout of rowwise outliers is generated, after which the remaining data is contaminated with a fraction \(0.5 \times\) perout of outlying cells.
gamma
How far the outliers are from the center of the distribution.
seed
Seed used to generate the data. Defaults to NULL.
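To make the roles of Sigma, perout and gamma concrete, here is a minimal base-R sketch of casewise contamination along the last eigenvector of Sigma. It is an illustration under stated assumptions, not the code used by generateData: the example Sigma, the rounding of perout * n, and the exact placement of the outliers at mu + gamma * v are all assumed.

```r
# Illustrative sketch only; not the implementation of generateData().
set.seed(1)
n <- 100
d <- 5
mu <- rep(0, d)
Sigma <- 0.5^abs(outer(1:d, 1:d, "-"))   # assumed example covariance matrix
perout <- 0.1
gamma <- 10

# Clean data: rows are N(mu, Sigma), built from the Cholesky factor of Sigma.
X <- matrix(rnorm(n * d), nrow = n, ncol = d) %*% chol(Sigma)
X <- sweep(X, 2, mu, "+")

# eigen() sorts eigenvalues in decreasing order, so column d of the
# eigenvector matrix is the last eigenvector (smallest eigenvalue).
eig <- eigen(Sigma, symmetric = TRUE)
v <- eig$vectors[, d]

# Assumption: a fraction perout of the rows is moved to mu + gamma * v.
indrows <- sort(sample(n, floor(perout * n)))
X[indrows, ] <- matrix(mu + gamma * v, nrow = length(indrows),
                       ncol = d, byrow = TRUE)

# Mahalanobis distance of the outlying rows from the clean distribution.
md <- sqrt(mahalanobis(X[indrows, , drop = FALSE], mu, Sigma))
```

Because v belongs to the smallest eigenvalue of Sigma, a shift of Euclidean length gamma along v gives a Mahalanobis distance of gamma divided by the square root of that eigenvalue, which is the largest distance any shift of that length can reach.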
J. Raymaekers and P.J. Rousseeuw
Agostinelli, C., Leung, A., Yohai, V. J., and Zamar, R. H. (2015). Robust Estimation of Multivariate Location and Scatter in the Presence of Cellwise and Casewise Contamination. Test, 24, 441-461.
Rousseeuw, P. J., and Van den Bossche, W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145.
Raymaekers, J., and Rousseeuw, P. J. (2020). Handling cellwise outliers by sparse regression and robust covariance. arXiv:1912.12446.
generateCorMat
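library(cellWise)      # assumed: the package that provides generateData()
n <- 100               # number of observations
d <- 5                 # dimension of the data
mu <- rep(0, d)        # center of the clean data
Sigma <- diag(d)       # covariance matrix of the clean data
perout <- 0.1          # fraction of contaminated cells
gamma <- 10            # how far the outliers lie from the center
data <- generateData(n, d, mu, Sigma, perout, gamma,
                     outlierType = "cellwisePlain", seed = 1)
pairs(data$X)          # scatterplot matrix of the generated data
data$indcells          # indices of the contaminated cells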
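For readers who want to see the "cellwisePlain" mechanism without the package, the following base-R sketch mimics it with the same settings as the example above. It is a hedged approximation: choosing round(perout * n * d) cells uniformly over the whole matrix is an assumption, and generateData may select cells differently (for instance per column).

```r
# Base-R sketch of "cellwisePlain"-style contamination; an assumed
# mechanism for illustration, not the code used by generateData().
set.seed(1)
n <- 100
d <- 5
perout <- 0.1
gamma <- 10

# Clean data with mu = 0 and Sigma = I, as in the example above.
X <- matrix(rnorm(n * d), nrow = n, ncol = d)

# Assumption: contaminate round(perout * n * d) cells, drawn uniformly
# without replacement over all cells of the matrix.
ncont <- round(perout * n * d)
indcells <- sort(sample(n * d, ncont))
X[indcells] <- gamma

length(indcells)      # number of contaminated cells
mean(X == gamma)      # share of cells that now equal gamma
```

With these settings 50 of the 500 cells are replaced, so the contaminated share equals perout.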