The dataset consists of 2000 data points in \(R^{14}\). On the subset of relevant clustering variables \(S = \{1, 2\}\), the data are drawn from a mixture of four equiprobable spherical Gaussian distributions with means \((0,0)\), \((4,0)\), \((0,2)\) and \((4,2)\). The subset of redundant variables is \(U = \{3, \ldots, 11\}\); these are explained by the subset of predictor variables \(R = \{1, 2\}\). The last three variables, \(W = \{12, 13, 14\}\), are independent.
A data matrix with 2000 observations on 14 variables, followed by a 15th and last column that contains the labels.
scenarioCor[,1:14]
a numeric matrix containing the observations
scenarioCor[,15]
an integer vector containing the labels
The subset \(U\) of redundant variables is simulated as follows:
\(x^{U} = (0, 0, 0.4, 0.8, \ldots, 2) + x^{S} b + \varepsilon\), with \(\varepsilon \sim N(0_9, \Omega)\)
The subset \(W\) of independent variables is simulated as follows:
\(x^{W} \sim N((3.2, 3.6, 4), I_3)\)
For more details on the regression coefficients \(b\) and the covariance matrix \(\Omega\), see Maugis et al. (2009).
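The simulation scheme described above can be sketched in base R. This is only an illustration of the structure (relevant, redundant and independent blocks), not the code used to produce `scenarioCor`: the intercept `a`, the coefficient matrix `b` and the covariance `Omega` below are placeholder values chosen for the example, since the actual values are given in the reference, and unit variance is assumed for the spherical mixture components. The object names (`xS`, `xU`, `xW`, `sim`) are hypothetical.

```r
set.seed(1)
n <- 2000

# Relevant variables S = {1, 2}: four equiprobable spherical Gaussians.
# Unit variance is an assumption made for this sketch.
means  <- rbind(c(0, 0), c(4, 0), c(0, 2), c(4, 2))
labels <- sample(1:4, n, replace = TRUE)
xS     <- means[labels, ] + matrix(rnorm(n * 2), n, 2)

# Redundant variables U = {3, ..., 11}: x^U = a + x^S b + eps.
# a, b and Omega are PLACEHOLDERS, not the values from Maugis et al. (2009).
a     <- rep(0, 9)           # placeholder intercept (length 9)
b     <- matrix(0.5, 2, 9)   # placeholder 2 x 9 regression coefficients
Omega <- diag(9)             # placeholder 9 x 9 noise covariance
eps   <- matrix(rnorm(n * 9), n, 9) %*% chol(Omega)   # rows ~ N(0_9, Omega)
xU    <- sweep(xS %*% b, 2, a, "+") + eps

# Independent variables W: N((3.2, 3.6, 4), I_3), as stated above.
xW <- sweep(matrix(rnorm(n * 3), n, 3), 2, c(3.2, 3.6, 4), "+")

# Same layout as the dataset: 14 variables, labels in column 15.
sim <- cbind(xS, xU, xW, labels)
dim(sim)   # 2000 rows, 15 columns
```

With this layout, `sim[, 1:14]` plays the role of the observation matrix and `sim[, 15]` the role of the label vector, mirroring the format described for `scenarioCor`.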
Maugis, C., Celeux, G., and Martin-Magniette, M. L., 2009. "Variable selection in model-based clustering: A general variable role modeling". Computational Statistics and Data Analysis, vol. 53/11, pp. 3872-3882.
# NOT RUN {
data(scenarioCor)
# }