rlg: Robust Linear Grouping

Description

The function rlg() searches for clusters around affine subspaces whose dimensions are given by the vector d; the length of d is the number of clusters. For instance, d = c(1, 2) means clustering around a line and a plane. To make the estimation robust, a proportion alpha of the observations is trimmed. Trimmed k-means is the special case d = c(0, 0, ..., 0), a vector of k zeros.

Usage

rlg(
  x,
  d,
  alpha = 0.05,
  nstart = 500,
  niter1 = 3,
  niter2 = 20,
  nkeep = 5,
  scale = FALSE,
  parallel = FALSE,
  n.cores = -1,
  trace = FALSE
)

Value

Returns an object of class rlg, which is a list with the following elements:

centers - A matrix of size p x k containing the location vectors (column-wise) of each cluster.
U - A list with k elements, where the j-th element is a p x d_j matrix whose d_j columns are orthonormal vectors spanning the affine subspace (after subtracting the corresponding cluster's location vector in centers). d_j is the intrinsic dimension of the affine subspace approximating the j-th cluster, i.e., the j-th element of vector d.
cluster - A numerical vector of size n containing the cluster assignment for each observation. Cluster names are integer numbers from 1 to k, 0 indicates trimmed observations.
obj - The value of the objective function of the best (returned) solution.
cluster.ini - A matrix with nstart rows and n columns (one per observation), where each row contains the cluster assignments (0 for trimmed observations) obtained after niter1 concentration steps for one of the nstart random initializations.
obj.ini - A numerical vector of length nstart containing the values of the target function obtained after niter1 concentration steps for each of the nstart random initializations.
x - The input data set.
dimensions - The input d vector with the intrinsic dimensions. The number of clusters is the length of that vector.
alpha - The input trimming level.
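
As an illustration of how these elements fit together, the following base R sketch builds a small list by hand with the same layout as centers, U and cluster, then counts cluster sizes and computes the coordinates of one cluster's observations inside its subspace. The object fit and the helper subspace_scores() are hypothetical and exist only for this sketch; they are not part of the package.

```r
# Hypothetical stand-in for an rlg result: two clusters in p = 2 dimensions,
# one around a point (d = 0) and one around a line (d = 1).
x <- rbind(c(0.1, 0.0), c(-0.1, 0.1), c(5, 5.1), c(6, 5.9), c(20, -20))
fit <- list(
  centers = cbind(c(0, 0), c(5.5, 5.5)),          # p x k, one column per cluster
  U = list(matrix(0, nrow = 2, ncol = 0),          # p x 0: no direction vectors
           matrix(c(1, 1) / sqrt(2), ncol = 1)),   # p x 1: unit vector of the line
  cluster = c(1, 1, 2, 2, 0)                       # 0 marks a trimmed observation
)

# Cluster sizes, with the trimmed observations counted under label 0.
sizes <- table(factor(fit$cluster, levels = 0:ncol(fit$centers)))

# Coordinates (scores) of the observations of cluster j within its subspace.
subspace_scores <- function(fit, x, j) {
  xj <- as.matrix(x)[fit$cluster == j, , drop = FALSE]
  centered <- sweep(xj, 2, fit$centers[, j])   # subtract the location vector
  centered %*% fit$U[[j]]                      # n_j x d_j matrix of scores
}

scores2 <- subspace_scores(fit, x, 2)
```

For a cluster with d_j = 0 the score matrix has zero columns, which matches the p x 0 shape of the corresponding element of U.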

Arguments

x: A matrix or data.frame of dimension n x p, containing the observations (rowwise).
d: A numeric vector of length equal to the number of clusters to be detected. Each component of vector d gives the intrinsic dimension of the affine subspace around which the observations of that cluster are grouped. All the elements of vector d should be smaller than the problem dimension p, i.e., at most p - 1.
alpha: The proportion of observations to be trimmed.
nstart: The number of random initializations to be performed.
niter1: The number of concentration steps to be performed for the nstart initializations.
niter2: The maximum number of concentration steps to be performed for the nkeep solutions kept for further iteration. The concentration steps stop as soon as two consecutive steps lead to the same data partition.
nkeep: The number of initializations (after niter1 concentration steps) with the best values of the target function that are kept for further iterations.
scale: A robust centering and scaling (using the median and MAD) is done if TRUE.
parallel: A logical value, specifying whether the nstart initializations should be done in parallel.
n.cores: The number of cores to use when parallelizing; only taken into account if parallel = TRUE.
trace: Defines the tracing level, which is off (FALSE, i.e. level 0) by default. Tracing level 1 gives additional information on the stage of the iterative process.
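
The effect of scale = TRUE can be pictured with the base R sketch below, which centers each column at its median and divides by its MAD. The helper robust_scale() is hypothetical, and the use of mad() with its default consistency constant (1.4826) is an assumption; the scaling applied inside rlg() may differ in such details.

```r
# Sketch of median/MAD standardisation in base R. Assumption: mad() is called
# with its default consistency constant (1.4826); rlg may use another variant.
robust_scale <- function(x) {
  x <- as.matrix(x)
  med <- apply(x, 2, median)                 # robust location per column
  s <- apply(x, 2, mad)                      # robust spread per column
  sweep(sweep(x, 2, med, "-"), 2, s, "/")
}

# The outlying value 100 barely affects the median and the MAD of column a.
x <- cbind(a = c(1, 2, 3, 4, 100), b = c(10, 20, 30, 40, 50))
z <- robust_scale(x)
apply(z, 2, median)  # both columns now have median 0
apply(z, 2, mad)     # and MAD 1
```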

Author

Javier Crespo Guerrero, Jesús Fernández Iglesias, Luis Angel Garcia Escudero, Agustin Mayo Iscar.

Details

The procedure performs robust clustering around affine subspaces with a trimming proportion alpha by minimizing the trimmed sum of squared orthogonal residuals. Each component of vector d gives the intrinsic dimension of the affine subspace around which the observations of that cluster are grouped. A component equal to 0 therefore implies clustering around centres, 1 around lines, 2 around planes, and so on. The procedure thus allows simultaneous clustering and dimensionality reduction.
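
To make the notion of a squared orthogonal residual concrete, here is a minimal base R sketch. The helper sq_orth_dist() is hypothetical (it is not a package function); it assumes the subspace is described by a location vector and a matrix U with orthonormal columns, as in the Value section.

```r
# Squared orthogonal distance from each row of x to the affine subspace
# center + span(U), where U has orthonormal columns (p x d, d may be 0).
sq_orth_dist <- function(x, center, U) {
  centered <- sweep(as.matrix(x), 2, center)
  proj <- centered %*% U                      # coordinates inside the subspace
  rowSums(centered^2) - rowSums(proj^2)       # Pythagoras: total minus explained
}

x <- rbind(c(3, 4), c(1, 0))
origin <- c(0, 0)

# d = 0: the "subspace" is the point itself, so this is the squared distance.
r0 <- sq_orth_dist(x, origin, matrix(0, 2, 0))        # 25 1
# d = 1: a line along the first axis; only the second coordinate remains.
r1 <- sq_orth_dist(x, origin, matrix(c(1, 0), 2, 1))  # 16 0
```

With d = 0 the residual reduces to the squared Euclidean distance to the centre, which is why the all-zero d vector corresponds to trimmed k-means.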

The algorithm is iterative: it performs "concentration steps" to improve the current cluster assignments. To approximate the global optimum, the procedure is randomly initialized nstart times, and niter1 concentration steps are performed for each initialization. The nkeep most "promising" solutions, i.e. those with the best values of the target function after these initial steps, are then iterated until convergence or until niter2 concentration steps have been performed.
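
The following base R sketch shows one plausible form of a concentration step: assign each observation to its closest subspace, trim the observations with the largest distances, and refit each cluster. It is an illustration under stated assumptions, not the package's implementation: concentration_step() and sq_orth_dist() are hypothetical names, floor(n * alpha) is an assumed rounding rule, and the refit uses the cluster mean and the leading right singular vectors of the centered data.

```r
# Squared orthogonal distance of the rows of x to center + span(U).
sq_orth_dist <- function(x, center, U) {
  centered <- sweep(x, 2, center)
  rowSums(centered^2) - rowSums((centered %*% U)^2)
}

concentration_step <- function(x, centers, U, alpha) {
  n <- nrow(x); k <- ncol(centers)
  # 1. Distance of every observation to every subspace; pick the closest.
  dist <- sapply(seq_len(k), function(j) sq_orth_dist(x, centers[, j], U[[j]]))
  cluster <- max.col(-dist, ties.method = "first")
  best <- dist[cbind(seq_len(n), cluster)]
  # 2. Trim the observations with the largest distances (label 0).
  n_trim <- floor(n * alpha)   # assumption: rounding rule chosen for the sketch
  if (n_trim > 0) cluster[order(best, decreasing = TRUE)[seq_len(n_trim)]] <- 0
  # Objective of this partition, measured against the fit that was passed in.
  obj <- sum(best[cluster > 0])
  # 3. Refit each cluster: mean as location, leading principal directions as U.
  for (j in seq_len(k)) {
    xj <- x[cluster == j, , drop = FALSE]
    d_j <- ncol(U[[j]])
    if (nrow(xj) <= d_j) next              # too few points: keep the old fit
    centers[, j] <- colMeans(xj)
    if (d_j > 0) {
      sv <- svd(sweep(xj, 2, centers[, j]))
      U[[j]] <- sv$v[, seq_len(d_j), drop = FALSE]
    }
  }
  list(centers = centers, U = U, cluster = cluster, obj = obj)
}

# Toy data: points on two lines plus one gross outlier.
t <- seq(-1, 1, length.out = 10)
x <- unname(rbind(cbind(t, 0), cbind(t, 5 + t), c(50, 50)))

# Rough starting fit: two horizontal lines at heights 1 and 4.
fit <- list(centers = cbind(c(0, 1), c(0, 4)),
            U = list(matrix(c(1, 0), 2, 1), matrix(c(1, 0), 2, 1)))

# Iterate until two consecutive steps give the same partition (at most 20 steps).
for (it in 1:20) {
  new <- concentration_step(x, fit$centers, fit$U, alpha = 0.05)
  converged <- identical(new$cluster, fit$cluster)
  fit <- new
  if (converged) break
}
```

In this toy run the outlier is trimmed in the first step and the partition no longer changes in the second, which mirrors the stopping rule described for niter2. A multi-start scheme in the spirit of nstart and nkeep would repeat this from many random starting fits and continue only from the most promising ones.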

References

García‐Escudero, L. A., Gordaliza, A., San Martin, R., Van Aelst, S., & Zamar, R. (2009). Robust linear clustering. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 301-318.

Examples

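##--- EXAMPLE 1 ------------------------------------------
data(LG5data)
x <- LG5data[, 1:10]
# Three clusters, each around a 2-dimensional affine subspace; 10% trimming
clus <- rlg(x, d = c(2, 2, 2), alpha = 0.1)
# Trimmed observations have cluster 0, so they are drawn in colour 1
plot(x, col = clus$cluster + 1)
plot(clus, which = "eigenvalues")
plot(clus, which = "scores")

##--- EXAMPLE 2 ------------------------------------------
data(pine)
# Three clusters, each around a line; 3.5% trimming
clus <- rlg(pine, d = c(1, 1, 1), alpha = 0.035)
plot(pine, col = clus$cluster + 1)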
