The function rlg() searches for clusters around affine subspaces of dimensions given by the vector d (the length of that vector is the number of clusters). For instance, d=c(1,2) means that we are clustering around a line and a plane. To robustify the estimation, a proportion alpha of observations is trimmed. In particular, the trimmed k-means method is obtained as a special case of rlg when d=c(0,0,...,0) (a vector of length k with zeroes).
rlg(
x,
d,
alpha = 0.05,
nstart = 500,
niter1 = 3,
niter2 = 20,
nkeep = 5,
scale = FALSE,
parallel = FALSE,
n.cores = -1,
trace = FALSE
)
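As noted in the description above, setting every entry of d to zero makes rlg cluster around centres, i.e. trimmed k-means. A minimal sketch of that special case using the LG5data set shipped with the package (the choice of three clusters and 10% trimming is illustrative, not prescriptive):

data(LG5data)
x <- LG5data[, 1:10]
## d = c(0, 0, 0): three clusters of intrinsic dimension 0, i.e. trimmed k-means
km_like <- rlg(x, d = c(0, 0, 0), alpha = 0.1)
table(km_like$cluster)   # cluster labels 1..3; 0 marks trimmed observations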
Returns an object of class rlg
which is basically a list with the following elements:
centers - A matrix of size p x k containing the location vectors (column-wise) of each cluster.
U - A list with k elements, where each element is a p x d_j matrix whose d_j columns are orthonormal vectors generating the affine subspace (after subtracting the corresponding cluster’s location parameter in centers). d_j is the intrinsic dimension of the affine subspace approximation in the j-th cluster, i.e., the j-th element of vector d.
cluster - A numerical vector of size n containing the cluster assignment for each observation. Cluster names are integer numbers from 1 to k, 0 indicates trimmed observations.
obj - The value of the objective function of the best (returned) solution.
cluster.ini - A matrix with nstart rows and as many columns as observations, where each row shows the final clustering assignments (0 for trimmed observations) obtained after the niter1 concentration steps of the corresponding random initialization.
obj.ini - A numerical vector of length nstart containing the values of the target function obtained after the niter1 concentration steps of the nstart random initializations.
x - The input data set.
dimensions - The input d vector with the intrinsic dimensions. The number of clusters is the length of that vector.
alpha - The input trimming level.
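A short sketch of how these returned elements fit together (the data and parameter choices mirror Example 1 below and are illustrative only): the j-th cluster is approximated by the affine subspace through clus$centers[, j] spanned by the columns of clus$U[[j]].

data(LG5data)
x <- LG5data[, 1:10]
clus <- rlg(x, d = c(2, 2, 2), alpha = 0.1)
clus$centers[, 1]        # location vector of the first cluster (length p)
dim(clus$U[[1]])         # p x d_1 orthonormal basis, here 10 x 2
sum(clus$cluster == 0)   # number of trimmed observations
clus$obj                 # objective value of the returned solution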
x - A matrix or data.frame of dimension n x p, containing the observations (row-wise).
d - A numeric vector of length equal to the number of clusters to be detected. Each component of vector d indicates the intrinsic dimension of the affine subspace around which the observations in that cluster are going to be clustered. All the elements of vector d should be smaller than the problem dimension minus 1.
alpha - The proportion of observations to be trimmed.
nstart - The number of random initializations to be performed.
niter1 - The number of concentration steps to be performed for the nstart initializations.
niter2 - The maximum number of concentration steps to be performed for the nkeep solutions kept for further iteration. The concentration steps are stopped whenever two consecutive steps lead to the same data partition.
nkeep - The number of iterated initializations (after niter1 concentration steps) with the best values of the target function that are kept for further iterations.
scale - A robust centering and scaling (using the median and MAD) is done if TRUE (see the sketch after this list).
parallel - A logical value, specifying whether the nstart initializations should be done in parallel.
n.cores - The number of cores to use when parallelizing, only taken into account if parallel=TRUE.
trace - Defines the tracing level, which is set to 0 by default. Tracing level 1 gives additional information on the stage of the iterative process.
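The scale argument is described as a robust centering and scaling with the median and MAD; a hedged sketch of what an equivalent manual preprocessing might look like (column-wise median/MAD scaling is assumed here; the internal implementation may differ in details):

data(LG5data)
x <- as.matrix(LG5data[, 1:10])
## column-wise robust standardization: subtract the median, divide by the MAD
x_scaled <- sweep(x, 2, apply(x, 2, median), "-")
x_scaled <- sweep(x_scaled, 2, apply(x, 2, mad), "/")
clus <- rlg(x_scaled, d = c(2, 2, 2), alpha = 0.1)   # assumed comparable to rlg(x, ..., scale = TRUE)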
Javier Crespo Guerrero, Jesús Fernández Iglesias, Luis Angel Garcia Escudero, Agustin Mayo Iscar.
The procedure performs robust clustering around affine subspaces with an alpha proportion of trimmed observations by minimizing the trimmed sum of squared orthogonal residuals. Each component of vector d indicates the intrinsic dimension of the affine subspace around which the observations in that cluster are going to be clustered. Therefore, a component equal to 0 in that vector implies clustering around centres, equal to 1 around lines, equal to 2 around planes, and so on. The procedure thus allows simultaneous clustering and dimensionality reduction.
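To make the objective concrete, here is a small sketch of the squared orthogonal residual of one observation with respect to the fitted affine subspace of a cluster, written in terms of the centers and U components returned by rlg (the helper name orth_res2 is purely illustrative):

## squared orthogonal distance from observation xi (length p) to the affine
## subspace with location `center` and p x d_j orthonormal basis `Uj`
orth_res2 <- function(xi, center, Uj) {
  v <- xi - center                   # centre the observation
  proj <- Uj %*% crossprod(Uj, v)    # component lying inside the subspace
  sum((v - proj)^2)                  # what remains: squared orthogonal residual
}
## e.g. orth_res2(x[1, ], clus$centers[, 1], clus$U[[1]]) for observation 1 and cluster 1

The returned obj is then the sum of these residuals over the non-trimmed observations, each taken with respect to its assigned cluster.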
This iterative algorithm performs "concentration steps" to improve the current cluster assignments. To approximately obtain the global optimum, the procedure is randomly initialized nstart times and niter1 concentration steps are performed for each initialization. The nkeep most "promising" solutions, i.e. the nkeep iterated solutions with the best values of the target function after those initial steps, are then iterated until convergence or until niter2 concentration steps are done.
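A hedged sketch of how these tuning arguments interact in practice: raising nstart and nkeep makes the random search more thorough at a computational cost, and obj.ini can be inspected afterwards to see how the initializations performed after their niter1 concentration steps (data and parameter values below are illustrative only):

data(LG5data)
x <- LG5data[, 1:10]
clus <- rlg(x, d = c(2, 2, 2), alpha = 0.1,
            nstart = 1000, niter1 = 5, nkeep = 10)
summary(clus$obj.ini)   # spread of target values across the 1000 initializations
clus$obj                # best value after fully iterating the 10 kept solutions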
García‐Escudero, L. A., Gordaliza, A., San Martin, R., Van Aelst, S., & Zamar, R. (2009). Robust linear clustering. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 301-318.
##--- EXAMPLE 1 ------------------------------------------
data(LG5data)
x <- LG5data[, 1:10]
clus <- rlg(x, d = c(2,2,2), alpha=0.1)
plot(x, col=clus$cluster+1)
plot(clus, which="eigenvalues")
plot(clus, which="scores")
##--- EXAMPLE 2 ------------------------------------------
data(pine)
clus <- rlg(pine, d = c(1,1,1), alpha=0.035)
plot(pine, col=clus$cluster+1)