The “standard” spherical \(k\)-means problem where all case
weights are one and \(m = 1\) is equivalent to maximizing the
criterion \(\sum_j \sum_{b \in C_j} \cos(x_b, p_j)\),
where \(C_j\) is the \(j\)-th class of the partition. This is the
formulation used in Dhillon & Modha (2001) and related references, and
when optimized over the prototypes yields the criterion function
\(\mathcal{I}_2\) in the CLUTO documentation.
Obtaining optimal spherical \(k\)-means partitions obviously is a
computationally hard problem, and several methods are available which
attempt to obtain optimal partitions. The built-in methods are as
follows.
"genetic"
a genetic algorithm patterned after the
genetic \(k\)-means algorithm of Krishna & Narasimha Murty (1999).
"pclust"
a Lloyd-Forgy style fixed-point algorithm which
iterates between determining optimal memberships for fixed
prototypes, and computing optimal prototypes for fixed
memberships. For hard partitions, this can optionally attempt
further local improvements via Kernighan-Lin chains of first
variation single object moves as suggested by Dhillon, Guan and
Kogan (2002).
"CLUTO"
an interface to the vcluster
partitional
clustering program from CLUTO, the CLUstering TOolkit by George
Karypis.
"gmeans"
an interface to the gmeans
partitional
clustering program by Yuqiang Guan.
"kmndirs"
an interface to the C code for the
\(k\)-mean-directions algorithm of Ranjan Maitra and Ivan
P. Ramler.
Method "pclust"
is the only method available for soft spherical
\(k\)-means problems. Method "genetic"
can handle case
weights. By default, the genetic algorithm is used for obtaining hard
partitions, and the fixed-point algorithm otherwise.
Common control parameters for methods "genetic"
and
"pclust"
are as follows.
start
a specification of the starting values to be
employed. Can either be a character vector with elements
"p"
(randomly pick objects as prototypes), "i"
(randomly pick ids for the objects), "S"
(take \(p\)
minimizing \(\sum_b w_b d(x_b, p)\) as the first prototype, and
successively pick objects farthest away from the already picked
prototypes), or "s"
(like "S"
, but with the first
prototype a randomly picked object). Can also be a list of
skmeans
objects (obtained by previous runs), a list of
prototype matrices, or a list of class ids. For the genetic
algorithm, the given starting values are used as the initial
population; the fixed-point algorithm is applied individually to
each starting value, and the best solution found is returned.
Defaults to randomly picking objects as prototypes.
reltol
The minimum relative improvement per
iteration. If improvement is less, the algorithm will stop under
the assumption that no further significant improvement can be
made. Defaults to sqrt(.Machine$double.eps)
.
verbose
a logical indicating whether to provide
some output on minimization progress.
Defaults to getOption("verbose")
.
Additional control parameters for method "genetic"
are as
follows.
maxiter
an integer giving the maximum number of
iterations for the genetic algorithm. Defaults to 12.
popsize
an integer giving the population size for the
genetic algorithm. Default: 6.
Only used if start
is not given.
mutations
a number between 0 and 1 giving the
probability of mutation per iteration. Defaults to 0.1.
Additional control parameters for method "pclust"
are as
follows.
maxiter
an integer giving the maximal number of
fixed-point iterations to be performed. Default: 100.
nruns
an integer giving the number of fixed-point runs
to be performed. Default: 1.
Only used if start
is not given.
maxchains
an integer giving the maximal length of the
Kernighan-Lin chains. Default: 0 (no first variation improvements
are attempted).
Control parameters for method "CLUTO"
are as follows.
vcluster
the path to the CLUTO vcluster
executable.
colmodel
a specification of the CLUTO column model.
See the CLUTO documentation for more details.
verbose
as for the genetic algorithm.
control
a character string specifying arguments passed
over to the vcluster
executable.
Control parameters for method "gmeans"
are as follows.
gmeans
the path to the gmeans
executable.
verbose
as for the genetic algorithm.
control
a character string specifying arguments passed
over to the gmeans
executable.
Control parameters for method "kmndirs"
are as follows.
nstart
an integer giving the number of starting points
to compute the starting value for the iteration stage.
Default: 100.
maxiter
an integer giving the maximum number of
iterations.
Default: 10.
Method "CLUTO"
requires that the CLUTO vcluster
executable is available. CLUTO binaries for the Linux, SunOS, Mac OS
X, and MS Windows platforms used to be downloadable from
https://www-users.cse.umn.edu/~karypis/cluto/.
If the executable cannot be found in the system path via
Sys.which("vcluster")
(i.e., named differently or not
made available in the system path), its (full) path must be specified
in control option vcluster
.
Method "gmeans"
requires that the gmeans
executable is
available.
Sources for compilation with ANSI C++ compliant compilers are
available from
https://github.com/feinerer/gmeans-ansi-compliant;
original sources can be obtained from
https://www.cs.utexas.edu/~dml/Software/gmeans.html.
If the executable cannot be found in the system path via
Sys.which("gmeans")
(i.e., named differently or not
made available in the system path), its (full) path must be specified
in control option gmeans
.
Method "kmndirs"
requires package kmndirs (available from
https://R-Forge.R-project.org/projects/kmndirs), which provides
an R interface to a suitable modification of the C code for the
\(k\)-mean-directions algorithm made available as supplementary
material to Maitra & Ramler (2010) at
https://www.tandfonline.com/doi/suppl/10.1198/jcgs.2009.08155.
User-defined methods must have formals x
, k
and
control
, and optionally may have formals weights
or m
if providing support for case weights or soft spherical
\(k\)-means partitions, respectively.