This program performs K-Means clustering on the given dataset. It can return
the learned cluster assignments, and the centroids of the clusters. Empty
clusters are not allowed by default; when a cluster becomes empty, the point
furthest from the centroid of the cluster with maximum variance is taken to
fill that cluster.
Optionally, the strategy to choose initial centroids can be specified. The
k-means++ algorithm can be used to choose initial centroids with the
"kmeans_plus_plus" parameter. The Bradley and Fayyad approach ("Refining
initial points for k-means clustering", 1998) can be used to select initial
points by specifying the "refined_start" parameter. This approach works by
taking random samplings of the dataset; to specify the number of samplings,
the "samplings" parameter is used, and to specify the percentage of the
dataset to be used in each sample, the "percentage" parameter is used (it
should be a value between 0.0 and 1.0).
There are several options available for the algorithm used for each Lloyd
iteration, specified with the "algorithm" option. The standard O(kN)
approach can be used ('naive'). Other options include the Pelleg-Moore
tree-based algorithm ('pelleg-moore'), Elkan's triangle-inequality based
algorithm ('elkan'), Hamerly's modification to Elkan's algorithm ('hamerly'),
the dual-tree k-means algorithm ('dualtree'), and the dual-tree k-means
algorithm using the cover tree ('dualtree-covertree').
The behavior for when an empty cluster is encountered can be modified with
the "allow_empty_clusters" option. When this option is specified and there
is a cluster owning no points at the end of an iteration, that cluster's
centroid will simply remain in its position from the previous iteration. If
the "kill_empty_clusters" option is specified, then when a cluster owns no
points at the end of an iteration, the cluster centroid is simply filled with
DBL_MAX, killing it and effectively reducing k for the rest of the
computation. Note that the default option when neither empty cluster option
is specified can be time-consuming to calculate; therefore, specifying either
of these parameters will often accelerate runtime.
Initial clustering assignments may be specified using the "initial_centroids"
parameter, and the maximum number of iterations may be specified with the
"max_iterations" parameter.