Objects of class "valstat" store cluster validation statistics from various clustering methods run with various numbers of clusters.

A legitimate valstat object is a list. Its format depends on the number of clustering methods involved, nmethods, say, i.e., the length of the method component explained below. The first nmethods elements of the valstat list are simply numbered. Each of these is itself a list whose elements are numbered from 1 to the value of the maxG component defined below. Element [[i]][[j]] refers to the clustering from clustering method number i with j clusters. Every such element is a list with components avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy. Further optional components are pamc, kdnorm, kdunif, dmode, aggregated. All these are cluster validation indexes, as follows.
avewithin: average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight).
mnnd: average distance to the nnk-th nearest neighbour within the cluster. (nnk is a parameter of cqcluster.stats, default 2.)
cvnnd: coefficient of variation of dissimilarities to the nnk-th nearest within-cluster neighbour, measuring uniformity of within-cluster densities, weighted over all clusters, see Sec. 3.7 of Hennig (2017). (nnk is a parameter of cqcluster.stats, default 2.)
maxdiameter: maximum cluster diameter.
widestgap: widest within-cluster gap or average of cluster-wise widest within-cluster gaps, depending on parameter averagegap of cqcluster.stats, default FALSE.
sindex: separation index. It is based on the distance from every point to the closest point not in the same cluster. The separation index is the mean of the smallest proportion sepprob (parameter of cqcluster.stats, default 0.1) of these distances. See Hennig (2017).
minsep: minimum cluster separation.
asw: average silhouette width. See silhouette.
dindex: this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2017); low values are good.
denscut: this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2017); low values are good.
highdgap: this index measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2017); low values are good.
pearsongamma: correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001).
withinss: a generalisation of the within-clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix. For general distance measures, this is half the sum of the within-cluster squared dissimilarities divided by the cluster size.
entropy: entropy of the distribution of cluster memberships, see Meila (2007).
pamc: average distance to the cluster centroid, which is the observation that minimises this average distance.
kdnorm: Kolmogorov distance between the distribution of within-cluster Mahalanobis distances and the appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea).
kdunif: Kolmogorov distance between the distribution of distances to the dnnk-th nearest within-cluster neighbour and the appropriate Gamma distribution, see Byers and Raftery (1998), aggregated over clusters. dnnk is the parameter nnk of distrsimilarity, corresponding to dnnk of clusterbenchstats.
dmode: aggregated density mode index equal to 0.75*dindex+0.25*highdgap before standardisation.
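Three of the indexes above have simple closed forms that can be checked by hand. The following is a minimal base R sketch, not the package's own code: the toy data, the partition, and the helper names withinss_from_dist, withinss_from_means, membership_entropy and pearson_gamma are all made up for illustration. It assumes that the sum in the withinss definition runs over all ordered within-cluster pairs and that the entropy uses the natural logarithm.

```r
# Toy data: 6 points in the plane with a hypothetical 2-cluster partition.
x <- matrix(c(0, 0,
              1, 0,
              0, 1,
              5, 5,
              6, 5,
              5, 7), ncol = 2, byrow = TRUE)
clustering <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(x))  # Euclidean distance matrix

# withinss from dissimilarities: per cluster, sum the squared dissimilarities
# over all ordered within-cluster pairs, halve, divide by the cluster size.
withinss_from_dist <- function(d, clustering) {
  sum(sapply(unique(clustering), function(k) {
    idx <- clustering == k
    sum(d[idx, idx, drop = FALSE]^2) / (2 * sum(idx))
  }))
}

# Reference value: classical sum of squared deviations from the cluster means.
withinss_from_means <- function(x, clustering) {
  sum(sapply(unique(clustering), function(k) {
    xk <- x[clustering == k, , drop = FALSE]
    sum(sweep(xk, 2, colMeans(xk))^2)
  }))
}

# Entropy of the cluster membership distribution (natural log assumed).
membership_entropy <- function(clustering) {
  p <- as.vector(table(clustering)) / length(clustering)
  -sum(p * log(p))
}

# Correlation between pairwise distances and a 0-1 vector that is
# 0 for pairs in the same cluster and 1 for pairs in different clusters.
pearson_gamma <- function(d, clustering) {
  lower <- lower.tri(d)
  different <- outer(clustering, clustering, "!=")
  cor(d[lower], as.numeric(different[lower]))
}

wss_d <- withinss_from_dist(d, clustering)
wss_m <- withinss_from_means(x, clustering)
ent   <- membership_entropy(clustering)
pg    <- pearson_gamma(d, clustering)
```

For Euclidean distances the two within-cluster sums of squares agree, which is the sense in which withinss generalises the k-means objective. With two clusters of equal size the entropy equals log(2).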
Furthermore, a valstat object has the following list components:
maxG: maximum number of clusters.
minG: minimum number of clusters (list entries below that number are empty lists).
method: vector of names (character strings) of clustering CBI-functions, see kmeansCBI.
name: vector of names (character strings) of clustering methods. These can be user-chosen names (see argument methodsnames in clusterbenchstats) and may distinguish different methods run by the same CBI-function but with different parameter values, such as complete and average linkage for hclustCBI.
statistics: vector of names (character strings) of cluster validation indexes.
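The nested layout described above can be illustrated with a hand-built list. This is only a sketch of the structure in base R, not output of clusterbenchstats: all index values are placeholders, only two of the index components are filled in, and the component names minG, name and statistics are assumed to match those of a real valstat object.

```r
# Hypothetical set-up: two clustering methods, numbers of clusters 2 to 3.
methods <- c("kmeansCBI", "hclustCBI")
minG <- 2
maxG <- 3

# Placeholder index values; a real object would hold all index components.
mock_stats <- function(i, j) list(avewithin = 0.1 * i + j, asw = 0.5 / j)

# First nmethods elements: one list per method, indexed by number of clusters.
# Entries below minG are empty lists.
mock <- lapply(seq_along(methods), function(i) {
  lapply(seq_len(maxG), function(j) {
    if (j < minG) list() else mock_stats(i, j)
  })
})

# Remaining named components.
mock$maxG <- maxG
mock$minG <- minG
mock$method <- methods
mock$name <- c("kmeans", "average")
mock$statistics <- c("avewithin", "asw")

nmethods <- length(mock$method)

# Index value for method number 2 with 3 clusters.
asw_2_3 <- mock[[2]][[3]]$asw

# Collect one index into a table: rows are numbers of clusters, columns methods.
asw_table <- sapply(seq_len(nmethods), function(i) {
  sapply(mock$minG:mock$maxG, function(j) mock[[i]][[j]]$asw)
})
dimnames(asw_table) <- list(paste0("G=", mock$minG:mock$maxG), mock$name)
```

Looping from minG to maxG rather than from 1 avoids touching the empty list entries below the minimum number of clusters.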
These objects are generated as part of the clusterbenchstats output.
The valstat class has methods for the following generic functions: print, plot, see plot.valstat.
Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282