cqcluster.stats: Cluster validation statistics (version for use with clusterbenchstats)

Description

This is a more sophisticated version of cluster.stats for use with clusterbenchstats, see Hennig (2017). Computes a number of distance-based statistics, which can be used for cluster validation, comparison between clusterings and decision about the number of clusters: cluster sizes, cluster diameters, average distances within and between clusters, cluster separation, biggest within cluster gap, average silhouette widths, the Calinski and Harabasz index, a Pearson version of Hubert's gamma coefficient, the Dunn index, further statistics introduced in Hennig (2017) and two indexes to assess the similarity of two clusterings, namely the corrected Rand index and Meila's VI.

Usage

cqcluster.stats(d = NULL, clustering, alt.clustering = NULL,
    noisecluster = FALSE,
    silhouette = TRUE, G2 = FALSE, G3 = FALSE, wgap = TRUE, sepindex = TRUE,
    sepprob = 0.1, sepwithnoise = TRUE, compareonly = FALSE,
    aggregateonly = FALSE,
    averagegap = FALSE, pamcrit = TRUE,
    dquantile = 0.1,
    nndist = TRUE, nnk = 2, standardisation = "max", sepall = TRUE, maxk = 10,
    cvstan = sqrt(length(clustering)))
# S3 method for cquality
summary(object,stanbound=TRUE,largeisgood=TRUE, ...)
# S3 method for summary.cquality
print(x, ...)

Arguments

d

a distance object (as generated by dist) or a distance matrix between cases.

clustering

an integer vector whose length is the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters.

alt.clustering

an integer vector such as for clustering, indicating an alternative clustering. If provided, the corrected Rand index and Meila's VI for clustering vs. alt.clustering are computed.

noisecluster

logical. If TRUE, it is assumed that the largest cluster number in clustering denotes a 'noise class', i.e. points that do not belong to any cluster. These points are not taken into account for the computation of all functions of within and between cluster distances including the validation indexes.

silhouette

logical. If TRUE, the silhouette statistics are computed, which requires package cluster.

G2

logical. If TRUE, Goodman and Kruskal's index G2 (cf. Gordon (1999), p. 62) is computed. This executes lots of sorting algorithms and can be very slow (it has been improved by R. Francois - thanks!)

G3

logical. If TRUE, the index G3 (cf. Gordon (1999), p. 62) is computed. This executes sort on all distances and can be extremely slow.

wgap

logical. If TRUE, the widest within-cluster gaps (largest link in within-cluster minimum spanning tree) are computed. This is used for finding a good number of clusters in Hennig (2013). See also parameter averagegap.

sepindex

logical. If TRUE, a separation index is computed, based on the distance from every point to the closest point not in the same cluster. The separation index is then the mean of the smallest proportion sepprob of these distances. This formalises separation in a way that is less sensitive to a single or a few ambiguous points. The output component corresponding to this is sindex, not separation! This is used for finding a good number of clusters in Hennig (2013). See also parameter sepall.

sepprob

numerical between 0 and 1, see sepindex.

sepwithnoise

logical. If TRUE and sepindex and noisecluster are both TRUE, the noise points are incorporated as a cluster in the separation index (sepindex) computation. They are also taken into account in the computation of the minimum cluster separation.

compareonly

logical. If TRUE, only the corrected Rand index and Meila's VI are computed and given out (this requires alt.clustering to be specified).

aggregateonly

logical. If TRUE (and not compareonly), no clusterwise but only aggregated information is given out (this cuts the size of the output down a bit).

averagegap

logical. If TRUE, the average of the widest within-cluster gaps over all clusters is given out; if FALSE, the maximum is given out.

pamcrit

logical. If TRUE, the average distance of points to their respective cluster centroids is computed (criterion of the PAM clustering method); centroids are chosen so that they minimise this criterion for the given clustering.

dquantile

numerical between 0 and 1; quantile used for kernel density estimator for density indexes, see Hennig (2017), Sec. 3.6.

nndist

logical. If TRUE, average distance to nnkth nearest neighbour within cluster is computed.

nnk

integer. Number of neighbours used in average and coefficient of variation of distance to nearest within cluster neighbour (clusters with nnk or fewer points are ignored for this).

standardisation

"none", "max", "ave", "q90", or a number. See details.

sepall

logical. If TRUE, the fraction sepprob of smallest distances to other clusters is taken from every cluster. Otherwise, the fraction sepprob of smallest distances overall is used in the computation of sindex.

maxk

numeric. Parsimony is defined as the number of clusters divided by maxk.

cvstan

numeric. cvnnd is standardised by cvstan if there is standardisation, see Details.

object

object of class cquality, output of cqcluster.stats.

stanbound

logical. If TRUE, all index values larger than 1 will be set to 1, and all values smaller than 0 will be set to 0. This is for preparation in case of largeisgood=TRUE (if values are already suitably standardised within cqcluster.stats, it won't do harm and can do good).

largeisgood

logical. If TRUE, indexes x are transformed to 1-x in case that before transformation smaller values indicate a better clustering (that's average.within, mnnd, widestgap, within.cluster.ss, dindex, denscut, pamc, max.diameter, highdgap, cvnnd). For this to make sense, cqcluster.stats should be run with standardisation="max" and summary.cquality with stanbound=TRUE.

...

no effect.
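
The separation index described under sepindex, sepprob and sepall can be illustrated in base R. The following is only a sketch of the sepall=FALSE case and not the package's own code: the helper name sep_index_sketch is hypothetical, and rounding the number of retained distances up with ceiling() is an assumption, because the argument descriptions do not state how the proportion is rounded.

```r
# Hypothetical helper (not part of fpc): mean of the smallest proportion
# `sepprob` of the distances from each point to its closest point in
# another cluster, taken over all points (the sepall = FALSE variant).
sep_index_sketch <- function(d, clustering, sepprob = 0.1) {
  dm <- as.matrix(d)
  n <- nrow(dm)
  # For every point, distance to the nearest point outside its own cluster
  psep <- vapply(seq_len(n), function(i) {
    min(dm[i, clustering != clustering[i]])
  }, numeric(1))
  # Assumption: keep ceiling(sepprob * n) of the smallest values
  m <- max(1, ceiling(sepprob * n))
  mean(sort(psep)[seq_len(m)])
}

x <- c(0, 1, 2, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
sep_example <- sep_index_sketch(dist(x), cl, sepprob = 0.5)
```

Because only the smallest closest-point distances enter the mean, a single borderline point has less influence here than on the plain minimum separation.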

Value

cqcluster.stats with compareonly=FALSE and aggregateonly=FALSE returns a list of type cquality containing the components n, cluster.number, cluster.size, min.cluster.size, noisen, diameter, average.distance, median.distance, separation, average.toother, separation.matrix, ave.between.matrix, average.between, average.within, n.between, n.within, max.diameter, min.separation, within.cluster.ss, clus.avg.silwidths, avg.silwidth, g2, g3, pearsongamma, dunn, dunn2, entropy, wb.ratio, ch, cwidegap, widestgap, corrected.rand, vi, sindex, svec, psep, stan, nnk, mnnd, pamc, pamcentroids, dindex, denscut, highdgap, npenalty, dpenalty, withindensp, densoc, pdistto, pclosetomode, distto, percwdens, percdensoc, parsimony, cvnnd, cvnndc. Some of these are standardised, see Details. If compareonly=TRUE, only corrected.rand, vi are given out. If aggregateonly=TRUE, only n, cluster.number, min.cluster.size, noisen, diameter, average.between, average.within, max.diameter, min.separation, within.cluster.ss, avg.silwidth, g2, g3, pearsongamma, dunn, dunn2, entropy, wb.ratio, ch, widestgap, corrected.rand, vi, sindex, svec, psep, stan, nnk, mnnd, pamc, pamcentroids, dindex, denscut, highdgap, parsimony, cvnnd, cvnndc are given out.

summary.cquality returns a list of type summary.cquality with components average.within, nnk, mnnd, avg.silwidth, widestgap, sindex, pearsongamma, entropy, pamc, within.cluster.ss, dindex, denscut, highdgap, parsimony, max.diameter, min.separation, cvnnd. These are as documented below for cqcluster.stats, but after transformation by stanbound and largeisgood, see arguments.

n

number of points.

cluster.number

number of clusters.

cluster.size

vector of cluster sizes (number of points).

min.cluster.size

size of smallest cluster.

noisen

number of noise points, see argument noisecluster (noisen=0 if noisecluster=FALSE).

diameter

vector of cluster diameters (maximum within cluster distances).

average.distance

vector of clusterwise within cluster average distances.

median.distance

vector of clusterwise within cluster distance medians.

separation

vector of clusterwise minimum distances of a point in the cluster to a point of another cluster.

average.toother

vector of clusterwise average distances of a point in the cluster to the points of other clusters.

separation.matrix

matrix of separation values between all pairs of clusters.

ave.between.matrix

matrix of mean dissimilarities between points of every pair of clusters.

average.between

average distance between clusters.

average.within

average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight).

n.between

number of distances between clusters.

n.within

number of distances within clusters.

max.diameter

maximum cluster diameter.

min.separation

minimum cluster separation.

within.cluster.ss

a generalisation of the within clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix. For general distance measures, this is half the sum of the within cluster squared dissimilarities divided by the cluster size.

clus.avg.silwidths

vector of cluster average silhouette widths. See silhouette.

avg.silwidth

average silhouette width. See silhouette.

g2

Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62).

g3

G3 coefficient. See Gordon (1999, p. 62).

pearsongamma

correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001).

dunn

minimum separation / maximum diameter. Dunn index, see Halkidi et al. (2002).

dunn2

minimum average dissimilarity between two clusters / maximum average within cluster dissimilarity, another version of the family of Dunn indexes.

entropy

entropy of the distribution of cluster memberships, see Meila (2007).

wb.ratio

average.within/average.between.

ch

Calinski and Harabasz index (Calinski and Harabasz 1974, optimal in Milligan and Cooper 1985; generalised for dissimilarities in Hennig and Liao 2013).

cwidegap

vector of widest within-cluster gaps.

widestgap

widest within-cluster gap or average of cluster-wise widest within-cluster gap, depending on parameter averagegap.

corrected.rand

corrected Rand index (if alt.clustering has been specified), see Gordon (1999, p. 198).

vi

variation of information (VI) index (if alt.clustering has been specified), see Meila (2007).

sindex

separation index, see argument sepindex.

svec

vector of smallest closest distances of points to next cluster that are used in the computation of sindex if sepall=TRUE.

psep

vector of all closest distances of points to next cluster.

stan

value by which some statistics were standardised, see Details.

nnk

value of input parameter nnk.

mnnd

average distance to nnkth nearest neighbour within cluster.

pamc

average distance to cluster centroid.

pamcentroids

index numbers of cluster centroids.

dindex

this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2017); low values are good.

denscut

this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2017); low values are good.

highdgap

this measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2017); low values are good.

npenalty

vector of penalties for all clusters that are used in the computation of denscut, see Hennig (2017) (these are sums of penalties over all points in the cluster).

dpenalty

vector of penalties for all clusters that are used in the computation of dindex, see Hennig (2017) (these are sums of several penalties for density increase when going from the mode outward in the cluster).

withindensp

distance-based kernel density values for all points as computed in Sec. 3.6 of Hennig (2017).

densoc

contribution of points from other clusters than the one to which a point is assigned to the density, for all points; called h_o in Sec. 3.6 of Hennig (2017).

pdistto

list that for all clusters has a sequence of point numbers. These are the points already incorporated in the sequence of points constructed in the algorithm in Sec. 3.6 of Hennig (2017) to which the next point to be joined is connected.

pclosetomode

list that for all clusters has a sequence of point numbers. Sequence of points to be incorporated in the sequence of points constructed in the algorithm in Sec. 3.6 of Hennig (2017).

distto

list that for all clusters has a sequence of differences between the standardised densities (see percwdens) at the new point added and the point to which it is connected (if this is positive, the penalty is this to the square), in the algorithm in Sec. 3.6 of Hennig (2017).

percwdens

this is withindensp divided by its maximum.

percdensoc

this is densoc divided by the maximum of withindensp, called h_o^* in Sec. 3.6 of Hennig (2017).

parsimony

number of clusters divided by maxk.

cvnnd

coefficient of variation of dissimilarities to nnkth nearest within-cluster neighbour, measuring uniformity of within-cluster densities, weighted over all clusters, see Sec. 3.7 of Hennig (2017).

cvnndc

vector of cluster-wise coefficients of variation of dissimilarities to nnkth nearest within-cluster neighbour as required in computation of cvnnd.
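
The description of within.cluster.ss above can be checked with a small base R sketch. It is an illustration under stated assumptions rather than the package's implementation: the helper name within_ss_sketch is hypothetical, and "the sum of the within cluster squared dissimilarities" is read as the sum over the full within-cluster submatrix, in which every pair appears twice. Under that reading, the value computed from Euclidean distances agrees with the usual sum of squared deviations from the cluster means.

```r
# Hypothetical helper (not part of fpc): generalised within-cluster sum of
# squares computed from dissimilarities only.
within_ss_sketch <- function(d, clustering) {
  dm <- as.matrix(d)
  total <- 0
  for (k in unique(clustering)) {
    idx <- which(clustering == k)
    # The full submatrix contains every pair twice; take half of its sum
    # of squares and divide by the cluster size n_k.
    total <- total + sum(dm[idx, idx, drop = FALSE]^2) / (2 * length(idx))
  }
  total
}

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
cl <- rep(1:2, each = 10)

ss_from_dist <- within_ss_sketch(dist(x), cl)

# Reference value: squared Euclidean deviations from the cluster means
ss_from_means <- sum(sapply(1:2, function(k) {
  xk <- x[cl == k, , drop = FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)
}))
```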

Details

The standardisation-parameter governs the standardisation of the index values. standardisation="none" means that unstandardised raw values of indexes are given out. Otherwise, entropy will be standardised by the maximum possible value for the given number of clusters; within.cluster.ss and between.cluster.ss will be standardised by the overall sum of squares; mnnd will be standardised by the maximum distance to the nnkth nearest neighbour within cluster; pearsongamma will be standardised by adding 1 and dividing by 2; cvnnd will be standardised by cvstan (the default is the possible maximum).
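
Two of the standardisations in the paragraph above are easy to reproduce by hand. The following base R sketch is illustrative only and does not call the package; the variable names are made up for the example, and the use of the natural logarithm for the entropy is an assumption (the standardised value does not depend on the base).

```r
cl <- c(1, 1, 1, 2, 2, 3)
x <- c(0, 1, 2, 10, 11, 20)
d <- dist(x)

# Entropy of the cluster membership distribution (natural log assumed)
p <- as.vector(table(cl)) / length(cl)
entropy_raw <- -sum(p * log(p))
# The maximum possible entropy for k clusters is log(k), reached when
# all clusters have the same size
entropy_std <- entropy_raw / log(length(p))

# Pearson gamma: correlation between the distances and a 0-1 vector,
# 0 = same cluster, 1 = different clusters. dist(cl) lists the pairs in
# the same order as dist(x), and is non-zero exactly for mixed pairs.
different <- as.numeric(as.vector(dist(cl)) > 0)
gamma_raw <- cor(as.vector(d), different)
# A correlation lies in [-1, 1]; adding 1 and halving maps it to [0, 1]
gamma_std <- (gamma_raw + 1) / 2
```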

standardisation allows options for the standardisation of average.within, sindex, wgap, pamcrit, max.diameter, min.separation and can be "max" (maximum distance), "ave" (average distance), "q90" (0.9-quantile of distances), or a positive number. "max" is the default and standardises all the listed indexes into the range [0,1].
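
How the standardisation options translate into a divisor can be sketched as follows. This is a guess at the logic implied by the text, not code taken from the package: stan_sketch is a hypothetical helper, and returning 1 for "none" (so that division leaves the raw value unchanged) is an assumption made for the illustration.

```r
# Hypothetical helper (not part of fpc): value by which distance-based
# indexes would be divided for each standardisation option.
stan_sketch <- function(d, standardisation = "max") {
  dv <- as.vector(d)
  # A positive number is used directly as the divisor
  if (is.numeric(standardisation)) return(standardisation)
  switch(standardisation,
         none = 1,                          # assumption: leave values as is
         max  = max(dv),                    # maximum distance
         ave  = mean(dv),                   # average distance
         q90  = unname(quantile(dv, 0.9)),  # 0.9-quantile of distances
         stop("unknown standardisation option"))
}

x <- c(0, 1, 3, 7)
d <- dist(x)
# Dividing by the maximum distance maps every distance into [0, 1]
scaled_max <- as.vector(d) / stan_sketch(d, "max")
```

Only "max" guarantees values in [0,1] for every distance-based index; with "ave", "q90" or a user-supplied number, values above 1 are possible, which is where stanbound=TRUE in summary.cquality becomes relevant.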

References

Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.

Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.

Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.

Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.

Hennig, C. (2013) How many bee species? A case study in determining the number of clusters. In: Spiliopoulou, L. Schmidt-Thieme, R. Janning (eds.): "Data Analysis, Machine Learning and Knowledge Discovery", Springer, Berlin, 41-49.

Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282

Kaufman, L. and Rousseeuw, P.J. (1990). "Finding Groups in Data: An Introduction to Cluster Analysis". Wiley, New York.

Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.

Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.

Examples


# NOT RUN {
  library(fpc)  # provides rFace and cqcluster.stats
  set.seed(20000)
  options(digits=3)
  face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
  dface <- dist(face)
  complete3 <- cutree(hclust(dface),3)
  cqcluster.stats(dface,complete3,
                alt.clustering=as.integer(attr(face,"grouping")))
  
# }
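
A further usage sketch, using only the arguments documented above, shows compareonly and the summary method. It assumes that the fpc package is installed and is skipped otherwise; the simulated data and all object names are made up for the illustration.

```r
# Assumes the fpc package is installed; the block is skipped otherwise.
if (requireNamespace("fpc", quietly = TRUE)) {
  set.seed(123)
  # Two simulated groups of 50 points each in two dimensions
  x <- rbind(matrix(rnorm(100), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
  truth <- rep(1:2, each = 50)
  d <- dist(x)
  cl <- cutree(hclust(d), 2)

  # Only the corrected Rand index and Meila's VI are returned
  cmp <- fpc::cqcluster.stats(d, cl, alt.clustering = truth,
                              compareonly = TRUE)

  # Full statistics, then indexes bounded to [0, 1] with large = good
  cq <- fpc::cqcluster.stats(d, cl, standardisation = "max")
  cq_summary <- summary(cq, stanbound = TRUE, largeisgood = TRUE)
}
```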
