Two measures of dissimilarity between the within-cluster distributions of a dataset and a normal or uniform distribution. For the normal, it is the Kolmogorov distance between the distribution of the Mahalanobis distances to the cluster center and a chi-squared distribution. For the uniform, it is the Kolmogorov distance between the distribution of the distances to the kth nearest within-cluster neighbour (k is set by nnk) and a Gamma distribution, based on Byers and Raftery (1998). The clusterwise values are aggregated by weighting with the cluster sizes.
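The normal-distribution measure can be illustrated with a minimal base R sketch. The helper name `kd_normal_sketch` is hypothetical and not part of the package; it only assumes that squared Mahalanobis distances (which is what `stats::mahalanobis` returns) are compared with a chi-squared distribution whose degrees of freedom equal the number of variables. The actual implementation may differ in detail, and unlike the documented function this sketch does not handle non-invertible covariance matrices.

```r
# Hypothetical helper, not fpc code: per-cluster Kolmogorov distance to the
# chi-squared reference, then a cluster-size-weighted average.
kd_normal_sketch <- function(x, clustering) {
  x <- as.matrix(x)
  p <- ncol(x)
  ids <- sort(unique(clustering))
  kd <- sizes <- numeric(length(ids))
  for (j in seq_along(ids)) {
    xc <- x[clustering == ids[j], , drop = FALSE]
    sizes[j] <- nrow(xc)
    # mahalanobis() returns squared distances; for Gaussian data these are
    # approximately chi-squared with p degrees of freedom.
    d2 <- mahalanobis(xc, colMeans(xc), cov(xc))
    kd[j] <- unname(ks.test(d2, "pchisq", df = p)$statistic)
  }
  list(clusterwise = kd, aggregated = sum(sizes * kd) / sum(sizes))
}

set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2))
cl <- rep(1:2, each = 100)
res <- kd_normal_sketch(x, cl)
res$aggregated
```

Because both simulated clusters have the same size here, the weighted aggregate coincides with the plain mean of the two clusterwise values.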
distrsimilarity(x, clustering, noisecluster = FALSE,
                distribution = c("normal", "uniform"), nnk = 2,
                largeisgood = FALSE, messages = FALSE)

List with the following components:
Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea).
Kolmogorov distance between distribution of distances to
nnkth nearest within-cluster neighbor and appropriate
Gamma-distribution, see Byers and Raftery (1998), aggregated over
clusters.
vector of cluster-wise Kolmogorov distances between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution.
vector of cluster-wise Kolmogorov distances between
distribution of distances to nnkth nearest within-cluster
neighbor and appropriate Gamma-distribution.
vector of Mahalanobis distances to the respective cluster center.
vector of distances to the nnkth nearest within-cluster neighbor.
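The nearest-neighbour quantities listed above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the helper name `kd_uniform_sketch` is hypothetical, and it relies on the Byers and Raftery (1998) result that, for a homogeneous Poisson process in p dimensions, the kth nearest neighbour distance raised to the power p follows a Gamma distribution with shape k. Estimating the rate as k divided by the mean of the transformed distances is the maximum likelihood choice under that model; whether the package estimates it the same way is an assumption.

```r
# Hypothetical helper, not fpc code: per-cluster Kolmogorov distance between
# transformed nnk-th nearest neighbour distances and a fitted Gamma.
kd_uniform_sketch <- function(x, clustering, nnk = 2) {
  x <- as.matrix(x)
  p <- ncol(x)
  ids <- sort(unique(clustering))
  kd <- sizes <- numeric(length(ids))
  for (j in seq_along(ids)) {
    xc <- x[clustering == ids[j], , drop = FALSE]
    sizes[j] <- nrow(xc)
    dm <- as.matrix(dist(xc))
    # Position 1 of each sorted row is the point itself (distance 0),
    # so the nnk-th nearest neighbour sits at position nnk + 1.
    dk <- apply(dm, 1, function(d) sort(d)[nnk + 1])
    y <- dk^p
    rate <- nnk / mean(y)  # ML estimate of the Gamma rate (assumption)
    kd[j] <- unname(ks.test(y, "pgamma", shape = nnk, rate = rate)$statistic)
  }
  list(clusterwise = kd, aggregated = sum(sizes * kd) / sum(sizes))
}

set.seed(2)
x <- matrix(runif(400), ncol = 2)
cl <- rep(1:2, each = 100)
res <- kd_uniform_sketch(x, cl)
res$clusterwise
```

The sketch requires every cluster to contain more than nnk points, and it ignores edge effects at the boundary of the sampled region.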
x: the data matrix; a numerical object which can be coerced to a matrix.
clustering: integer vector of class numbers; length must equal nrow(x), numbers must go from 1 to the number of clusters.
noisecluster: logical. If TRUE, the cluster with the largest number is treated as noise and ignored for the computations.
distribution: vector of "normal", "uniform" or both. Indicates which of the two dissimilarities is/are computed.
nnk: integer. Number of nearest neighbors to use for the dissimilarity to the uniform.
largeisgood: logical. If TRUE, each dissimilarity d is transformed to 1-d, so that larger values indicate a better fit.
messages: logical. If TRUE, warnings are given if within-cluster covariance matrices are not invertible (in which case all within-cluster Mahalanobis distances are set to zero).
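The conventions described for the clustering vector, the noise cluster, and the 1-d transformation can be made concrete with a few small base R helpers. All three function names are hypothetical and exist only for illustration. The check assumes that "numbers must go from 1 to the number of clusters" means every label between 1 and the maximum is actually used.

```r
# Hypothetical helpers illustrating the argument conventions; not fpc code.

# TRUE if the labels have the right length and cover exactly 1, ..., k.
check_clustering <- function(x, clustering) {
  k <- max(clustering)
  length(clustering) == nrow(as.matrix(x)) &&
    all(clustering == round(clustering)) &&
    setequal(unique(clustering), seq_len(k))
}

# Cluster numbers that enter the computation; the largest one is dropped
# when it is declared to be a noise cluster.
clusters_used <- function(clustering, noisecluster = FALSE) {
  k <- max(clustering)
  if (noisecluster) seq_len(k - 1) else seq_len(k)
}

# Turn a dissimilarity d in [0, 1] into 1 - d when large values should be good.
flip_if_large_is_good <- function(d, largeisgood = FALSE) {
  if (largeisgood) 1 - d else d
}

x <- matrix(0, nrow = 5, ncol = 2)
cl <- c(1, 1, 2, 2, 3)
check_clustering(x, cl)
clusters_used(cl, noisecluster = TRUE)
flip_if_large_is_good(0.25, largeisgood = TRUE)
```

Since a Kolmogorov distance lies between 0 and 1, the flipped value stays in the same range.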
Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/
Byers, S. and Raftery, A. E. (1998) Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, Journal of the American Statistical Association, 93, 577-584.
Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282
See cqcluster.stats and cluster.stats for more cluster validity statistics.
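library(fpc)  # provides rFace() and distrsimilarity()
set.seed(20000)
options(digits=3)
# Simulate a two-dimensional "face" benchmark data set
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
# Cluster into three groups and compare them with both reference distributions
km3 <- kmeans(face,3)
distrsimilarity(face,km3$cluster)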