Two measures of dissimilarity between the within-cluster distributions of a dataset and a normal or uniform distribution. For the normal, it is the Kolmogorov distance between the distribution of Mahalanobis distances to the cluster center and a chi-squared distribution. For the uniform, it is the Kolmogorov distance between the distribution of distances to the kth nearest within-cluster neighbour and a Gamma distribution (based on Byers and Raftery (1998)). The clusterwise values are aggregated by weighting with the cluster sizes.
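The normal-distribution measure described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper name kd_normal_sketch is hypothetical, the Kolmogorov distance is taken from the ks.test statistic, and the sketch assumes every within-cluster covariance matrix is invertible, so it does not reproduce the function's handling of the singular case.

```r
# Hypothetical helper (not part of any package): size-weighted Kolmogorov
# distance between squared within-cluster Mahalanobis distances and a
# chi-squared distribution with p degrees of freedom.
kd_normal_sketch <- function(x, clustering) {
  x <- as.matrix(x)
  p <- ncol(x)
  k <- max(clustering)
  kd <- numeric(k)   # cluster-wise Kolmogorov distances
  n_c <- numeric(k)  # cluster sizes, used as weights
  for (j in seq_len(k)) {
    xj <- x[clustering == j, , drop = FALSE]
    n_c[j] <- nrow(xj)
    # mahalanobis() returns squared distances; assumes cov(xj) is invertible
    d2 <- mahalanobis(xj, colMeans(xj), cov(xj))
    kd[j] <- unname(ks.test(d2, "pchisq", df = p)$statistic)
  }
  list(clusterwise = kd, aggregated = sum(n_c * kd) / sum(n_c))
}

# Two well-separated Gaussian clusters as toy data
set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))
cl <- rep(1:2, each = 100)
res <- kd_normal_sketch(x, cl)
```

Because the aggregate is a size-weighted mean, it always lies between the smallest and largest cluster-wise value.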
distrsimilarity(x,clustering,noisecluster = FALSE,
distribution=c("normal","uniform"),nnk=2,
largeisgood=FALSE,messages=FALSE)
x: the data matrix; a numerical object which can be coerced to a matrix.
clustering: integer vector of class numbers; length must equal nrow(x), and numbers must run from 1 to the number of clusters.
noisecluster: logical. If TRUE, the cluster with the largest number is ignored for the computations.
distribution: vector containing "normal", "uniform" or both. Indicates which of the two dissimilarities is/are computed.
nnk: integer. Number of nearest neighbors to use for the dissimilarity to the uniform.
largeisgood: logical. If TRUE, dissimilarities d are transformed to 1-d (this means that larger values indicate a better fit).
messages: logical. If TRUE, warnings are given if within-cluster covariance matrices are not invertible (in which case all within-cluster Mahalanobis distances are set to zero).
List with the following components
Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea).
Kolmogorov distance between distribution of distances to nnk-th nearest within-cluster neighbor and appropriate Gamma-distribution, see Byers and Raftery (1998), aggregated over clusters.
vector of cluster-wise Kolmogorov distances between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution.
vector of cluster-wise Kolmogorov distances between distribution of distances to nnk-th nearest within-cluster neighbor and appropriate Gamma-distribution.
vector of Mahalanobis distances to the respective cluster center.
vector of distances to nnk-th nearest within-cluster neighbor.
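The nearest-neighbor quantities can be sketched in base R for a single cluster. This is an illustration under stated assumptions, not the package's code: the helper names knn_dist, kolm_dist and kd_uniform_sketch are hypothetical. It relies on the standard result for a homogeneous Poisson process in p dimensions that the p-th power of the k-th nearest-neighbor distance follows a Gamma distribution with shape k. Estimating the Gamma rate as k / mean(y) is an assumption made here for simplicity, and boundary effects are ignored.

```r
# Hypothetical helper: distance from each point to its k-th nearest neighbor
knn_dist <- function(xj, k = 2) {
  d <- as.matrix(dist(xj))
  # position 1 of each sorted row is the point itself (distance 0)
  apply(d, 1, function(row) sort(row)[k + 1])
}

# Hypothetical helper: Kolmogorov distance between a sample and a given CDF
kolm_dist <- function(y, cdf, ...) {
  y <- sort(y)
  n <- length(y)
  Fy <- cdf(y, ...)
  i <- seq_len(n)
  max(pmax(abs(i / n - Fy), abs((i - 1) / n - Fy)))
}

# Hypothetical helper: dissimilarity of one cluster to a uniform distribution
kd_uniform_sketch <- function(xj, k = 2) {
  xj <- as.matrix(xj)
  p <- ncol(xj)
  y <- knn_dist(xj, k)^p   # D_k^p is Gamma(shape = k) under uniformity
  rate <- k / mean(y)      # assumed rate estimate (known shape k)
  kolm_dist(y, pgamma, shape = k, rate = rate)
}

# Toy data: 200 points uniform on the unit square
set.seed(2)
u <- matrix(runif(400), ncol = 2)
dk <- knn_dist(u, k = 2)
kd <- kd_uniform_sketch(u, k = 2)
```

The Kolmogorov distance is computed by hand rather than with ks.test because mutual nearest neighbors share the same distance, and ks.test warns about such ties.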
Byers, S. and Raftery, A. E. (1998) Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, Journal of the American Statistical Association, 93, 577-584.
Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282
cqcluster.stats, cluster.stats for more cluster validity statistics.
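library(fpc)  # provides rFace and distrsimilarity
set.seed(20000)
options(digits=3)
# Simulate 200 two-dimensional points from the "face" benchmark data
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
# Partition the data into 3 clusters with k-means
km3 <- kmeans(face,3)
# Compute both dissimilarities (normal and uniform) for this clustering
distrsimilarity(face,km3$cluster)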