distrsimilarity: Similarity of within-cluster distributions to normal and uniform

Description

Two measures of dissimilarity between the within-cluster distributions of a dataset and a normal or uniform distribution. For the normal, it is the Kolmogorov distance between the distribution of the Mahalanobis distances to the cluster center and a chi-squared distribution. For the uniform, it is the Kolmogorov distance between the distribution of the distances to the kth nearest within-cluster neighbour and a Gamma distribution (based on Byers and Raftery (1998)). The clusterwise values are aggregated by weighting with the cluster sizes.
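The normal measure and the size-weighted aggregation can be sketched in base R as follows. This is an illustration only, not the fpc source: it assumes squared Mahalanobis distances (as returned by `mahalanobis()`) compared with a chi-squared distribution with `ncol(x)` degrees of freedom, and the toy data and the names `kd_by_cluster` and `kd_aggregated` are hypothetical.

```r
# Sketch only: base-R illustration, not the fpc implementation.
set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2),            # toy cluster 1 (100 points)
           matrix(rnorm(100, mean = 5), ncol = 2))  # toy cluster 2 (50 points)
clustering <- rep(1:2, c(100, 50))

p <- ncol(x)
k <- max(clustering)
kd_by_cluster <- numeric(k)
for (j in seq_len(k)) {
  xj <- x[clustering == j, , drop = FALSE]
  # mahalanobis() returns SQUARED distances to the cluster mean
  d2 <- mahalanobis(xj, colMeans(xj), cov(xj))
  # Kolmogorov distance to a chi-squared distribution (assumed df = p)
  kd_by_cluster[j] <- ks.test(d2, "pchisq", df = p)$statistic
}

# Aggregate the clusterwise values, weighting by cluster size
sizes <- tabulate(clustering, nbins = k)
kd_aggregated <- sum(sizes * kd_by_cluster) / sum(sizes)
```

Values near 0 would indicate that the within-cluster distances look close to what a normal cluster produces; the numbers need not match the `kdnorm` output of `distrsimilarity` exactly.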

Usage

distrsimilarity(x,clustering,noisecluster = FALSE,
distribution=c("normal","uniform"),nnk=2,
largeisgood=FALSE,messages=FALSE)

Value

List with the following components

kdnorm: Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea).
kdunif: Kolmogorov distance between distribution of distances to the nnk-th nearest within-cluster neighbor and appropriate Gamma distribution, see Byers and Raftery (1998), aggregated over clusters.
kdnormc: vector of cluster-wise Kolmogorov distances between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution.
kdunifc: vector of cluster-wise Kolmogorov distances between distribution of distances to the nnk-th nearest within-cluster neighbor and appropriate Gamma distribution.
xmahal: vector of Mahalanobis distances to the respective cluster center.
xdknn: vector of distances to the nnk-th nearest within-cluster neighbor.
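For the uniform measure, a rough base-R sketch for a single cluster is given below. It relies on the result in Byers and Raftery (1998) that, for a homogeneous Poisson process in p dimensions, the p-th power of the distance to the k-th nearest neighbour follows a Gamma distribution with shape k. The rate is estimated here by maximum likelihood as `nnk / mean(y)`, and edge effects are ignored; both are assumptions of this sketch, and the fpc implementation may differ in such details. The names `xj`, `dk`, `y` and `rate_hat` are hypothetical.

```r
# Sketch only: one cluster, base R, not the fpc implementation.
set.seed(2)
xj <- matrix(runif(400), ncol = 2)  # toy cluster: 200 points, uniform on the unit square
nnk <- 2
p <- ncol(xj)

# Full pairwise distance matrix; each column holds the distances from one point
dm <- as.matrix(dist(xj))
# After sorting, position 1 is the zero self-distance,
# so position nnk + 1 is the distance to the nnk-th nearest neighbour
dk <- apply(dm, 2, function(v) sort(v)[nnk + 1])

# Under uniformity, dk^p is approximately Gamma(shape = nnk, rate = unknown)
y <- dk^p
rate_hat <- nnk / mean(y)  # ML estimate of the rate for known shape (assumption)
kd <- unname(ks.test(y, "pgamma", shape = nnk, rate = rate_hat)$statistic)
```

Repeating this per cluster and averaging with cluster-size weights would give an aggregate comparable in spirit to `kdunif`.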

Arguments

x: the data matrix; a numerical object which can be coerced to a matrix.
clustering: integer vector of class numbers; length must equal nrow(x), numbers must go from 1 to the number of clusters.
noisecluster: logical. If TRUE, the cluster with the largest number is ignored for the computations.
distribution: vector of "normal", "uniform" or both. Indicates which of the two dissimilarities is/are computed.
nnk: integer. The distance to the nnk-th nearest within-cluster neighbor is used for the dissimilarity to the uniform.
largeisgood: logical. If TRUE, dissimilarities are transformed to 1-d (this means that larger values indicate a better fit).
messages: logical. If TRUE, warnings are given if within-cluster covariance matrices are not invertible (in which case all within-cluster Mahalanobis distances are set to zero).
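Because `clustering` must use the numbers 1 to the number of clusters, and `noisecluster=TRUE` treats the cluster with the largest number as noise, labels from other sources may need recoding first. A minimal sketch, assuming a hypothetical label vector in which 0 marks noise points:

```r
# Hypothetical input labels: arbitrary cluster codes, with 0 marking noise
labels <- c(0, 2, 2, 5, 0, 5, 2)
is_noise <- labels == 0

clustering <- integer(length(labels))
# Recode the proper clusters to consecutive integers 1, 2, ...
clustering[!is_noise] <- as.integer(factor(labels[!is_noise]))
# Give the noise points the largest number
clustering[is_noise] <- max(clustering) + 1L

clustering
# 3 1 1 2 3 2 1
```

A vector recoded this way could then be passed as `clustering` together with `noisecluster=TRUE`.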

Author

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Byers, S. and Raftery, A. E. (1998) Nearest-Neighbor Clutter Removal for Estimating Features in Spatial Point Processes, Journal of the American Statistical Association, 93, 577-584.

Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282

Examples

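  library(fpc)  # provides rFace and distrsimilarity
  set.seed(20000)
  options(digits=3)
  # Simulate 200 two-dimensional points from the "face" benchmark and cluster them
  face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
  km3 <- kmeans(face,3)
  distrsimilarity(face,km3$cluster)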
