cluster_validity: Internal Cluster Validity Measures

Description

Implementation of a number of so-called cluster validity indices critically reviewed in (Gagolewski, Bartoszuk, Cena, 2021). See Section 2 therein and (Gagolewski, 2022) for the respective definitions.

The greater the index value, the more valid (whatever that means) the assessed partition. For consistency, the Ball-Hall and Davies-Bouldin indexes as well as the within-cluster sum of squares (WCSS) take negative values.

Usage

calinski_harabasz_index(X, y)
dunnowa_index(
  X,
  y,
  M = 25L,
  owa_numerator = "SMin:5",
  owa_denominator = "Const"
)
generalised_dunn_index(X, y, lowercase_d, uppercase_d)
negated_ball_hall_index(X, y)
negated_davies_bouldin_index(X, y)
negated_wcss_index(X, y)
silhouette_index(X, y)
silhouette_w_index(X, y)
wcnn_index(X, y, M = 25L)

Value

A single numeric value (the more, the better).

Arguments

X: numeric matrix with n rows and d columns, representing n points in a d-dimensional space
y: vector of n integer labels, representing a partition whose quality is to be assessed; y[i] is the cluster ID of the i-th point, X[i, ]; 1 <= y[i] <= K, where K is the number or clusters
M: number of nearest neighbours
owa_numerator, owa_denominator: single string specifying the OWA operators to use in the definition of the DuNN index; one of: "Mean", "Min", "Max", "Const", "SMin:D", "SMax:D", where D is an integer defining the degree of smoothness
lowercase_d: an integer between 1 and 5, denoting \(d_1\), ..., \(d_5\) in the definition of the generalised Dunn (Bezdek-Pal) index (numerator: min, max, and mean pairwise intracluster distance, distance between cluster centroids, weighted point-centroid distance, respectively)
uppercase_d: an integer between 1 and 3, denoting \(D_1\), ..., \(D_3\) in the definition of the generalised Dunn (Bezdek-Pal) index (denominator: max and min pairwise intracluster distance, average point-centroid distance, respectively)

Author

Marek Gagolewski and other contributors

References

Ball G.H., Hall D.J., ISODATA: A novel method of data analysis and pattern classification, Technical report No. AD699616, Stanford Research Institute, 1965.

Bezdek J., Pal N., Some new indexes of cluster validity, IEEE Transactions on Systems, Man, and Cybernetics, Part B 28, 1998, 301-315, tools:::Rd_expr_doi("10.1109/3477.678624").

Calinski T., Harabasz J., A dendrite method for cluster analysis, Communications in Statistics 3(1), 1974, 1-27, tools:::Rd_expr_doi("10.1080/03610927408827101").

Davies D.L., Bouldin D.W., A Cluster Separation Measure, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1 (2), 1979, 224-227, tools:::Rd_expr_doi("10.1109/TPAMI.1979.4766909").

Dunn J.C., A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters, Journal of Cybernetics 3(3), 1973, 32-57, tools:::Rd_expr_doi("10.1080/01969727308546046").

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620-636, 2021, tools:::Rd_expr_doi("10.1016/j.ins.2021.10.004"); preprint: https://raw.githubusercontent.com/gagolews/bibliography/master/preprints/2021cvi.pdf.

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270, tools:::Rd_expr_doi("10.1016/j.softx.2022.101270"), https://clustering-benchmarks.gagolewski.com.

Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis, Computational and Applied Mathematics 20, 1987, 53-65, tools:::Rd_expr_doi("10.1016/0377-0427(87)90125-7").

Examples

Run this code

X <- as.matrix(iris[,1:4])
X[,] <- jitter(X)  # otherwise we get a non-unique solution
y <- as.integer(iris[[5]])
calinski_harabasz_index(X, y)  # good
calinski_harabasz_index(X, sample(1:3, nrow(X), replace=TRUE))  # bad

Run the code above in your browser using DataLab