The functions described in this section quantify the similarity between
two label vectors x
and y
which represent two partitions
of a set of \(n\) elements into, respectively, \(K\) and \(L\)
nonempty and pairwise disjoint subsets.
For instance, x
and y
can represent two clusterings
of a dataset with \(n\) observations specified by two vectors
of labels. The functions described here can be used as external cluster
validity measures, where we assume that x
is
a reference (ground-truth) partition whilst y
is the vector
of predicted cluster memberships.
All indices except normalized_clustering_accuracy()
can act as a pairwise partition similarity score: they are symmetric,
i.e., index(x, y) == index(y, x)
.
Each index except mi_score()
(which computes the mutual
information score) outputs 1 given two identical partitions.
Note that partitions are always defined up to a permutation (bijection)
of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4)
represent the same 2-partition.
normalized_clustering_accuracy(x, y = NULL)normalized_pivoted_accuracy(x, y = NULL)
pair_sets_index(x, y = NULL, simplified = FALSE, clipped = TRUE)
adjusted_rand_score(x, y = NULL, clipped = FALSE)
rand_score(x, y = NULL)
adjusted_fm_score(x, y = NULL, clipped = FALSE)
fm_score(x, y = NULL)
mi_score(x, y = NULL)
normalized_mi_score(x, y = NULL)
adjusted_mi_score(x, y = NULL, clipped = FALSE)
normalized_confusion_matrix(x, y = NULL)
normalizing_permutation(x, y = NULL)
Each cluster validity measure is a single numeric value.
normalized_confusion_matrix()
returns a numeric matrix.
normalizing_permutation()
returns a vector of indexes.
an integer vector of length n (or an object coercible to)
representing a K-partition of an n-set (e.g., a reference partition),
or a confusion matrix with K rows and L columns
(see table(x, y)
)
an integer vector of length n (or an object coercible to)
representing an L-partition of the same set (e.g., the output of a
clustering algorithm we wish to compare with x
),
or NULL (if x is an K*L confusion matrix)
whether to assume E=1 in the definition of the pair sets index index, i.e., use Eq. (20) in (Rezaei, Franti, 2016) instead of Eq. (18)
whether the result should be clipped to the unit interval, i.e., [0, 1]
Marek Gagolewski and other contributors
normalized_clustering_accuracy()
(Gagolewski, 2023)
is an asymmetric external cluster validity measure
which assumes that the label vector x
(or rows in the confusion
matrix) represents the reference (ground truth) partition.
It is an average proportion of correctly classified points in each cluster
above the worst case scenario of uniform membership assignment,
with cluster ID matching based on the solution to the maximal linear
sum assignment problem; see normalized_confusion_matrix
).
It is given by:
\(\max_\sigma \frac{1}{K} \sum_{j=1}^K \frac{c_{\sigma(j), j}-c_{\sigma(j),\cdot}/K}{c_{\sigma(j),\cdot}-c_{\sigma(j),\cdot}/K}\),
where \(C\) is a confusion matrix with \(K\) rows and \(L\) columns,
\(\sigma\) is a permutation of the set \(\{1,\dots,\max(K,L)\}\), and
\(c_{i, \cdot}=c_{i, 1}+...+c_{i, L}\) is the i-th row sum,
under the assumption that \(c_{i,j}=0\) for \(i>K\) or \(j>L\)
and \(0/0=0\).
normalized_pivoted_accuracy()
is defined as
\((\max_\sigma \sum_{j=1}^{\max(K,L)} c_{\sigma(j),j}/n-1/\max(K,L))/(1-1/\max(K,L))\),
where \(\sigma\) is a permutation of the set \(\{1,\dots,\max(K,L)\}\),
and \(n\) is the sum of all elements in \(C\).
For non-square matrices, missing rows/columns are assumed
to be filled with 0s.
pair_sets_index()
(PSI) was introduced in (Rezaei, Franti, 2016).
The simplified PSI assumes E=1 in the definition of the index,
i.e., uses Eq. (20) in the said paper instead of Eq. (18).
For non-square matrices, missing rows/columns are assumed
to be filled with 0s.
rand_score()
gives the Rand score (the "probability" of agreement
between the two partitions) and
adjusted_rand_score()
is its version corrected for chance,
see (Hubert, Arabie, 1985): its expected value is 0 given two independent
partitions. Due to the adjustment, the resulting index may be negative
for some inputs.
Similarly, fm_score()
gives the Fowlkes-Mallows (FM) score
and adjusted_fm_score()
is its adjusted-for-chance version;
see (Hubert, Arabie, 1985).
mi_score()
, adjusted_mi_score()
and
normalized_mi_score()
are information-theoretic
scores, based on mutual information,
see the definition of \(AMI_{sum}\) and \(NMI_{sum}\)
in (Vinh et al., 2010).
normalized_confusion_matrix()
computes the confusion matrix
and permutes its rows and columns so that the sum of the elements
of the main diagonal is the largest possible (by solving
the maximal assignment problem).
The function only accepts \(K \leq L\).
The reordering of the columns of a confusion matrix can be determined
by calling normalizing_permutation()
.
Also note that the built-in
table()
determines the standard confusion matrix.
Gagolewski M., A framework for benchmarking clustering algorithms, SoftwareX 20, 2022, 101270, tools:::Rd_expr_doi("10.1016/j.softx.2022.101270"), https://clustering-benchmarks.gagolewski.com.
Gagolewski M., Normalised clustering accuracy: An asymmetric external cluster validity measure, Journal of Classification, 2024, in press, tools:::Rd_expr_doi("10.1007/s00357-024-09482-2").
Hubert L., Arabie P., Comparing partitions, Journal of Classification 2(1), 1985, 193-218, esp. Eqs. (2) and (4).
Meila M., Heckerman D., An experimental comparison of model-based clustering methods, Machine Learning 42, 2001, pp. 9-29, tools:::Rd_expr_doi("10.1023/A:1007648401407").
Rezaei M., Franti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Mining 28(8), 2016, 2173-2186.
Steinley D., Properties of the Hubert-Arabie adjusted Rand index, Psychological Methods 9(3), 2004, pp. 386-396, tools:::Rd_expr_doi("10.1037/1082-989X.9.3.386").
Vinh N.X., Epps J., Bailey J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11, 2010, 2837-2854.
The official online manual of genieclust at https://genieclust.gagolewski.com/
Gagolewski M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15:100722, 2021, tools:::Rd_expr_doi("10.1016/j.softx.2021.100722").
y_true <- iris[[5]]
y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster
normalized_clustering_accuracy(y_true, y_pred)
normalized_pivoted_accuracy(y_true, y_pred)
pair_sets_index(y_true, y_pred)
pair_sets_index(y_true, y_pred, simplified=TRUE)
adjusted_rand_score(y_true, y_pred)
rand_score(table(y_true, y_pred)) # the same
adjusted_fm_score(y_true, y_pred)
fm_score(y_true, y_pred)
mi_score(y_true, y_pred)
normalized_mi_score(y_true, y_pred)
adjusted_mi_score(y_true, y_pred)
normalized_confusion_matrix(y_true, y_pred)
normalizing_permutation(y_true, y_pred)
Run the code above in your browser using DataLab