cvi: Cluster validity indices

Description

Compute different cluster validity indices (CVIs) of a given cluster partition, using the clustering distance measure and centroid function if applicable.

Usage

cvi(a, b = NULL, type = "valid", ..., log.base = 10)

Arguments

a

An object returned by the dtwclust or tsclust function, or a vector that can be coerced to integers which indicate the cluster memberships.

b

If needed, a vector that can be coerced to integers which indicate the cluster memberships. The ground truth (if known) should be provided here.

type

Character vector indicating which indices are to be computed. See supported values below.

...

Arguments to pass to and from other methods.

log.base

Base of the logarithm to be used in the calculation of VI.

Value

The chosen CVIs

External CVIs

The first four CVIs (RI, ARI, J and FM) are calculated via comPart, so please refer to that function's documentation.

"RI": Rand Index (to be maximized).
"ARI": Adjusted Rand Index (to be maximized).
"J": Jaccard Index (to be maximized).
"FM": Fowlkes-Mallows (to be maximized).
"VI": Variation of Information (Meila (2003); to be minimized).

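As an illustration of what two of the external indices measure, the following base R sketch computes the Rand Index and the Variation of Information for two toy partitions. The helper names rand_index and variation_of_information are hypothetical and written only for this example; the package's own computation may differ in detail. The log.base parameter mirrors the argument described above.

```r
# Two toy partitions of six observations (made-up labels, for illustration only)
truth <- c(1L, 1L, 1L, 2L, 2L, 2L)
found <- c(1L, 1L, 2L, 2L, 2L, 2L)

# Rand Index: fraction of observation pairs on which both partitions agree
# (hypothetical helper, not part of the package)
rand_index <- function(x, y) {
  same_x <- outer(x, x, "==")
  same_y <- outer(y, y, "==")
  pairs <- upper.tri(same_x)  # each unordered pair counted once
  mean(same_x[pairs] == same_y[pairs])
}

# Variation of Information: H(X) + H(Y) - 2 * I(X; Y)
# (hypothetical helper, not part of the package)
variation_of_information <- function(x, y, log.base = 10) {
  joint <- table(x, y) / length(x)  # joint distribution of the two labelings
  px <- rowSums(joint)
  py <- colSums(joint)
  nz <- joint > 0                   # skip empty cells to avoid log(0)
  h_x <- -sum(px * log(px, base = log.base))
  h_y <- -sum(py * log(py, base = log.base))
  mi <- sum(joint[nz] * log(joint[nz] / outer(px, py)[nz], base = log.base))
  h_x + h_y - 2 * mi
}

ri <- rand_index(truth, found)
vi10 <- variation_of_information(truth, found, log.base = 10)
vi2 <- variation_of_information(truth, found, log.base = 2)
```

Changing log.base only rescales VI by a constant factor, so it does not change how partitions rank against each other.
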
Internal CVIs

The indices marked with an exclamation mark (!) calculate (or re-use if already available) the whole distance matrix between the series in the data. If you were trying to avoid this in the first place, then these CVIs might not be suitable for your application.

The indices marked with a question mark (?) depend on the extracted centroids, so bear that in mind if a hierarchical procedure was used and/or the centroid function has associated randomness (such as shape_extraction with series of different length).

The indices marked with a tilde (~) require the calculation of a global centroid. Since DBA and shape_extraction (for series of different length) have some randomness associated, these indices might not be appropriate for those centroids.

"Sil" (!): Silhouette index (Arbelaitz et al. (2013); to be maximized).
"D" (!): Dunn index (Arbelaitz et al. (2013); to be maximized).
"COP" (!): COP index (Arbelaitz et al. (2013); to be minimized).
"DB" (?): Davies-Bouldin index (Arbelaitz et al. (2013); to be minimized).
"DBstar" (?): Modified Davies-Bouldin index (DB*) (Kim and Ramakrishna (2005); to be minimized).
"CH" (~): Calinski-Harabasz index (Arbelaitz et al. (2013); to be maximized).
"SF" (~): Score Function (Saitta et al. (2007); to be maximized).

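To show why the indices marked with (!) need the whole distance matrix, here is a minimal base R sketch of the Silhouette index. It is not the package's implementation: the helper silhouette_index is hypothetical, the data are made up, and Euclidean distance stands in for the clustering distance (for example DTW).

```r
# Toy data: two well-separated groups on a line (made-up values)
x <- c(0, 0.1, 0.2, 5, 5.1, 5.2)
cl <- c(1L, 1L, 1L, 2L, 2L, 2L)

# Full pairwise distance matrix; Euclidean is used here as a stand-in for DTW
d <- as.matrix(dist(x))

# Hypothetical helper: mean silhouette width over all observations.
# Assumes at least two clusters are present in cl.
silhouette_index <- function(d, cl) {
  ids <- sort(unique(cl))
  s <- vapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1L) return(0)  # common convention for singleton clusters
    # mean distance to the other members of the same cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # smallest mean distance to the members of any other cluster
    b <- min(vapply(setdiff(ids, cl[i]),
                    function(k) mean(d[i, cl == k]),
                    numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

sil_good <- silhouette_index(d, cl)
sil_bad <- silhouette_index(d, c(1L, 2L, 1L, 2L, 1L, 2L))
```

Every observation is compared against every other one, which is why the cost grows quadratically with the number of series.
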
Additionally

"valid": Returns all valid indices depending on the type of a and whether b was provided or not.
"internal": Returns all internal CVIs. Only supported for dtwclust-class objects.
"external": Returns all external CVIs. Requires b to be provided.

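A possible call with two membership vectors is sketched below. It assumes the dtwclust package is installed and that its uciCT dataset provides the CharTrajLabels vector; the shuffled labels are only a placeholder for a real clustering result. The code is skipped if the package is not available.

```r
if (requireNamespace("dtwclust", quietly = TRUE)) {
  library(dtwclust)
  data(uciCT)  # assumed to provide CharTraj (series) and CharTrajLabels (labels)

  # Shuffled labels stand in for an actual clustering result
  set.seed(1)
  shuffled <- sample(CharTrajLabels)

  # Both a and b are membership vectors, so only external CVIs apply
  ext <- cvi(CharTrajLabels, shuffled, type = "external")
  print(ext)
}
```
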
Details

Clustering is commonly considered to be an unsupervised procedure, so evaluating its performance can be rather subjective. However, a great amount of effort has been invested in trying to standardize cluster evaluation metrics by using cluster validity indices (CVIs).

CVIs can be classified as internal, external or relative depending on how they are computed. Focusing on the first two, the crucial difference is that internal CVIs only consider the partitioned data and try to define a measure of cluster purity, whereas external CVIs compare the obtained partition to the correct one. Thus, external CVIs can only be used if the ground truth is known. Each index defines its own range of values and whether it is to be minimized or maximized. In many cases, these CVIs can be used to evaluate the result of a clustering algorithm regardless of how the clustering works internally, or how the partition came to be.

Which CVI will work best cannot be determined a priori, so they should be tested for each specific application. Usually, several CVIs are computed and compared to each other, possibly using a majority vote to decide on a final result. Furthermore, it should be noted that many CVIs perform additional distance calculations when being computed, which can be computationally expensive when using DTW.
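
One way a majority vote could be carried out is sketched below in base R. The CVI values are made-up numbers for three hypothetical candidate partitions (k2, k3, k4), not real results; the direction of each index follows the lists of internal CVIs given earlier.

```r
# Hypothetical CVI values for three candidate partitions (made-up numbers)
scores <- rbind(
  k2 = c(Sil = 0.41, D = 0.12, COP = 0.35, DB = 0.90, CH = 110),
  k3 = c(Sil = 0.55, D = 0.20, COP = 0.28, DB = 0.70, CH = 150),
  k4 = c(Sil = 0.50, D = 0.25, COP = 0.30, DB = 0.75, CH = 140)
)

# Whether each index is to be minimized (TRUE) or maximized (FALSE)
minimize <- c(Sil = FALSE, D = FALSE, COP = TRUE, DB = TRUE, CH = FALSE)

# Flip the sign of indices to be minimized so that larger is always better
oriented <- sweep(scores, 2, ifelse(minimize[colnames(scores)], -1, 1), "*")

# Each index votes for its best partition; ties go to the first candidate
votes <- rownames(scores)[apply(oriented, 2, which.max)]
tally <- table(votes)
winner <- names(tally)[which.max(tally)]
```

Handling the minimize/maximize direction explicitly matters here, since mixing the two without flipping signs would let indices such as COP and DB vote for the worst partition.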

Note that a fuzzy partition can be changed into a crisp one, which makes it compatible with many of the existing CVIs. However, there are also fuzzy CVIs tailored specifically to fuzzy clustering. These may be more suitable in those situations, but they have not been implemented here yet.
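
A common way to turn a fuzzy partition into a crisp one is to assign each series to the cluster with the highest membership. The base R sketch below assumes a hypothetical membership matrix u with one row per series and one column per cluster; the values are made up.

```r
# Hypothetical fuzzy membership matrix: rows are series, columns are clusters
u <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.1, 0.8, 0.1),
  c(0.3, 0.3, 0.4),
  c(0.5, 0.4, 0.1)
)

# Crisp partition: each series goes to its highest-membership cluster
# (which.max picks the first cluster in case of ties)
crisp <- apply(u, 1, which.max)
```

The resulting integer vector has the form expected for the membership vectors described under Arguments.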

References

Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Perez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243-256.

Kim, M., & Ramakrishna, R. S. (2005). New indices for cluster validity assessment. Pattern Recognition Letters, 26(15), 2353-2363.

Meila, M. (2003). Comparing clusterings by the variation of information. In Learning theory and kernel machines (pp. 173-187). Springer Berlin Heidelberg.

Saitta, S., Raphael, B., & Smith, I. F. (2007). A bounded index for cluster validity. In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 174-187). Springer Berlin Heidelberg.