cclust (version 0.6-26)

clustIndex: Cluster Indexes

Description

y is the result of a clustering algorithm, an object of a class such as "cclust". This function calculates the values of several clustering indexes. The values of the indexes can be used independently to determine the number of clusters in a data set.

Usage

clustIndex ( y, x, index = "all" )

Value

Returns a vector with the index values.

Arguments

y

Object of class "cclust" returned by a clustering algorithm such as kmeans

x

Data matrix where columns correspond to variables and rows to observations

index

The indexes to calculate: "calinski", "cindex", "db", "hartigan", "ratkowsky", "scott", "marriot", "ball", "trcovw", "tracew", "friedman", "rubin", "ssi", "likelihood", or "all" for all the indexes. Abbreviations of these names are also accepted.

Author

Evgenia Dimitriadou and Andreas Weingessel

Details

The indexes are described in three groups, based on the statistics mainly used to compute them.

The first group is based on the sum of squares within (\(SSW\)) and between (\(SSB\)) the clusters. These statistics measure the dispersion of the data points in a cluster and between the clusters respectively. These indexes are:

calinski:

\((SSB/(k-1))/(SSW/(n-k))\), where \(n\) is the number of data points and \(k\) is the number of clusters.

hartigan:

\(\log(SSB/SSW)\).

ratkowsky:

\(mean(\sqrt{(varSSB/varSST)})\), where \(varSSB\) stands for the \(SSB\) for every variable and \(varSST\) for the total sum of squares for every variable.

ball:

\(SSW/k\), where \(k\) is the number of clusters.
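The four formulas above can be evaluated by hand. The following is a minimal sketch that applies them literally to a base R `kmeans` fit; it is not the code used inside `clustIndex`, and the names `ssw`, `ssb`, `var_ssb` and `var_sst` are chosen here for illustration only.

```r
# Sketch: first-group indexes computed from a base-R kmeans fit
# (a literal reading of the formulas, not cclust's internal code).
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
k <- 2
n <- nrow(x)
fit <- kmeans(x, centers = k, nstart = 5)

ssw <- fit$tot.withinss   # sum of squares within the clusters
ssb <- fit$betweenss      # sum of squares between the clusters

calinski <- (ssb / (k - 1)) / (ssw / (n - k))
hartigan <- log(ssb / ssw)
ball     <- ssw / k

# Per-variable between-cluster and total sums of squares for ratkowsky
grand   <- colMeans(x)
var_sst <- colSums(sweep(x, 2, grand)^2)
var_ssb <- colSums(fit$size * sweep(fit$centers, 2, grand)^2)
ratkowsky <- mean(sqrt(var_ssb / var_sst))

c(calinski = calinski, hartigan = hartigan,
  ratkowsky = ratkowsky, ball = ball)
```

Summing `var_ssb` over the variables recovers `ssb`, which is a quick consistency check on the per-variable decomposition.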

The second group is based on the statistics of \(T\), i.e., the scatter matrix of the data points, and \(W\), which is the sum of the scatter matrices in every group. These indexes are:

scott:

\(n\log(|T|/|W|)\), where \(n\) is the number of data points and \(|\cdot|\) stands for the determinant of a matrix.

marriot:

\(k^2 |W|\), where \(k\) is the number of clusters.

trcovw:

\(Trace Cov W\).

tracew:

\(Trace W\).

friedman:

\(Trace W^{(-1)} B\), where \(B\) is the scatter matrix of the cluster centers.

rubin:

\(|T|/|W|\).
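As a rough illustration of the second group, the scatter matrices can be built directly from the data and a cluster assignment. The sketch below uses base R only and reads each formula literally. It assumes that \(B\) is the between-cluster scatter weighted by cluster size, so that \(T = W + B\), and that "Trace Cov W" means the trace of `cov(W)`. The helper `scatter` and the name `Tmat` are hypothetical and not part of cclust.

```r
# Sketch: second-group indexes from scatter matrices (base R only).
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
k <- 2
n <- nrow(x)
cl <- kmeans(x, centers = k, nstart = 5)$cluster

# Scatter matrix of a set of points around their own mean
scatter <- function(m) crossprod(sweep(m, 2, colMeans(m)))

Tmat <- scatter(x)                                          # total scatter T
W <- Reduce(`+`, lapply(split.data.frame(x, cl), scatter))  # pooled within scatter
B <- Tmat - W                                               # assumed: T = W + B

scott    <- n * log(det(Tmat) / det(W))
marriot  <- k^2 * det(W)
trcovw   <- sum(diag(cov(W)))   # literal reading of "Trace Cov W"
tracew   <- sum(diag(W))
friedman <- sum(diag(solve(W) %*% B))
rubin    <- det(Tmat) / det(W)

c(scott = scott, marriot = marriot, trcovw = trcovw,
  tracew = tracew, friedman = friedman, rubin = rubin)
```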

The third group consists of four indexes that do not belong to the previous groups and have nothing in common with each other.

cindex:

the C-Index is a cluster similarity measure. For a binary data set it is expressed as:
\([d_{(w)}-\min(d_{(w)})]/[\max(d_{(w)})-\min(d_{(w)})]\), where \(d_{(w)}\) is the sum of all \(n_{(d)}\) within cluster distances, \(\min(d_{(w)})\) is the sum of the \(n_{(d)}\) smallest pairwise distances in the data set, and \(\max (d_{(w)})\) is the sum of the \(n_{(d)}\) biggest pairwise distances. In order to compute the C-Index all pairwise distances in the data set have to be computed and stored. In the case of binary data, the storage of the distances is creating no problems since there are only a few possible distances. However, the computation of all distances can make this index prohibitive for large data sets.
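The C-Index formula can be sketched in a few lines of base R. This example assumes Euclidean distances from `dist`, since the distance measure is not stated above, and uses simulated binary data; it is an illustration of the formula, not the package's implementation.

```r
# Sketch: C-Index for a clustering of binary data (base R only).
set.seed(1)
x <- rbind(matrix(rbinom(150, 1, 0.2), ncol = 10),
           matrix(rbinom(150, 1, 0.8), ncol = 10))
cl <- kmeans(x, centers = 2, nstart = 5)$cluster

d     <- as.matrix(dist(x))     # all pairwise distances (assumed Euclidean)
pairs <- upper.tri(d)           # count each pair of points once
same  <- outer(cl, cl, "==")    # TRUE where both points share a cluster

d_all <- sort(d[pairs])         # all pairwise distances, ascending
n_d   <- sum(pairs & same)      # number of within-cluster pairs
d_w   <- sum(d[pairs & same])   # sum of within-cluster distances

min_dw <- sum(head(d_all, n_d)) # sum of the n_d smallest distances
max_dw <- sum(tail(d_all, n_d)) # sum of the n_d biggest distances

cindex <- (d_w - min_dw) / (max_dw - min_dw)
cindex
```

The full distance matrix grows quadratically with the number of observations, which is the cost the paragraph above warns about.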

db:

\(R=(1/n)\sum_{i} R_{(i)}\), where \(R_{(i)}\) stands for the maximum value of \(R_{(ij)}\) for \(i\neq j\), with \(R_{(ij)}=(SSW_{(i)}+SSW_{(j)})/DC_{(ij)}\), where \(DC_{(ij)}\) is the distance between the centers of two clusters \(i, j\).
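A possible base R sketch of this index follows. It makes two assumptions that are not confirmed by the text above: the average is taken over the clusters, as in the standard Davies-Bouldin definition, and the within-cluster term for cluster \(i\) is the mean Euclidean distance of its points to the cluster center (stored in the hypothetical vector `disp`).

```r
# Sketch: Davies-Bouldin style index from a base-R kmeans fit.
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 2, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2))
k <- 3
fit <- kmeans(x, centers = k, nstart = 5)

# Assumed dispersion per cluster: mean distance of its points to the center
disp <- sapply(seq_len(k), function(i) {
  pts <- x[fit$cluster == i, , drop = FALSE]
  mean(sqrt(rowSums(sweep(pts, 2, fit$centers[i, ])^2)))
})

dc <- as.matrix(dist(fit$centers))    # DC_(ij): distances between centers

r_ij <- outer(disp, disp, "+") / dc   # R_(ij); the diagonal divides by zero
diag(r_ij) <- -Inf                    # exclude i == j from the maximum
r_i <- apply(r_ij, 1, max)            # R_(i): worst case for each cluster
db  <- mean(r_i)                      # average over clusters (assumption)
db
```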

likelihood:

under the assumption of independence of the variables within a cluster, a cluster solution can be regarded as a mixture model for the data, where the cluster centers give the probabilities for each variable to be \(1\). Therefore, the negative Log-likelihood can be computed and used as a quantity measure for a cluster solution. Note that the assumptions for applying special penalty terms, like in AIC or BIC, are not fulfilled in this model, and also they show no effect for these data sets.
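One way to read this description is as a sum of independent Bernoulli terms, with each observation scored against the center of its own cluster. The sketch below follows that reading. It uses hard cluster assignments and ignores mixture weights, and it clips the probabilities with a hypothetical tolerance `eps` to avoid `log(0)`; all three choices are assumptions made for illustration, not the documented behaviour of `clustIndex`.

```r
# Sketch: negative log-likelihood of a binary cluster solution (base R only).
set.seed(1)
x <- rbind(matrix(rbinom(150, 1, 0.2), ncol = 10),
           matrix(rbinom(150, 1, 0.8), ncol = 10))
fit <- kmeans(x, centers = 2, nstart = 5)

# Cluster centers act as per-variable probabilities of observing a 1
eps <- 1e-6                                  # assumed clipping tolerance
p <- pmin(pmax(fit$centers, eps), 1 - eps)

# Probabilities for each observation, taken from its own cluster
p_obs <- p[fit$cluster, , drop = FALSE]

# Independent Bernoulli variables within each cluster
neg_loglik <- -sum(x * log(p_obs) + (1 - x) * log(1 - p_obs))
neg_loglik
```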

ssi:

this "Simple Structure Index" combines three elements which influence the interpretability of a solution, i.e., the maximum difference of each variable between the clusters, the sizes of the most contrasting clusters and the deviation of a variable in the cluster centers compared to its overall mean. These three elements are multiplicatively combined and normalized to give a value between \(0\) and \(1\).

References

Andreas Weingessel, Evgenia Dimitriadou and Sara Dolnicar, An Examination Of Indexes For Determining The Number Of Clusters In Binary Data Sets,
https://epub.wu.ac.at/1542/
and the references therein.

See Also

cclust, kmeans

Examples

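library(cclust)

# a 2-dimensional example: two Gaussian clusters of 50 points each
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
# k-means with 2 centers and at most 20 iterations
cl <- cclust(x, 2, 20, verbose = TRUE, method = "kmeans")
resultindexes <- clustIndex(cl, x, index = "all")
resultindexes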
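To illustrate how an index can help choose the number of clusters, the following sketch compares the "calinski" formula from the Details section across several values of k. It uses base R `kmeans` rather than `cclust` so that it runs without the package, and the rule of picking the k with the largest value is an assumption made here for illustration.

```r
# Sketch: comparing the calinski formula over candidate numbers of clusters.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
n  <- nrow(x)
ks <- 2:6

calinski <- sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 10)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
})
names(calinski) <- ks

calinski
best_k <- ks[which.max(calinski)]   # candidate with the largest index value
best_k
```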
