cascadeKM: K-means partitioning using a range of values of K

Description

This function is a wrapper for the kmeans function. It creates several partitions forming a cascade from a small to a large number of groups.

Usage

cascadeKM(data, inf.gr, sup.gr, iter = 100, criterion = "calinski")
cIndexKM(y, x, index = "all")
## S3 method for class 'cascadeKM':
plot(x, min.g, max.g, grpmts.plot = TRUE, 
     sortg = FALSE, gridcol = NA, ...)

Arguments

data

The data matrix. The objects (samples) are the rows.

inf.gr

The number of groups for the partition with the smallest number of groups of the cascade (min).

sup.gr

The number of groups for the partition with the largest number of groups of the cascade (max).

iter

The number of random starting configurations for each value of $K$.

criterion

The criterion that will be used to select the best partition. The default value is "calinski", which refers to the Calinski-Harabasz (1974) criterion. The simple structure index ("ssi") is also available. Other indice

y

Object of class "kmeans" returned by a clustering algorithm such as kmeans.

x

Data matrix where columns correspond to variables and rows to observations, or the plotting object in plot.

index

The available indices are: "calinski" and "ssi". Type "all" to obtain both indices. Abbreviations of these names are also accepted.

min.g, max.g

The minimum and maximum numbers of groups to be displayed.

grpmts.plot

Show the plot (TRUE or FALSE).

sortg

Sort the objects as a function of their group membership to produce a more easily interpretable graph. See Details. The original object names are kept; they are used as labels in the output table x, although not in the graph. I

gridcol

The colour of the grid lines in the plots. NA, which is the default value, removes the grid lines.

...

Other parameters to the functions (ignored).

Details

The function creates several partitions forming a cascade from a small to a large number of groups formed by kmeans. Most of the work is performed by function cIndexKM, which is based on the clustIndex function (package cclust). Some of the criteria were removed from this version because computation errors were generated when only one object was found in a group.

The default value is "calinski", which refers to the well-known Calinski-Harabasz (1974) criterion. The other available index is the simple structure index "ssi". In the case of groups of equal sizes, "calinski" is generally a good criterion to indicate the correct number of groups. Users should not take its indications literally when the groups are not equal in size. Type "all" to obtain both indices.

Value

Function cascadeKM returns an object of class cascadeKM with items:

partition

Table with the partitions found for different numbers of groups $K$, from $K$ = inf.gr to $K$ = sup.gr.

results

Values of the criterion to select the best partition.

criterion

The name of the criterion used.

size

The number of objects found in each group, for all partitions (columns).
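The cascade and the shape of the returned partition and size tables can be sketched with base R alone. This is a minimal illustration, not the cascadeKM implementation: the names fits, ks, partition and size are chosen here for the example, and using nstart in place of iter is an assumption about how the random starts map onto kmeans.

```r
# Base-R sketch of a K-means cascade; not the cascadeKM source code.
set.seed(1)                                  # reproducible random starts
mat <- matrix(runif(120), nrow = 30)         # 30 objects (rows) x 4 variables
ks  <- 2:5                                   # plays the role of inf.gr:sup.gr

# One kmeans fit per value of K; nstart stands in for iter (assumption).
fits <- lapply(ks, function(k) kmeans(mat, centers = k, nstart = 25))

# Membership table: one row per object, one column per value of K.
partition <- sapply(fits, function(f) f$cluster)
dimnames(partition) <- list(seq_len(nrow(mat)), paste(ks, "groups"))

# Group sizes for each partition, padded with NA up to the largest K.
size <- sapply(fits, function(f) c(f$size, rep(NA, max(ks) - length(f$size))))
colnames(size) <- colnames(partition)
```

Each column of partition holds one level of the cascade, so reading across a row shows how an object moves between groups as $K$ increases.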

Function cIndexKM returns a vector with the index values. The maximum value of these indices is supposed to indicate the best partition. These indices work best with groups of equal sizes. When the groups are not of equal sizes, one should not put too much faith in the maximum of these indices, and should also explore the groups corresponding to other values of $K$.
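As an illustration of selecting the partition at the maximum of an index, the Calinski-Harabasz criterion is commonly written as $CH(K) = \frac{SSB/(K-1)}{SSW/(n-K)}$, where $n$ is the number of objects, $SSB$ the between-group sum of squares and $SSW$ the pooled within-group sum of squares. The sketch below computes this textbook form directly from kmeans output. It assumes, without confirmation from this page, that this matches what cIndexKM reports; the names x, ks, ch and best_k are illustrative.

```r
# Calinski-Harabasz values over a range of K, from base-R kmeans output.
set.seed(42)

# Three well-separated groups of equal size (20 objects each, 2 variables).
x <- rbind(
  cbind(rnorm(20, 0, 0.3), rnorm(20, 0, 0.3)),
  cbind(rnorm(20, 5, 0.3), rnorm(20, 5, 0.3)),
  cbind(rnorm(20, 0, 0.3), rnorm(20, 5, 0.3))
)

ks <- 2:6
n  <- nrow(x)

# CH(K) = (SSB / (K - 1)) / (SSW / (n - K)) for each K in the cascade.
ch <- sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
})
names(ch) <- ks

# The value of K at which the criterion is largest.
best_k <- ks[which.max(ch)]
```

With equal-sized, well-separated groups such as these, the maximum should fall at the true number of groups. With unequal group sizes the maximum is less reliable, so the neighbouring values of $K$ deserve a look as well.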

References

Calinski, T. and J. Harabasz. 1974. A dendrite method for cluster analysis. Commun. Stat. 3: 1-27.

Gower, J. C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325-338.

Legendre, P. and L. Legendre. 1998. Numerical ecology, 2nd English edition. Elsevier Science BV, Amsterdam.

Milligan, G. W. and M. C. Cooper. 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159-179.

Weingessel, A., Dimitriadou, A. and Dolnicar, S. An examination of indexes for determining the number of clusters in binary data sets. http://www.wu-wien.ac.at/am/wp99.htm#29


See Also

kmeans, clustIndex.

Examples

# Partitioning a (10 x 10) data matrix of random numbers
mat <- matrix(runif(100),10,10)
res <- cascadeKM(mat, 2, 5, iter = 25, criterion = 'calinski')
toto <- plot(res)

# Partitioning an autocorrelated time series
vec <- sort(matrix(runif(30),30,1))
res <- cascadeKM(vec, 2, 5, iter = 25, criterion = 'calinski')
toto <- plot(res)

# Partitioning a large autocorrelated time series
# Note that we remove the grid lines
vec <- sort(matrix(runif(1000),1000,1))
res <- cascadeKM(vec, 2, 7, iter = 10, criterion = 'calinski')
toto <- plot(res, gridcol=NA)