cascadeKM: K-means partitioning using a range of values of K

Description

This function is a wrapper for the kmeans function. It creates several partitions forming a cascade from a small to a large number of groups.

Usage

cascadeKM(data, inf.gr, sup.gr, iter = 100, criterion = "calinski")
cIndexKM (y, x, index = "all")
## S3 method for class 'cascadeKM':
plot(x, min.g, max.g, grpmts.plot = TRUE, 
    sortg = FALSE, gridcol = NA, ...)

Arguments

data

The data matrix. The objects are the rows.

inf.gr

The number of groups for the partition with the smallest number of groups of the cascade (min).

sup.gr

The number of groups for the partition with the largest number of groups of the cascade (max).

iter

The number of random starting configurations for each value of $K$.

criterion

The criterion that will be used to select the best partition. The default value is "calinski", which refers to the Calinski-Harabasz (1974) criterion. The simple structure index, "ssi", is also available.

y

Object of class "kmeans" returned by a clustering algorithm such as kmeans.

x

Data matrix where columns correspond to variables and rows to observations, or the plotting object in plot.

index

The available indices are: "calinski" and "ssi". Type "all" to obtain both indices. Abbreviations of these names are also accepted.

min.g, max.g

The minimum and maximum numbers of groups to be displayed.

grpmts.plot

Show the plot (TRUE or FALSE).

sortg

Sort the objects as a function of their group membership to produce a more easily interpretable graph. See Details. The original object names are kept; they are used as labels in the output table x, although not in the graph.

gridcol

The colour of the grid lines in the plots. NA, which is the default value, removes the grid lines.

...

Other parameters passed to the functions (ignored).

Value

Function cascadeKM returns an object of class cascadeKM with items:

partition

Table with the partitions found for different numbers of groups $K$, from $K$ = inf.gr to $K$ = sup.gr.

results

Values of the criterion to select the best partition.

criterion

The name of the criterion used.

size

The number of objects found in each group, for all partitions (columns).

Function cIndexKM returns a vector with the index values. The maximum value of these indices is supposed to indicate the best partition. These indices work best with groups of equal sizes. When the groups are not of equal sizes, one should not put too much faith in the maximum of these indices, and should also explore the groups corresponding to other values of $K$.
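
As an illustration of the items listed above, the following sketch inspects a cascadeKM result and then calls cIndexKM on a single kmeans fit. It assumes the vegan package is installed, and it assumes that the results table carries the criterion name as a row name; the variable names (crit, ks, best_k, km, idx) and the seed are only for illustration.

```r
# Sketch only: assumes the vegan package is installed.
library(vegan)

set.seed(1)  # arbitrary seed so the random starts are repeatable
mat <- matrix(runif(100), 10, 10)
res <- cascadeKM(mat, inf.gr = 2, sup.gr = 5, iter = 25, criterion = "calinski")

dim(res$partition)  # one row per object, one column per value of K
res$criterion       # name of the criterion used
res$size            # group sizes, one column per partition

# Criterion values across the cascade, looked up by criterion name
# (assumption: the criterion name is a row name of the results table).
crit <- res$results[res$criterion, ]
ks <- 2:5
best_k <- ks[which.max(crit)]

# cIndexKM applied directly to a single kmeans fit
km <- kmeans(mat, centers = best_k, nstart = 25)
idx <- cIndexKM(km, mat, index = "calinski")
```

As the text above cautions, the value of best_k is only a suggestion; with groups of unequal sizes it is worth examining the partitions for neighbouring values of $K$ as well.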

Details

The function creates several partitions forming a cascade from a small to a large number of groups formed by kmeans. Most of the work is performed by function cIndexKM, which is based on the clustIndex function. Some of the criteria were removed from this version because computation errors were generated when only one object was found in a group.

The default value is "calinski", which refers to the well-known Calinski-Harabasz (1974) criterion. The other available index is the simple structure index "ssi". In the case of groups of equal sizes, "calinski" is generally a good criterion to indicate the correct number of groups. Users should not take its indications literally when the groups are not equal in size. Type "all" to obtain both indices.
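
To make the Calinski-Harabasz criterion concrete, here is a minimal sketch that computes it over a small cascade using only base R. It relies on the standard textbook form of the index, $CH = \frac{SSB/(K-1)}{SSW/(n-K)}$, where $SSB$ and $SSW$ are the between-group and within-group sums of squares and $n$ is the number of objects. This is not the vegan implementation; the helper calinski_for_k, the simulated data and the seed are assumptions made for the example.

```r
# Minimal sketch using only base R (stats::kmeans); not the vegan code.
set.seed(42)  # arbitrary seed, for reproducibility only

# Two well-separated groups of 30 objects each, in two dimensions
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))

# Hypothetical helper: Calinski-Harabasz index of the best of 'nstart' kmeans runs
calinski_for_k <- function(x, k, nstart = 25) {
  fit <- kmeans(x, centers = k, nstart = nstart)
  ssb <- fit$betweenss      # between-group sum of squares
  ssw <- fit$tot.withinss   # pooled within-group sum of squares
  (ssb / (k - 1)) / (ssw / (nrow(x) - k))
}

# Evaluate the index for each K of the cascade and keep the K with the maximum
ks <- 2:5
ch <- sapply(ks, function(k) calinski_for_k(x, k))
names(ch) <- ks
best_k <- ks[which.max(ch)]
```

Because the two simulated groups are of equal size, this is the favourable case described above; with unequal group sizes the maximum is a less reliable guide.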

References

Calinski, T. and J. Harabasz. 1974. A dendrite method for cluster analysis. Commun. Stat. 3: 1-27.

Gower, J. C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325-338.

Legendre, P. & L. Legendre. 1998. Numerical ecology, 2nd English edition. Elsevier Science BV, Amsterdam.

Milligan, G. W. & M. C. Cooper. 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159-179.

Weingessel, A., Dimitriadou, A. and Dolnicar, S. An Examination Of Indexes For Determining The Number Of Clusters In Binary Data Sets, http://www.wu-wien.ac.at/am/wp99.htm#29

Examples

# Partitioning a (10 x 10) data matrix of random numbers
 mat <- matrix(runif(100),10,10)
 res <- cascadeKM(mat, 2, 5, iter = 25, criterion = 'calinski') 
 toto <- plot(res)
 
 # Partitioning an autocorrelated time series
 vec <- sort(matrix(runif(30),30,1))
 res <- cascadeKM(vec, 2, 5, iter = 25, criterion = 'calinski')
 toto <- plot(res)
 
 # Partitioning a large autocorrelated time series
 # Note that we remove the grid lines
 vec <- sort(matrix(runif(1000),1000,1))
 res <- cascadeKM(vec, 2, 7, iter = 10, criterion = 'calinski')
 toto <- plot(res, gridcol=NA)
