cstab (version 0.2)

cDistance: Selection of number of clusters via distance-based measures

Description

Selects the number of clusters using the gap statistic, the jump statistic, and the slope statistic.

Usage

cDistance(data, kseq, method = "kmeans", linkage = "complete",
  kmIter = 10, gapIter = 10)

Arguments

data

an n x p numeric data matrix.

kseq

a vector of candidate numbers of clusters k, each with k > 1.

method

character string indicating the clustering algorithm. 'kmeans' for the k-means algorithm, 'hierarchical' for hierarchical clustering.

linkage

character string specifying the linkage criterion, used when method='hierarchical'. The available options are "single", "complete", "average", "mcquitty", "ward.D", "ward.D2", "centroid" or "median". See hclust.

kmIter

integer specifying the number of restarts of the k-means algorithm, used to avoid local minima.

gapIter

integer specifying the number of simulated datasets to compute the gap statistic (see Tibshirani et al., 2001).
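
To illustrate what kmIter and gapIter control, the following is a minimal sketch of the gap statistic in base R, following the description in Tibshirani et al. (2001). It is not the code used inside cstab: gap_sketch, n_ref and n_start are hypothetical names, with n_ref playing the role of gapIter and n_start the role of kmIter, and the reference data are assumed to be drawn uniformly over the range of each column.

```r
# Sketch of the gap statistic using base R only.
# Assumption: gap_sketch, n_ref and n_start are illustrative names,
# not part of cstab.
gap_sketch <- function(x, kseq, n_ref = 10, n_start = 10) {
  # log of the pooled within-cluster sum of squares for k clusters;
  # n_start random restarts guard against poor local minima
  log_wk <- function(m, k) {
    fit <- kmeans(m, centers = k, nstart = n_start, iter.max = 50)
    log(fit$tot.withinss)
  }
  # one reference dataset: uniform over the range of each column of x
  ref_data <- function() {
    apply(x, 2, function(col) runif(nrow(x), min(col), max(col)))
  }
  obs <- sapply(kseq, function(k) log_wk(x, k))
  # rows index k, columns index the n_ref simulated reference datasets
  ref <- replicate(n_ref, {
    r <- ref_data()
    sapply(kseq, function(k) log_wk(r, k))
  })
  ref <- matrix(ref, nrow = length(kseq))
  # gap(k) = mean reference log(W_k) minus observed log(W_k)
  data.frame(k = kseq, gap = rowMeans(ref) - obs)
}

# Four well-separated Gaussian clusters in two dimensions
set.seed(1)
s <- 0.1
n <- 50
x <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
           cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
           cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
           cbind(rnorm(n, 1, s), rnorm(n, 0, s)))
res <- gap_sketch(x, kseq = 2:6)
print(res)
```

A larger n_ref gives a more stable estimate of the reference term at the cost of more k-means runs. The sketch only returns the gap values; the selection rule in the original paper additionally uses the standard error of the simulated reference values.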

Value

A list with the optimal number of clusters as determined by the gap statistic (Tibshirani et al., 2001), the jump statistic (Sugar & James, 2003), and the slope statistic (Fujita et al., 2014). The function also returns the gap, jump, and slope values for each k in kseq.

References

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.

Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98(463), 750-763.

Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.

Examples

  library(cstab)

  # Generate data from a Gaussian mixture with four components
  set.seed(1)  # added for reproducibility; any seed works
  s <- 0.1
  n <- 50
  data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
                cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
                cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
                cbind(rnorm(n, 1, s), rnorm(n, 0, s)))
  plot(data)

  # Select the number of clusters using distance-based measures
  cDistance(data, kseq = 2:10)
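
The same kind of call can be made with hierarchical clustering by setting the method and linkage arguments listed under Usage. The sketch below regenerates the four-cluster data so that it runs on its own, and it skips the call when cstab is not installed. The element names of the returned list are not given on this page, so str() is used to inspect the result rather than assuming any particular names.

```r
# Four Gaussian clusters, as in the example data
set.seed(1)
s <- 0.1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))

# Only run the cstab call if the package is available
if (requireNamespace("cstab", quietly = TRUE)) {
  res <- cstab::cDistance(data, kseq = 2:10,
                          method = "hierarchical", linkage = "average")
  # Inspect the structure of the returned list
  str(res)
}
```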
 
