gap: Perform gap analysis

Description

Performs the gap analysis using lga to estimate the number of clusters.

Usage

gap(x, K, B, criteria=c("tibshirani", "DandF", "none"), nnode=NULL, scale=TRUE, ...)

Arguments

x

a numeric matrix.

K

an integer giving the maximum number of clusters to consider.

B

an integer giving the number of bootstraps.

criteria

a character string indicating which criterion to use when evaluating the gap data. One of "tibshirani" (default), "DandF" or "none". Can be abbreviated.

nnode

an integer giving the number of CPUs to use for parallel processing. Defaults to NULL, i.e. no parallel processing.

scale

logical. Should the data be scaled?

...

For any other arguments passed from the generic function.

Value

finished: a logical. With the "tibshirani" criteria, indicates whether a solution was found.
nclust: an integer giving the estimated number of clusters. NA if nothing conclusive is found.
data: the original data set, scaled if specified in the arguments.
criteria: the criteria used.

Details

This code performs the gap analysis using lga. The gap statistic is defined as the difference between the log of the Residual Orthogonal Sum of Squared Distances (denoted $\log(W_k)$) and its expected value derived using bootstrapping under the null hypothesis that there is only one cluster. In this implementation, the reference distribution used for the bootstrapping is a random uniform hypercube, transformed by the principal components of the underlying data set. For further details see Tibshirani et al. (2001).
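As a rough illustration of the reference distribution described above, the following base R sketch draws one bootstrap reference data set: it rotates the centred data onto its principal axes, samples uniformly over the bounding box in those coordinates, and rotates back. The helper name `reference_sample` is hypothetical, and the construction follows the general recipe in Tibshirani et al. (2001); it is not taken from the lga source and may differ from it in detail.

```r
# Hypothetical helper (not part of lga): draw one reference data set of the
# same shape as x, uniform over a box aligned with x's principal axes.
reference_sample <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  centre <- colMeans(x)
  xc <- sweep(x, 2, centre)                 # centre each column
  V <- svd(xc, nu = 0, nv = p)$v            # principal axes (right singular vectors)
  xrot <- xc %*% V                          # data in principal-component coordinates
  lo <- apply(xrot, 2, min)
  hi <- apply(xrot, 2, max)
  u <- matrix(runif(n * p), n, p)           # uniform on the unit hypercube
  zrot <- sweep(sweep(u, 2, hi - lo, "*"), 2, lo, "+")  # stretch to the bounding box
  z <- zrot %*% t(V)                        # rotate back to the original axes
  sweep(z, 2, centre, "+")                  # restore the column means
}

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
z <- reference_sample(x)
dim(z)
```

In a full gap analysis, a data set like `z` would be generated B times and clustered for each k to estimate the expected value of $\log(W_k)$ under the null hypothesis.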

For different criteria, different rules apply. With "tibshirani" (ibid) we calculate the gap statistic for $k = 1, \ldots, K$, stopping when $$\mbox{gap}(k) \ge \mbox{gap}(k+1) - s_{k+1}$$ where $s_{k+1}$ is a function of the standard deviation of the bootstrapped estimates. With the "DandF" criteria from Dudoit et al. (2002), we calculate the gap statistic for all values of $k = 1, \ldots, K$, selecting the number of clusters as $$\hat{k} = \mbox{ smallest } k \ge 1 \mbox{ such that gap}(k) \ge \mbox{gap}(k^*) - s_{k^*}$$ where $k^* = \arg\max_{k \ge 1} \mbox{gap}(k)$. Finally, for the criteria "none", no rules are applied, and just the gap data is returned. As lga is ostensibly unsupervised in this case, the parameter niter is set to 20 to ensure convergence.
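The two selection rules can be sketched in base R as follows. This is a minimal illustration that assumes the gap values and the $s_k$ terms have already been computed; the function name `select_k` and the input vectors are hypothetical and are not part of the lga API.

```r
# Hypothetical helper (not part of lga): apply a selection rule to
# precomputed gap values gap[1..K] and spread terms s[1..K].
select_k <- function(gap, s, criteria = c("tibshirani", "DandF")) {
  criteria <- match.arg(criteria)
  K <- length(gap)
  if (criteria == "tibshirani") {
    # Stop at the first k with gap(k) >= gap(k+1) - s_{k+1}
    for (k in seq_len(K - 1)) {
      if (gap[k] >= gap[k + 1] - s[k + 1]) return(k)
    }
    return(NA_integer_)   # no k <= K - 1 satisfied the rule
  }
  # "DandF": smallest k within s_{k*} of the maximum gap
  kstar <- which.max(gap)
  which(gap >= gap[kstar] - s[kstar])[1]
}

# Made-up numbers, chosen so that the two rules disagree
gap_values <- c(0.30, 0.32, 0.60, 0.55)
s_values   <- c(0.03, 0.04, 0.06, 0.05)
k_tib <- select_k(gap_values, s_values, "tibshirani")  # 1
k_daf <- select_k(gap_values, s_values, "DandF")       # 3
```

The "tibshirani" rule is sequential and can stop early at a local plateau, whereas the "DandF" rule looks at all K gap values before choosing, which is why the two can return different answers on the same gap data.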

This function is parallel computing aware via the nnode argument, and works with the package snow. In order to use parallel computing, one of MPI (e.g. lamboot) or PVM is necessary. For further details, see the documentation for snow.

References

Tibshirani, R. and Walther, G. and Hastie, T. (2001) 'Estimating the number of clusters in a data set via the gap statistic', J. R. Statist. Soc. B 63, 411-423.

Dudoit, S. and Fridlyand, J. (2002) 'A prediction-based resampling method for estimating the number of clusters in a dataset', Genome Biology 3.

Van Aelst, S. and Wang, X. and Zamar, R. and Zhu, R. (2006) 'Linear Grouping Using Orthogonal Regression', Computational Statistics & Data Analysis 50, 1287-1312.

Examples

## Synthetic example
## Make a dataset with 2 clusters in 2 dimensions

library(lga)   # provides gap() and lga()
library(MASS)  # provides mvrnorm()
set.seed(1234)
X <- rbind(mvrnorm(n=100, mu=c(1, -2), Sigma=diag(0.1, 2) + 0.9),
           mvrnorm(n=100, mu=c(1, 1), Sigma=diag(0.1, 2) + 0.9))

gap(X, K=4, B=20)

## to run this using parallel processing with 4 nodes, the equivalent
## code would be

## Not run: gap(X, K=4, B=20, nnode=4)


## Quakes data (from package:datasets)
## Including the first two dimensions versus three dimensions
## yields different results

set.seed(1234)
## Not run: 
# gap(quakes[,1:2], K=4, B=20)
# gap(quakes[,1:3], K=4, B=20)
# ## End(Not run)

library(maps)
lgaout1 <- lga(quakes[,1:2], k=3)
plot(lgaout1)

lgaout2 <- lga(quakes[,1:3], k=2)
plot(lgaout2)

## Let's put this in context
par(mfrow=c(1,2))
map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
points(quakes[,2], quakes[,1], pch=lgaout1$cluster, col=lgaout1$cluster)

map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
points(quakes[,2], quakes[,1], pch=lgaout2$cluster, col=lgaout2$cluster)
