Selection of the number of clusters via the bootstrap, as explained in Fang and Wang (2012). Pairs of bootstrap samples are drawn repeatedly from the data, and the number of clusters is chosen by optimising an instability estimate computed from these pairs.
In principle, all clustering methods that have a CBI-wrapper can be used, see clusterboot, kmeansCBI. However, the currently implemented classification methods are not necessarily suitable for all of them, see argument classification.
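The idea can be sketched in base R as follows. This is an illustration under simplifying assumptions (k-means clustering, nearest-centroid classification, and instability measured as the share of point pairs on which the two clusterings disagree about co-membership), not the code used inside nselectboot. The helper names nearest_centroid and pair_instability are made up for this example.

```r
# Illustrative sketch only; not the fpc implementation.
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))

# Assign every row of x to its nearest centroid (squared Euclidean distance).
nearest_centroid <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)),
              function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  max.col(-d, ties.method = "first")
}

# Instability from one pair of bootstrap samples for a given k.
pair_instability <- function(x, k) {
  n <- nrow(x)
  labels <- lapply(1:2, function(i) {
    boot <- x[sample(n, n, replace = TRUE), , drop = FALSE]
    fit <- kmeans(boot, centers = k, nstart = 5)
    # Classify all original points using the clustering of the bootstrap sample.
    nearest_centroid(x, fit$centers)
  })
  same1 <- outer(labels[[1]], labels[[1]], "==")
  same2 <- outer(labels[[2]], labels[[2]], "==")
  # Share of point pairs whose co-membership differs between the two clusterings.
  mean(same1[upper.tri(same1)] != same2[upper.tri(same2)])
}

# Average over 10 pairs of bootstrap samples for each candidate k.
inst <- sapply(2:4, function(k) mean(replicate(10, pair_instability(x, k))))
names(inst) <- 2:4
inst
```

In this sketch, the number of clusters with the smallest average instability would be the one selected.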
nselectboot(data,B=50,distances=inherits(data,"dist"),
clustermethod=NULL,
classification="averagedist",centroidname = NULL,
krange=2:10, count=FALSE,nnk=1,
largeisgood=FALSE,...)
nselectboot returns a list with components kopt, stabk, stab.
kopt: optimal number of clusters.
stabk: mean instability values for numbers of clusters (or one minus this if largeisgood=TRUE).
stab: matrix of instability values for all bootstrap runs and numbers of clusters.
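As a small usage sketch, the returned components can be inspected as shown here. It assumes the fpc package is installed and reuses the simulated data from the examples on this page; B=2 is far too small for real use.

```r
# Runs only if fpc is available.
if (requireNamespace("fpc", quietly = TRUE)) {
  library(fpc)
  set.seed(20000)
  face <- rFace(50, dMoNo = 2, dNoEy = 0, p = 2)
  res <- nselectboot(face, B = 2, clustermethod = kmeansCBI,
                     classification = "centroid", krange = 5:7)
  print(res$kopt)      # selected number of clusters
  print(res$stabk)     # mean instability per number of clusters
  print(dim(res$stab)) # bootstrap runs by numbers of clusters
}
```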
data: something that can be coerced into a matrix. The data matrix, either an n*p data matrix (or data frame) or an n*n dissimilarity matrix (or dist-object).
B: integer. Number of resampling runs.
distances: logical. If TRUE, the data is interpreted as a dissimilarity matrix. If data is a dist-object, distances=TRUE automatically, otherwise distances=FALSE by default. This means that you have to set it to TRUE manually if data is a dissimilarity matrix that is not a dist-object.
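The default distances=inherits(data,"dist") from the usage above can be checked directly in base R. The following short sketch (with made-up toy data) shows why a dissimilarity matrix stored as a plain matrix is not detected automatically.

```r
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)  # toy data, 10 points in 2 dimensions
d <- dist(x)                      # dist-object
dm <- as.matrix(d)                # same dissimilarities as a plain 10 x 10 matrix

inherits(d, "dist")   # TRUE: distances defaults to TRUE
inherits(dm, "dist")  # FALSE: distances defaults to FALSE
```

With dm as input, distances=TRUE would therefore have to be passed explicitly.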
clustermethod: an interface function (the function name, not a string containing the name, has to be provided!). This defines the clustering method. See the "Details"-section of clusterboot and kmeansCBI for the format. Clustering methods for nselectboot must have a k-argument for the number of clusters and must otherwise follow the specifications in clusterboot. Note that nselectboot won't work with CBI-functions that implicitly already estimate the number of clusters such as pamkCBI; use claraCBI if you want to run it for pam/clara clustering.
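For illustration, a user-written interface function with a k-argument might look like the following sketch. The name wardCBI is hypothetical and not part of fpc, and the list components (result, nc, clusterlist, partition, clustermethod) are an assumption based on the format described in the "Details"-section of clusterboot, which should be checked before relying on this. The sketch handles raw data matrices only, not dissimilarities.

```r
# Hypothetical interface function (not part of fpc): Ward clustering via
# base R's hclust, for raw data matrices only.
wardCBI <- function(data, k, ...) {
  fit <- hclust(dist(data), method = "ward.D2")
  partition <- cutree(fit, k = k)
  list(result = fit,                 # output of the clustering function
       nc = k,                       # number of clusters
       clusterlist = lapply(seq_len(k), function(j) partition == j),
       partition = partition,        # integer vector of cluster numbers
       clustermethod = "hclust/ward.D2")
}

# Stand-alone check on toy data.
set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
out <- wardCBI(x, k = 2)
table(out$partition)
```

Such a function would be passed as clustermethod=wardCBI, that is, as the function itself rather than as a string.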
classification: string. This determines how non-clustered points are classified to given clusters. Options are explained in classifdist (if distances=TRUE) and classifnp (otherwise). Certain classification methods are connected to certain clustering methods. classification="averagedist" is recommended for average linkage, classification="centroid" is recommended for k-means, clara and pam (with distances it will work with claraCBI only), classification="knn" with nnk=1 is recommended for single linkage and classification="qda" is recommended for Gaussian mixtures with flexible covariance matrices.
centroidname: string. Indicates the name of the component of CBIoutput$result that contains the cluster centroids in case of classification="centroid", where CBIoutput is the output object of clustermethod. If clustermethod is kmeansCBI or claraCBI, centroids are recognised automatically if centroidname=NULL. If centroidname=NULL and distances=FALSE, cluster means are computed as the cluster centroids.
krange: integer vector; numbers of clusters to be tried.
count: logical. If TRUE, numbers of clusters and bootstrap runs are printed.
nnk: number of nearest neighbours if classification="knn", see classifdist (if distances=TRUE) and classifnp (otherwise).
largeisgood: logical. If TRUE, output component stabk is taken as one minus the original instability value so that larger values of stabk are better.
...: arguments to be passed on to the clustering method.
Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/
Fang, Y. and Wang, J. (2012) Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477.
classifdist, classifnp, clusterboot, kmeansCBI
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
nselectboot(dist(face),B=2,clustermethod=disthclustCBI,
method="average",krange=5:7)
nselectboot(dist(face),B=2,clustermethod=claraCBI,
classification="centroid",krange=5:7)
nselectboot(face,B=2,clustermethod=kmeansCBI,
classification="centroid",krange=5:7)
# Of course use larger B in a real application.