This calls the function pam or clara to perform a partitioning around medoids clustering with the number of clusters estimated by optimum average silhouette width (see pam.object) or Calinski-Harabasz index (calinhara). The Duda-Hart test (dudahart2) is applied to decide whether there should be more than one cluster (unless 1 is excluded as number of clusters or data are dissimilarities).
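The selection by average silhouette width can be sketched by hand with pam from the cluster package. This is only an illustration of the idea, not pamk's actual implementation; the iris measurements and the range 2:6 are arbitrary choices made for the example.

```r
library(cluster)  # pam() comes with the recommended package 'cluster'

x <- as.matrix(iris[, 1:4])   # example data, chosen only for illustration
krange <- 2:6                 # candidate numbers of clusters (assumed range)

# Average silhouette width of the pam solution for each candidate k
asw <- sapply(krange, function(k) pam(x, k)$silinfo$avg.width)
names(asw) <- krange

# The estimated number of clusters is the k with the largest width
best_k <- krange[which.max(asw)]
best_k
```

The value 1 never appears in such a loop because the silhouette width is not defined for a single cluster, which is why a separate test is needed for that case.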
pamk(data, krange=2:10, criterion="asw", usepam=TRUE,
     scaling=FALSE, alpha=0.001, diss=inherits(data, "dist"),
     critout=FALSE, ns=10, seed=NULL, ...)
A list with components
pamobject: the output of the optimal run of the pam-function.
nc: the optimal number of clusters.
crit: vector of criterion values for numbers of clusters. crit[1] is the p-value of the Duda-Hart test if 1 is in krange and diss=FALSE.
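A short sketch of how the returned list might be inspected. The iris data and the range 2:5 are assumptions made for the example and do not come from this page.

```r
library(fpc)

x <- as.matrix(iris[, 1:4])        # example data (assumption)
res <- pamk(x, krange = 2:5)       # default criterion "asw"

res$nc                             # estimated number of clusters
res$crit                           # criterion values for the numbers of clusters
table(res$pamobject$clustering)    # cluster sizes of the selected pam fit
res$pamobject$medoids              # medoids of the selected pam fit
```

Since pamobject is the ordinary pam output, anything documented in pam.object (for example silinfo) should be available on it as well.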
data: a data matrix or data frame or something that can be coerced into a matrix, or a dissimilarity matrix or object. See pam for more information.
krange: integer vector. Numbers of clusters which are to be compared by the selected criterion. Note: average silhouette width and Calinski-Harabasz cannot estimate the number of clusters nc=1. If 1 is included, a Duda-Hart test is applied and 1 is estimated if this is not significant.
criterion: one of "asw", "multiasw" or "ch". Determines whether average silhouette width (as given out by pam/clara, or as computed by distcritmulti if "multiasw" is specified; recommended for large data sets with usepam=FALSE) or Calinski-Harabasz is applied. Note that the original Calinski-Harabasz index is not defined for dissimilarities; if dissimilarity data is run with criterion="ch", the dissimilarity-based generalisation in Hennig and Liao (2013) is used.
usepam: logical. If TRUE, pam is used, otherwise clara (recommended for large datasets with 2,000 or more observations; dissimilarity matrices cannot be used with clara).
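A minimal sketch of the clara route on a larger data set. The simulated two-group data, the seed and the range 2:4 are assumptions made for illustration only.

```r
library(fpc)

set.seed(42)
# 3000 two-dimensional observations in two shifted groups (simulated)
x <- rbind(matrix(rnorm(3000, mean = 0), ncol = 2),
           matrix(rnorm(3000, mean = 6), ncol = 2))

# usepam = FALSE switches from pam to clara; x must be a data matrix
res <- pamk(x, krange = 2:4, usepam = FALSE)

res$nc                  # estimated number of clusters
class(res$pamobject)    # the selected fit is a clara object here
```

clara works on subsamples of the observations, so it avoids the full dissimilarity matrix that pam needs, which is the reason it suits larger data sets.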
scaling: either a logical value or a numeric vector of length equal to the number of variables. If scaling is a numeric vector with length equal to the number of variables, then each variable is divided by the corresponding value from scaling. If scaling is TRUE then scaling is done by dividing the (centered) variables by their root-mean-square, and if scaling is FALSE, no scaling is done.
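The scaling cases can be illustrated with base R's scale(), whose scale argument follows the same convention. That pamk passes scaling on to scale() internally is an assumption here, suggested by the matching wording; the toy matrix is hypothetical.

```r
x <- cbind(a = c(1, 2, 3, 4), b = c(10, 30, 50, 70))   # toy data (hypothetical)

# Logical TRUE: centred columns are divided by their root-mean-square,
# which for centred data equals the standard deviation
s_true <- scale(x, scale = TRUE)
apply(s_true, 2, sd)

# Numeric vector: each (centred) column is divided by the given value
s_num <- scale(x, scale = c(2, 20))
s_num[, "b"]

# Logical FALSE: no scaling is done, the data are used as they are
```

Centring shifts all observations by the same amount, so it leaves the distances between them, and hence the clustering, unchanged; only the division affects the result.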
alpha: numeric between 0 and 1, tuning constant for dudahart2 (only used for 1-cluster test).
diss: logical flag: if TRUE (default for dist or dissimilarity-objects), then data will be considered as a dissimilarity matrix (and the potential number of clusters 1 will be ignored). If FALSE, then data will be considered as a matrix of observations by variables.
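A brief sketch of the dissimilarity route. The iris data, Euclidean distances and the range 2:5 are assumptions made for the example.

```r
library(fpc)

x <- as.matrix(iris[, 1:4])    # example data (assumption)
d <- dist(x)                   # Euclidean distances, an object of class "dist"

inherits(d, "dist")            # TRUE, so diss defaults to TRUE for this input
res <- pamk(d, krange = 2:5)   # data are treated as dissimilarities
res$nc                         # estimated number of clusters
```

Any other dissimilarity, for example one computed for mixed-type variables, could presumably be supplied the same way as long as it is a dist or dissimilarity object.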
critout: logical. If TRUE, the criterion value is printed out for every number of clusters.
ns: passed on to distcritmulti if criterion="multiasw".
seed: passed on to distcritmulti if criterion="multiasw".
Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/
Calinski, T. and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.
Duda, R. O. and Hart, P. E. (1973) Pattern Classification and Scene Analysis. Wiley, New York.
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C (Applied Statistics), 62, 309-369.
Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
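library(fpc)  # provides pamk and rFace
options(digits=3)
set.seed(20000)
# Simulate a small two-dimensional "face" benchmark data set
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
# Average silhouette width; critout=TRUE prints the criterion for each k
pk1 <- pamk(face,krange=1:5,criterion="asw",critout=TRUE)
# Average silhouette width computed by distcritmulti, here with ns=2
pk2 <- pamk(face,krange=1:5,criterion="multiasw",ns=2,critout=TRUE)
# "multiasw" is better for larger data sets, use larger ns then.
# Calinski-Harabasz index
pk3 <- pamk(face,krange=1:5,criterion="ch",critout=TRUE)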