This runs the methodology explained in Hennig (2017). It runs a
user-specified set of clustering methods (CBI-functions, see
kmeansCBI) with several numbers of clusters on a dataset,
and computes many cluster validation indexes. In order to explore the
variation of these indexes, random clusterings on the data are
generated, and validation indexes are standardised by use of the
random clusterings in order to make them comparable and differences
between values interpretable.
The function print.valstat can be used to provide weights for the
cluster validation statistics, and will then compute a weighted
validation index that can be used to compare all clusterings.
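The idea of a weighted validation index can be sketched in base R. This is only an illustration, assuming the aggregate is a weighted mean of standardised statistics; the statistic names, clustering names and values below are made up, and the exact computation done by print.valstat may differ.

```r
# Hypothetical standardised values: rows are clusterings, columns are
# validation statistics (all names and numbers are invented).
zstats <- rbind(
  kmeans_2   = c(sep = 0.8, homog = 1.2, stab = -0.1),
  complete_2 = c(sep = 1.5, homog = 0.3, stab =  0.4),
  average_3  = c(sep = 0.2, homog = 0.9, stab =  1.1)
)

# User-chosen weights, one per statistic; a weight of 0 drops a statistic.
weights <- c(sep = 1, homog = 1, stab = 0)

# Weighted mean over the statistics for every clustering.
aggregate_index <- as.vector(zstats %*% weights) / sum(weights)
names(aggregate_index) <- rownames(zstats)

# Clustering with the largest aggregated value.
best <- names(which.max(aggregate_index))
```

Because all statistics are standardised to a common scale first, the weights express only how much the user cares about each characteristic.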
clusterbenchstats(data,G,diss = inherits(data, "dist"),
scaling=TRUE, clustermethod,
methodnames=clustermethod,
distmethod=rep(TRUE,length(clustermethod)),
ncinput=rep(TRUE,length(clustermethod)),
clustermethodpars,
npstats=FALSE,
trace=TRUE,
pamcrit=TRUE,snnk=2,
dnnk=2,
nnruns=100,kmruns=100,
multicore=FALSE,cores=detectCores()-1,
useallmethods=TRUE,
useallg=FALSE,...)
# S3 method for clusterbenchstats
print(x,...)
data: data matrix or dist-object.
G: vector of integers. Numbers of clusters to consider.
diss: logical. If TRUE, the data matrix is assumed to be a
distance/dissimilarity matrix, otherwise it's observations times
variables.
scaling: either a logical or a numeric vector of length equal to
the number of columns of data. If FALSE, data won't be scaled,
otherwise scaling is passed on to scale as argument scale.
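As a small base R sketch of what the scale argument of scale accepts (a logical or one divisor per column). The toy matrix is invented, and whether clusterbenchstats also centres the data is not stated here, so this only illustrates scale itself.

```r
# Toy data: two variables on very different scales.
x <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# scale = TRUE: each (centred) column is divided by its standard deviation.
scaled_sd <- scale(x, scale = TRUE)

# Numeric scale: one user-chosen divisor per column.
scaled_custom <- scale(x, scale = c(1, 10))

# Column standard deviations after standardisation.
apply(scaled_sd, 2, sd)
```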
clustermethod: vector of strings specifying names of CBI-functions
(see kmeansCBI). These are the clustering methods to be applied.
methodnames: vector of strings with user-chosen names for
clustering methods, one for every method in clustermethod. These can
be used to distinguish different methods run by the same CBI-function
but with different parameter values such as complete and average
linkage for hclustCBI.
distmethod: vector of logicals, of the same length as
clustermethod. TRUE means that the clustering method operates on
distances. If diss=TRUE, all entries have to be TRUE. Otherwise, if
an entry is TRUE, the corresponding method will be applied on
dist(data).
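The relation between a data matrix, dist(data) and the default diss = inherits(data, "dist") can be checked in base R. The three points below are made up for illustration.

```r
# Three points in the plane (toy data).
x <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)

# Euclidean distances between the rows, stored as a "dist" object.
d <- dist(x)

# This is the test used for the default value of diss.
is_diss_matrix <- inherits(x, "dist")
is_diss_dist   <- inherits(d, "dist")

# Full symmetric matrix view of the distances.
as.matrix(d)
```

A plain matrix therefore defaults to diss=FALSE, and a dist object to diss=TRUE.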
ncinput: vector of logicals, of the same length as
clustermethod. TRUE indicates that the corresponding clustering
method requires the number of clusters as input and will not estimate
the number of clusters itself.
clustermethodpars: list of the same length as clustermethod.
Specifies parameters for all involved clustering methods. Its jth
entry is passed to clustermethod number j. Can be an empty entry in
case all defaults are used for a clustering method. The number of
clusters does not need to be specified here.
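One way to build such a parameter list in base R is sketched below. It assumes, as in the example at the end of this page, that an unset (NULL) entry counts as "empty", and it uses the method parameter of hclustCBI shown there.

```r
# Three methods; the CBI-function hclustCBI is used twice.
clustermethod <- c("kmeansCBI", "hclustCBI", "hclustCBI")

# Pre-allocate one (empty) entry per method, then fill in only the
# entries that need non-default parameters.
clustermethodpars <- vector("list", length(clustermethod))
clustermethodpars[[2]] <- list(method = "complete")
clustermethodpars[[3]] <- list(method = "average")

# Entry j belongs to clustermethod[j].
str(clustermethodpars)
```

Pre-allocating with the length of clustermethod keeps the two objects aligned when methods are added or removed.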
npstats: logical. If TRUE, distrsimilarity is called and the two
validity statistics computed there are added. These require
diss=FALSE.
trace: logical. If TRUE, some runtime information is printed.
pamcrit: logical. If TRUE, the average distance of points to their
respective cluster centroids is computed (criterion of the PAM
clustering method, validation criterion pamc); centroids are chosen
so that they minimise this criterion for the given clustering. Passed
on to cqcluster.stats.
snnk: integer. Number of neighbours used in coefficient of
variation of distance to nearest within cluster neighbour, the
cvnnd-statistic (clusters with snnk or fewer points are ignored for
this). Passed on to cqcluster.stats.
dnnk: integer. Number of nearest neighbors to use for
dissimilarity to the uniform in case that npstats=TRUE;
nnk-argument to be passed on to distrsimilarity.
nnruns: integer. Number of runs of stupidknn (random clusterings).
kmruns: integer. Number of runs of stupidkcentroids (random
clusterings).
multicore: logical. If TRUE, parallel computing is used through
the function mclapply from package parallel; read warnings there if
you intend to use this; it won't work on Windows.
cores: integer. Number of cores for parallelisation.
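A guarded way to pick the number of cores and to call mclapply might look as follows. This is a sketch with the parallel package only; the helper square and the fallback to lapply on Windows are illustrative choices, not what clusterbenchstats does internally.

```r
library(parallel)

# detectCores() can return NA on some platforms, so guard against that
# and never go below one core.
n_detected <- detectCores()
cores <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)

# mclapply relies on forking, which is unavailable on Windows;
# fall back to a plain lapply there.
square <- function(i) i^2
squares <- if (.Platform$OS.type == "windows") {
  lapply(1:4, square)
} else {
  mclapply(1:4, square, mc.cores = cores)
}

unlist(squares)
```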
useallmethods: logical, to be passed on to cgrestandard. If FALSE,
only random clustering results are used for standardisation. If TRUE,
clustering results from all methods are used.
useallg: logical, to be passed on to cgrestandard. If TRUE,
standardisation uses results from all numbers of clusters in G. If
FALSE, standardisation of results for a specific number of clusters
only uses results from that number of clusters.
...: further arguments to be passed on to cqcluster.stats through
clustatsum (no effect in print.clusterbenchstats).
x: object of class "clusterbenchstats".
The output of clusterbenchstats is a big list of lists comprising
lists cm, stat, sim, qstat, sstat, statistics
cm: output object of cluster.magazine, see there for details.
Clustering of all methods and numbers of clusters on the dataset
data.
stat: object of class "valstat", see valstat.object for details.
Unstandardised cluster validation statistics.
sim: output object of randomclustersim, see there. Validity indexes
from random clusterings used for standardisation of validation
statistics on data.
qstat: object of class "valstat", see valstat.object for details.
Cluster validation statistics standardised by random clusterings,
output of cgrestandard based on percentages, i.e., with
percentage=TRUE.
sstat: object of class "valstat", see valstat.object for details.
Cluster validation statistics standardised by random clusterings,
output of cgrestandard based on mean and standard deviation, i.e.,
with percentage=FALSE.
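The difference between the two kinds of standardisation can be illustrated with a rough base R sketch. The index values are invented, and the formulas are simplified assumptions about what "based on percentages" and "based on mean and standard deviation" mean; see cgrestandard for the actual definitions.

```r
# Hypothetical values of one validation index on random clusterings.
random_values <- c(0.30, 0.35, 0.40, 0.45, 0.50)

# Hypothetical value of the same index for a clustering of interest.
observed <- 0.48

# Mean/standard deviation version: how many standard deviations the
# observed value lies above the mean of the random clusterings.
z_standardised <- (observed - mean(random_values)) / sd(random_values)

# Percentage version: share of random clusterings with a smaller value.
pct_standardised <- mean(random_values < observed)
```

Both versions put indexes with different ranges on a common scale, so that values of different indexes can be compared and aggregated.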
Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282
valstat.object, cluster.magazine, kmeansCBI, cqcluster.stats,
clustatsum, cgrestandard
# NOT RUN {
set.seed(20000)
options(digits=3)
face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
clustermethod <- c("kmeansCBI","hclustCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
clustermethodpars <- list()
clustermethodpars[[2]] <- clustermethodpars[[3]] <- list()
clustermethodpars[[2]]$method <- "complete"
clustermethodpars[[3]]$method <- "average"
methodname <- c("kmeans","complete","average")
cbs <- clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
methodnames=methodname,distmethod=rep(FALSE,3),
clustermethodpars=clustermethodpars,nnruns=2,kmruns=2)
print(cbs)
print(cbs$qstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1))
# The weights are weights for the validation statistics ordered as in
# cbs$qstat$statistics for computation of an aggregated index, see
# ?print.valstat.
# }