clusterbenchstats: Run and validate many clusterings

Description

This runs the methodology explained in Hennig (2017). It runs a user-specified set of clustering methods (CBI-functions, see kmeansCBI) with several numbers of clusters on a dataset, and computes many cluster validation indexes. In order to explore the variation of these indexes, random clusterings on the data are generated, and validation indexes are standardised by use of the random clusterings in order to make them comparable and differences between values interpretable.

The function print.valstat can be used to provide weights for the cluster validation statistics, and will then compute a weighted validation index that can be used to compare all clusterings.
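As a rough illustration of such a weighted index, the base R sketch below combines standardised statistics for two clusterings. The statistic values are invented, and the weighted mean is an assumption about how the aggregation works; see ?print.valstat for the definition actually used.

```r
# Hypothetical standardised validation statistics for two clusterings
# (rows: clusterings, columns: statistics); all values are invented.
stats <- rbind(
  kmeans_2   = c(avewithin = 0.8, sindex = 0.4, asw = 0.6),
  complete_2 = c(avewithin = 0.5, sindex = 0.9, asw = 0.7)
)

# User-chosen weights, one per statistic; a weight of 0 drops a statistic.
weights <- c(avewithin = 1, sindex = 0, asw = 1)

# Assumed aggregation: weighted mean of the statistics of each clustering.
aggregate_index <- as.vector(stats %*% weights) / sum(weights)
names(aggregate_index) <- rownames(stats)
print(aggregate_index)
```

A larger weight makes the corresponding clustering characteristic count more when clusterings are ranked.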

Usage

clusterbenchstats(data,G,diss = inherits(data, "dist"),
                  scaling=TRUE, clustermethod,
                  methodnames=clustermethod,
                  distmethod=rep(TRUE,length(clustermethod)),
                  ncinput=rep(TRUE,length(clustermethod)),
                  clustermethodpars,
                  npstats=FALSE,
                  trace=TRUE,
                  pamcrit=TRUE,snnk=2,
                  dnnk=2,
                  nnruns=100,kmruns=100,
                  multicore=FALSE,cores=detectCores()-1,
                  useallmethods=TRUE,
                  useallg=FALSE,...)
# S3 method for clusterbenchstats
print(x,...)

Arguments

data

data matrix or dist-object.

G

vector of integers. Numbers of clusters to consider.

diss

logical. If TRUE, the data matrix is assumed to be a distance/dissimilarity matrix; otherwise it is an observations-by-variables matrix.

scaling

either a logical or a numeric vector of length equal to the number of columns of data. If FALSE, data won't be scaled; otherwise scaling is passed on to scale as the argument scale.
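The two kinds of value can be tried directly with base R's scale. This is a minimal sketch on toy data, assuming scale's default centring is kept; the matrix x and the divisors are invented for illustration.

```r
# Toy data: 5 observations, 2 variables on very different scales.
x <- cbind(a = c(1, 2, 3, 4, 5), b = c(10, 30, 20, 50, 40))

# Logical TRUE: each centred column is divided by its standard deviation.
x_std <- scale(x, scale = TRUE)

# Numeric vector: each centred column is divided by the supplied value.
x_div <- scale(x, scale = c(2, 10))

print(apply(x_std, 2, sd))
print(x_div)
```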

clustermethod

vector of strings specifying names of CBI-functions (see kmeansCBI). These are the clustering methods to be applied.

methodnames

vector of strings with user-chosen names for clustering methods, one for every method in clustermethod. These can be used to distinguish different methods run by the same CBI-function but with different parameter values such as complete and average linkage for hclustCBI.

distmethod

vector of logicals, of the same length as clustermethod. TRUE means that the clustering method operates on distances. If diss=TRUE, all entries have to be TRUE. Otherwise, if an entry is TRUE, the corresponding method will be applied to dist(data).
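The difference between the two kinds of entry can be sketched with base R functions standing in for the CBI-functions: kmeans takes the data matrix itself, whereas hclust takes dissimilarities. The toy data and the choice of stand-in functions are assumptions made for illustration only.

```r
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))

# Entry FALSE: the method works on the observations-by-variables matrix.
km <- kmeans(x, centers = 2, nstart = 5)

# Entry TRUE: the method works on dissimilarities, here dist(x).
hc <- cutree(hclust(dist(x), method = "average"), k = 2)

print(table(kmeans = km$cluster, hclust = hc))
```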

ncinput

vector of logicals, of the same length as clustermethod. TRUE indicates that the corresponding clustering method requires the number of clusters as input and will not estimate the number of clusters itself.

clustermethodpars

list of the same length as clustermethod. Specifies parameters for all involved clustering methods. Its jth entry is passed to clustermethod number j. An entry can be left empty if all defaults are used for a clustering method. The number of clusters does not need to be specified here.

npstats

logical. If TRUE, distrsimilarity is called and the two validity statistics computed there are added. These require diss=FALSE.

trace

logical. If TRUE, some runtime information is printed.

pamcrit

logical. If TRUE, the average distance of points to their respective cluster centroids is computed (criterion of the PAM clustering method, validation criterion pamc); centroids are chosen so that they minimise this criterion for the given clustering. Passed on to cqcluster.stats.

snnk

integer. Number of neighbours used in coefficient of variation of distance to nearest within cluster neighbour, the cvnnd-statistic (clusters with snnk or fewer points are ignored for this). Passed on to cqcluster.stats.

dnnk

integer. Number of nearest neighbours to use for dissimilarity to the uniform if npstats=TRUE; nnk-argument to be passed on to distrsimilarity.

nnruns

integer. Number of runs of stupidknn (random clusterings).

kmruns

integer. Number of runs of stupidkcentroids (random clusterings).
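To give an idea of what a random centroid-based clustering looks like, here is a minimal base R sketch that draws k data points as centroids and assigns every point to the nearest one. The function random_kcentroids is hypothetical and written for illustration; it is not the stupidkcentroids implementation, whose details may differ.

```r
# Assign every point to the nearest of k centroids drawn at random
# from the data (illustrative stand-in, not the fpc function).
random_kcentroids <- function(x, k) {
  x <- as.matrix(x)
  centroid_rows <- sample(nrow(x), k)
  # Distances between all points; keep the columns of the chosen centroids.
  d <- as.matrix(dist(x))[, centroid_rows, drop = FALSE]
  list(centroids = centroid_rows,
       partition = max.col(-d, ties.method = "first"))
}

set.seed(42)
x <- matrix(rnorm(40), ncol = 2)
rc <- random_kcentroids(x, k = 3)
print(rc$centroids)
print(table(rc$partition))
```

Repeating such a draw many times gives a reference distribution of validation index values for clusterings that ignore the data structure.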

multicore

logical. If TRUE, parallel computing is used through the function mclapply from package parallel; read warnings there if you intend to use this; it won't work on Windows.

cores

integer. Number of cores for parallelisation.
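The following sketch shows the general pattern of calling mclapply with a guarded number of cores. The fallback to one core and the toy k-means task are assumptions made for illustration; they are not taken from clusterbenchstats.

```r
library(parallel)

# Fall back to a single core on Windows, where forking is unavailable,
# and when the number of cores cannot be detected.
n_detected <- detectCores()
cores <- if (.Platform$OS.type == "windows" || is.na(n_detected)) {
  1L
} else {
  max(1L, n_detected - 1L)
}

# Toy task: one k-means run per number of clusters.
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
fits <- mclapply(2:4, function(k) kmeans(x, centers = k)$tot.withinss,
                 mc.cores = cores)
print(unlist(fits))
```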

useallmethods

logical, to be passed on to cgrestandard. If FALSE, only random clustering results are used for standardisation. If TRUE, clustering results from all methods are used.

useallg

logical, to be passed on to cgrestandard. If TRUE, standardisation uses results from all numbers of clusters in G. If FALSE, standardisation of results for a specific number of clusters only uses results from that number of clusters.
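The effect of useallmethods and useallg on the set of reference values can be sketched in base R as follows. The data frame, its values, and the helper reference_pool are hypothetical, and the sketch assumes that useallmethods=TRUE adds the method results to the random ones; cgrestandard may organise this differently.

```r
# Hypothetical values of one validation statistic, with their origin.
results <- data.frame(
  source = c("random", "random", "random", "random", "method", "method"),
  g      = c(2, 2, 3, 3, 2, 3),
  value  = c(0.30, 0.35, 0.40, 0.50, 0.60, 0.70)
)

# Select the reference values used to standardise results for g_target.
reference_pool <- function(results, g_target, useallmethods, useallg) {
  keep <- rep(TRUE, nrow(results))
  if (!useallmethods) keep <- keep & results$source == "random"
  if (!useallg) keep <- keep & results$g == g_target
  results$value[keep]
}

print(reference_pool(results, 2, useallmethods = FALSE, useallg = FALSE))
print(reference_pool(results, 2, useallmethods = TRUE, useallg = TRUE))
```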

...

further arguments to be passed on to cqcluster.stats through clustatsum (no effect in print.clusterbenchstats).

x

object of class "clusterbenchstats".

Value

The output of clusterbenchstats is a big list of lists comprising lists cm, stat, sim, qstat, sstat, statistics

cm

output object of cluster.magazine, see there for details. Clusterings from all methods and numbers of clusters on the dataset data.

stat

object of class "valstat", see valstat.object for details. Unstandardised cluster validation statistics.

sim

output object of randomclustersim, see there. Validity indexes from random clusterings used for standardisation of validation statistics on data.

qstat

object of class "valstat", see valstat.object for details. Cluster validation statistics standardised by random clusterings, output of cgrestandard based on percentages, i.e., with percentage=TRUE.

sstat

object of class "valstat", see valstat.object for details. Cluster validation statistics standardised by random clusterings, output of cgrestandard based on mean and standard deviation, i.e., with percentage=FALSE.
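The two kinds of standardisation behind qstat and sstat can be illustrated with a few lines of base R. The values are invented, and the formulas are a simplified sketch of the idea; the exact computation in cgrestandard may differ in detail.

```r
# Hypothetical values of one validation statistic.
random_values <- c(0.30, 0.35, 0.40, 0.50, 0.45)  # from random clusterings
observed      <- 0.48                             # from a real clustering

# Mean/standard-deviation standardisation (in the spirit of sstat).
z_score <- (observed - mean(random_values)) / sd(random_values)

# Percentage-style standardisation (in the spirit of qstat):
# share of random values that fall below the observed one.
share_below <- mean(random_values < observed)

print(z_score)
print(share_below)
```

Either way, the standardised value expresses how the clustering compares with what random clusterings achieve on the same data, which makes different indexes comparable.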

References

Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282

Examples

# NOT RUN {
  
  library(fpc)  # provides rFace, clusterbenchstats and the CBI-functions
  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  clustermethod <- c("kmeansCBI","hclustCBI","hclustCBI")
# A clustering method can be used more than once, with different
# parameters
  clustermethodpars <- list()
  clustermethodpars[[2]] <- clustermethodpars[[3]] <- list()
  clustermethodpars[[2]]$method <- "complete"
  clustermethodpars[[3]]$method <- "average"
  methodname <- c("kmeans","complete","average")
  cbs <-  clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodnames=methodname,distmethod=rep(FALSE,3),
    clustermethodpars=clustermethodpars,nnruns=2,kmruns=2)
  print(cbs)
  print(cbs$qstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1))
# The weights are weights for the validation statistics ordered as in
# cbs$qstat$statistics for computation of an aggregated index, see
# ?print.valstat.

# }
