plot.valstat: Simulation-standardised plot and print of cluster validation statistics

Description

Visualisation and print function for cluster validation output compared to results on simulated random clusterings. The print method can also be used to compute and print an aggregated cluster validation index.

Unlike for many other plot methods, the additional arguments of plot.valstat are essential. print.valstat should make good sense with the defaults, but for computing the aggregate index, aggregate and weights need to be set.

Usage

# S3 method for valstat
plot(x,simobject=NULL,statistic="sindex",
     xlim=NULL,ylim=c(0,1),
     nmethods=length(x)-5,
     col=1:nmethods,cex=1,pch=c("c","f","a","n"),
     simcol=rep(grey(0.7),4),
     shift=c(-0.1,-1/3,1/3,0.1),include.othernc=NULL,...)

# S3 method for valstat
print(x,statistics=x$statistics,
      nmethods=length(x)-5,aggregate=FALSE,
      weights=NULL,digits=2,
      include.othernc=NULL,...)

Value

print.valstat returns the results table as an invisible object.

Arguments

x: object of class "valstat", such as sublists stat, qstat, sstat of clusterbenchstats-output.
simobject: list of simulation results as produced by randomclustersim and documented there; typically sublist sim of clusterbenchstats-output.
statistic: one of "avewithin","mnnd","variation", "diameter","gap","sindex","minsep","asw","dindex","denscut", "highdgap","pg","withinss","entropy","pamc","kdnorm","kdunif","dmode"; validation statistic to be plotted.
xlim: passed on to plot. Default is the range of all involved numbers of clusters, minimum minus 0.5 to maximum plus 0.5.
ylim: passed on to plot.
nmethods: integer. Number of clustering methods to involve (these are those from number 1 to nmethods specified in x$name).
col: colours used for the different clustering methods.
cex: passed on to plot.
pch: vector of symbols for random clustering results from stupidkcentroids, stupidkfn, stupidkaven, stupidknn. To be passed on to plot.
simcol: vector of colours used for random clustering results in order stupidkcentroids, stupidkfn, stupidkaven, stupidknn.
shift: numeric vector. Indicates the amount to which the results from stupidkcentroids, stupidkfn, stupidkaven, stupidknn are plotted to the right of their respective number of clusters (negative numbers plot to the left).
include.othernc: this indicates whether methods should be included that estimated their number of clusters themselves and gave a result outside the standard range as given by x$minG and x$maxG. If not NULL, this is a list of integer vectors of length 2. The first number is the number of the clustering method (the order is determined by argument x$name), the second number is the number of clusters for those methods that estimate the number of clusters themselves and estimated a number outside the standard range. Normally what will be used here, if not NULL, is the output parameter cm$othernc of clusterbenchstats, see also cluster.magazine.
statistics: vector of character strings specifying the validation statistics that will be included in the output (unless you want to restrict the output for some reason, the default should be fine).
aggregate: logical. If TRUE, an aggregate validation statistic will be computed as the weighted mean of the involved statistics. This requires weights to be set. In order for this to make sense, values of the validation statistics should be comparable, which is achieved by standardisation in clusterbenchstats. Accordingly, x should be the qstat or sstat-component of the clusterbenchstats-output rather than the stat-component.
weights: numeric vector. Weights for computation of the aggregate statistic in case that aggregate=TRUE, with one weight per validation statistic in the order given by statistics (the Examples supply 16 weights, one for each statistic).
digits: minimal number of significant digits, passed on to print.table.
...: no effect.
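
As a rough illustration of how xlim, shift and include.othernc fit together, the following base-R sketch builds these arguments by hand. All values here (the cluster range 2:5 and the entries of othernc) are made up for illustration; in practice include.othernc would normally come from cm$othernc of clusterbenchstats.

```r
# Hypothetical range of numbers of clusters (illustration only).
G <- 2:5

# Default xlim as described above: minimum minus 0.5 to maximum plus 0.5.
xlim <- c(min(G) - 0.5, max(G) + 0.5)

# Default shifts for the four random clustering methods; negative
# values plot to the left of the respective number of clusters.
shift <- c(-0.1, -1/3, 1/3, 0.1)
names(shift) <- c("stupidkcentroids", "stupidkfn", "stupidkaven", "stupidknn")

# Horizontal plotting positions: one row per number of clusters,
# one column per random clustering method.
xpos <- outer(G, shift, "+")
rownames(xpos) <- G

# include.othernc: list of integer vectors of length 2,
# c(method number, number of clusters). Made-up entries:
# method 2 estimated 7 clusters, method 3 estimated 9.
othernc <- list(c(2L, 7L), c(3L, 9L))

print(xlim)
print(round(xpos, 2))
```

With the default shifts, every shifted position stays inside the default xlim, which is presumably why the limits extend half a unit beyond the range of cluster numbers.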

Author

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

Details

print.valstat, at least with aggregate=TRUE, makes more sense for the qstat or sstat-component of the clusterbenchstats-output than for the stat-component. In contrast, plot.valstat should be run with the stat-component if simobject is specified, because the simulated cluster validity statistics are unstandardised and need to be compared with unstandardised values on the dataset of interest.
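
To see why standardised and raw values should not be mixed in one plot, consider the following base-R sketch. It assumes, purely for illustration, that standardisation means centring and scaling a statistic by the mean and standard deviation of the pooled values; the actual procedure used by clusterbenchstats is documented there and may differ. All numbers are made up.

```r
# Made-up raw values of one validation statistic in [0, 1].
sim_raw  <- c(0.31, 0.35, 0.28, 0.40)   # random clusterings
data_raw <- 0.62                        # clustering of interest

# Illustrative standardisation (an assumption, not the fpc code).
pooled <- c(sim_raw, data_raw)
standardise <- function(v) (v - mean(pooled)) / sd(pooled)
data_std <- standardise(data_raw)

# Raw vs raw: both numbers are on the same scale.
raw_gap <- data_raw - max(sim_raw)

# Standardised vs raw: the numbers live on different scales, so the
# difference has no useful interpretation. The standardised value also
# falls outside the default ylim=c(0,1).
mixed_gap <- data_std - max(sim_raw)

print(c(raw_gap = raw_gap, data_std = data_std, mixed_gap = mixed_gap))
```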

print.valstat will print all values for all validation indexes. The aggregated index (in case of aggregate=TRUE and set weights) will be printed last.
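
The aggregation described above can be sketched in base R without fpc. This is only an illustration under assumptions: the matrices, method names and weights are made up, and the weighted mean is computed as the sum of weight times value divided by the sum of the weights, which may differ in detail from what print.valstat does internally.

```r
# Made-up standardised values of three validation statistics.
# Rows: clustering methods; columns: numbers of clusters.
methods <- c("kmeans", "ward", "single")
nc <- c("2", "3")
stats <- list(
  avewithin = matrix(c(0.5, 0.2, -1.0, 0.8, 0.4, -0.6), nrow = 3,
                     dimnames = list(methods, nc)),
  sindex    = matrix(c(0.1, 0.3, 0.9, 0.2, 0.6, 0.4), nrow = 3,
                     dimnames = list(methods, nc)),
  asw       = matrix(c(1.0, 0.7, -0.5, 1.2, 0.9, -0.8), nrow = 3,
                     dimnames = list(methods, nc))
)

# One weight per statistic (hypothetical choice); a weight of 0
# removes a statistic from the aggregate.
w <- c(avewithin = 1, sindex = 0, asw = 1)

# Weighted mean over statistics, computed entry by entry.
weighted <- Map(function(m, wi) wi * m, stats, w[names(stats)])
aggregate_index <- Reduce(`+`, weighted) / sum(w)

print(round(aggregate_index, 2))
```

Such an aggregate is only meaningful if the input statistics are on comparable scales, which is why the standardised components are the natural input.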

References

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

Examples


  library(fpc)
  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  clustermethod <- c("kmeansCBI","hclustCBI","hclustCBI")
  clustermethodpars <- list()
  clustermethodpars[[2]] <- clustermethodpars[[3]] <- list()
  clustermethodpars[[2]]$method <- "ward.D2"
  clustermethodpars[[3]]$method <- "single"
  methodname <- c("kmeans","ward","single")
  cbs <-  clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodname=methodname,distmethod=rep(FALSE,3),
    clustermethodpars=clustermethodpars,nnruns=2,kmruns=2,fnruns=2,avenruns=2)
  plot(cbs$stat,cbs$sim)
  plot(cbs$stat,cbs$sim,statistic="dindex")
  plot(cbs$stat,cbs$sim,statistic="avewithin")
  pcbs <- print(cbs$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0))
# Some of the values are "NaN" because, with so few runs of the stupid
# clustering methods, there is no variation. If this happens in a real
# application, nnruns etc. should be chosen higher than 2.
# Also useallg=TRUE in clusterbenchstats may help.
#
# Finding the best aggregated value:
  mpcbs <- as.matrix(pcbs[[17]][,-1])
  which(mpcbs==max(mpcbs),arr.ind=TRUE)
# row=1 refers to the first clustering method kmeansCBI,
# col=2 refers to the second number of clusters, which is 3 in G=2:3.
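
The lookup of the best aggregated value can also be tried without running clusterbenchstats. The following base-R sketch uses a made-up matrix with hypothetical row and column labels and translates the indices returned by which into a method name and a number of clusters.

```r
# Made-up aggregated index; rows are methods, columns numbers of clusters.
agg <- matrix(c(0.75, 0.45, -0.75, 1.00, 0.65, -0.70), nrow = 3,
              dimnames = list(c("kmeans", "ward", "single"), c("2", "3")))

# na.rm = TRUE guards against NaN entries such as those mentioned above.
best <- which(agg == max(agg, na.rm = TRUE), arr.ind = TRUE)

# Translate the indices into readable labels.
best_method <- rownames(agg)[best[1, "row"]]
best_nc <- as.integer(colnames(agg)[best[1, "col"]])
print(c(method = best_method, clusters = best_nc))
```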
