fpc (version 2.2-13)

plot.valstat: Simulation-standardised plot and print of cluster validation statistics

Description

Visualisation and print function for cluster validation output compared to results on simulated random clusterings. The print method can also be used to compute and print an aggregated cluster validation index.

Unlike for many other plot methods, the additional arguments of plot.valstat are essential. print.valstat should make good sense with the defaults, but for computing the aggregate index, aggregate=TRUE and weights need to be set.

Usage

# S3 method for valstat
plot(x,simobject=NULL,statistic="sindex",
     xlim=NULL,ylim=c(0,1),
     nmethods=length(x)-5,
     col=1:nmethods,cex=1,pch=c("c","f","a","n"),
     simcol=rep(grey(0.7),4),
     shift=c(-0.1,-1/3,1/3,0.1),include.othernc=NULL,...)

# S3 method for valstat
print(x,statistics=x$statistics,
      nmethods=length(x)-5,aggregate=FALSE,
      weights=NULL,digits=2,
      include.othernc=NULL,...)

Value

print.valstat returns the results table invisibly.

Arguments

x

object of class "valstat", such as sublists stat, qstat, sstat of clusterbenchstats-output.

simobject

list of simulation results as produced by randomclustersim and documented there; typically sublist sim of clusterbenchstats-output.

statistic

one of "avewithin","mnnd","variation", "diameter","gap","sindex","minsep","asw","dindex","denscut", "highdgap","pg","withinss","entropy","pamc","kdnorm","kdunif","dmode"; validation statistic to be plotted.

xlim

passed on to plot. Default is the range of all involved numbers of clusters, minimum minus 0.5 to maximum plus 0.5.

ylim

passed on to plot.

nmethods

integer. Number of clustering methods to involve (these are those from number 1 to nmethods specified in x$name).

col

colours used for the different clustering methods.

cex

passed on to plot.

pch

vector of symbols for random clustering results from stupidkcentroids, stupidkfn, stupidkaven, stupidknn. To be passed on to plot.

simcol

vector of colours used for random clustering results in order stupidkcentroids, stupidkfn, stupidkaven, stupidknn.

shift

numeric vector. Indicates the amount to which the results from stupidkcentroids, stupidkfn, stupidkaven, stupidknn are plotted to the right of their respective number of clusters (negative numbers plot to the left).

include.othernc

this indicates whether methods should be included that estimated their number of clusters themselves and gave a result outside the standard range as given by x$minG and x$maxG. If not NULL, this is a list of integer vectors of length 2. The first number is the number of the clustering method (the order is determined by argument x$name), the second number is the number of clusters for those methods that estimate the number of clusters themselves and estimated a number outside the standard range. Normally what will be used here, if not NULL, is the output parameter cm$othernc of clusterbenchstats, see also cluster.magazine.

statistics

vector of character strings specifying the validation statistics that will be included in the output (unless you want to restrict the output for some reason, the default should be fine).

aggregate

logical. If TRUE, an aggregate validation statistic will be computed as the weighted mean of the involved statistics. This requires weights to be set. For this to make sense, the values of the validation statistics should be comparable, which is achieved by standardisation in clusterbenchstats. Accordingly, x should be the qstat or sstat-component of the clusterbenchstats-output rather than the stat-component.

weights

numeric vector. Weights for computing the aggregate statistic when aggregate=TRUE, with one weight per validation statistic in the order given by statistics.

digits

minimal number of significant digits, passed on to print.table.

...

no effect.
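
As a small illustration of how xlim and shift relate to the numbers of clusters, the following base R sketch computes the default xlim described above and the horizontal positions at which results of the four random clustering methods would be placed. The vector G and all variable names are assumptions made for illustration; this is not code taken from fpc.

```r
# Hypothetical range of numbers of clusters (not taken from a real run).
G <- 2:4

# Default xlim as described above: minimum minus 0.5 to maximum plus 0.5.
xlim_default <- c(min(G) - 0.5, max(G) + 0.5)

# Default shift values, in the order stupidkcentroids, stupidkfn,
# stupidkaven, stupidknn.
shift <- c(-0.1, -1/3, 1/3, 0.1)
names(shift) <- c("stupidkcentroids", "stupidkfn", "stupidkaven", "stupidknn")

# Horizontal positions: one row per number of clusters,
# one column per random clustering method.
xpos <- outer(G, shift, "+")
dimnames(xpos) <- list(paste0("G=", G), names(shift))

print(xlim_default)
print(round(xpos, 2))
```

With the default shifts, every position stays inside the default xlim, because the largest absolute shift (1/3) is smaller than the 0.5 margin.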

Details

print.valstat, at least with aggregate=TRUE, makes more sense for the qstat or sstat-component of the clusterbenchstats-output than for the stat-component. In contrast, plot.valstat should be run with the stat-component if simobject is specified, because the simulated cluster validity statistics are unstandardised and need to be compared with unstandardised values on the dataset of interest.

print.valstat will print all values for all validation indexes; the aggregated index (if aggregate=TRUE and weights are set) will be printed last.
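
The aggregation can be pictured as a weighted mean over standardised statistics. The sketch below uses made-up values for a single clustering and three statistic names from the list above. It only illustrates the arithmetic and is not the code used inside print.valstat; in particular, dividing by the sum of the weights is an assumption here.

```r
# Made-up standardised values of three validation statistics for one
# clustering (hypothetical numbers, for illustration only).
svalues <- c(asw = 0.8, dindex = -0.3, entropy = 1.1)

# One weight per statistic; a weight of 0 drops that statistic.
weights <- c(1, 0, 1)

# Weighted mean: sum(weights * svalues) / sum(weights).
agg <- weighted.mean(svalues, weights)
print(agg)
```

Statistics with weight 0 do not contribute, which is how the example at the end of this page restricts the aggregate to three of the statistics.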

References

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

See Also

clusterbenchstats, valstat.object, cluster.magazine

Examples

  library(fpc)
  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  clustermethod <- c("kmeansCBI","hclustCBI","hclustCBI")
  clustermethodpars <- list()
  clustermethodpars[[2]] <- clustermethodpars[[3]] <- list()
  clustermethodpars[[2]]$method <- "ward.D2"
  clustermethodpars[[3]]$method <- "single"
  methodname <- c("kmeans","ward","single")
  cbs <-  clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodname=methodname,distmethod=rep(FALSE,3),
    clustermethodpars=clustermethodpars,nnruns=2,kmruns=2,fnruns=2,avenruns=2)
  plot(cbs$stat,cbs$sim)
  plot(cbs$stat,cbs$sim,statistic="dindex")
  plot(cbs$stat,cbs$sim,statistic="avewithin")
  pcbs <- print(cbs$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0))
# Some of the values are "NaN" because, with so few runs of the stupid
# clustering methods, there is no variation. If this happens in a real
# application, nnruns etc. should be set higher than 2.
# Setting useallg=TRUE in clusterbenchstats may also help.
#
# Finding the best aggregated value. Element 17 of pcbs is taken to be the
# aggregated index, which follows the 16 weighted statistics:
  mpcbs <- as.matrix(pcbs[[17]][,-1])
  which(mpcbs==max(mpcbs),arr.ind=TRUE)
# row=1 refers to the first clustering method kmeansCBI,
# col=2 refers to the second number of clusters, which is 3 in G=2:3.
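
If the table of aggregated values itself contains NaN entries, max() returns NaN and the lookup above finds nothing. A minimal base R sketch of a more defensive lookup follows, using a made-up matrix; the numbers and dimnames are hypothetical and do not come from the run above.

```r
# Made-up aggregated values: rows are clustering methods, columns are
# numbers of clusters (hypothetical numbers, for illustration only).
m <- matrix(c(0.2, NaN, 0.1,
              0.9, 0.4, 0.3),
            nrow = 3,
            dimnames = list(c("kmeans", "ward", "single"), c("2", "3")))

# Ignore NaN entries when looking for the maximum; which() skips the
# NA results of comparing NaN with a number.
best <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)
print(best)
```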