randomclustersim: Simulation of validity indexes based on random clusterings

Description

For a given dataset, this function simulates random clusterings using stupidkcentroids, stupidknn, stupidkfn, and stupidkaven. It then computes and stores a set of cluster validity indexes for every clustering.

Usage

randomclustersim(datadist,datanp=NULL,npstats=FALSE,useboot=FALSE,
                      bootmethod="nselectboot",
                      bootruns=25, 
                      G,nnruns=100,kmruns=100,fnruns=100,avenruns=100,
                      nnk=4,dnnk=2,
                      pamcrit=TRUE, 
                      multicore=FALSE,cores=detectCores()-1,monitor=TRUE)

Value

List with components

nn: list, indexed by number of clusters. Every entry is a data frame with nnruns observations, one for each simulation run of stupidknn. The variables of the data frame are avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy, plus pamc if pamcrit=TRUE, and kdnorm, kdunif if npstats=TRUE. All of these are cluster validation indexes, documented as values of clustatsum.
fn: list, indexed by number of clusters. Every entry is a data frame with fnruns observations, one for each simulation run of stupidkfn. The variables of the data frame are avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy, plus pamc if pamcrit=TRUE, and kdnorm, kdunif if npstats=TRUE. All of these are cluster validation indexes, documented as values of clustatsum.
aven: list, indexed by number of clusters. Every entry is a data frame with avenruns observations, one for each simulation run of stupidkaven. The variables of the data frame are avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy, plus pamc if pamcrit=TRUE, and kdnorm, kdunif if npstats=TRUE. All of these are cluster validation indexes, documented as values of clustatsum.
km: list, indexed by number of clusters. Every entry is a data frame with kmruns observations, one for each simulation run of stupidkcentroids. The variables of the data frame are avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy, plus pamc if pamcrit=TRUE, and kdnorm, kdunif if npstats=TRUE. All of these are cluster validation indexes, documented as values of clustatsum.
nnruns: number of involved runs of stupidknn.
fnruns: number of involved runs of stupidkfn.
avenruns: number of involved runs of stupidkaven.
kmruns: number of involved runs of stupidkcentroids.
boot: if useboot=TRUE, stability value: stabk for bootmethod="nselectboot", mean.pred for bootmethod="prediction.strength".
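
The structure of the returned list can be explored as in the following minimal sketch. It assumes that the fpc package is installed and reuses the small artificial dataset from the Examples section; the object name sim is chosen for illustration only. Since the components are indexed by number of clusters, entries for numbers of clusters not in G are expected to be empty.

```r
library(fpc)

set.seed(20000)
face <- rFace(10, dMoNo = 2, dNoEy = 0, p = 2)

# Small numbers of runs, only to keep the sketch fast.
sim <- randomclustersim(dist(face), G = 2:3,
                        nnruns = 2, kmruns = 2, fnruns = 1, avenruns = 1,
                        nnk = 2, monitor = FALSE)

# Top-level components of the result.
names(sim)

# Entry for 2 clusters from the stupidkcentroids runs:
# a data frame with one row per run and one column per validity index.
str(sim$km[[2]])

# A single index across all stupidknn runs with 3 clusters.
sim$nn[[3]]$asw
```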

Arguments

datadist: distances on which validation measures are based, dist object or distance matrix.
datanp: optional observations-by-variables data matrix, see npstats.
npstats: logical. If TRUE, distrsimilarity is called and the two statistics computed there are added to the output. These are based on datanp and require datanp to be specified.
useboot: logical. If TRUE, a stability index (either nselectboot or prediction.strength) is computed as well.
bootmethod: either "nselectboot" or "prediction.strength"; stability index to be used if useboot=TRUE.
bootruns: integer. Number of resampling runs. If useboot=TRUE, passed on as B to nselectboot or M to prediction.strength.
G: vector of integers. Numbers of clusters to consider.
nnruns: integer. Number of runs of stupidknn.
kmruns: integer. Number of runs of stupidkcentroids.
fnruns: integer. Number of runs of stupidkfn.
avenruns: integer. Number of runs of stupidkaven.
nnk: nnk-argument to be passed on to cqcluster.stats.
dnnk: nnk-argument to be passed on to distrsimilarity.
pamcrit: pamcrit-argument to be passed on to cqcluster.stats.
multicore: logical. If TRUE, parallel computing is used through the function mclapply from package parallel; read warnings there if you intend to use this; it won't work on Windows.
cores: integer. Number of cores for parallelisation.
monitor: logical. If TRUE, it will print some runtime information.
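
As an illustration of how several of these arguments interact, the sketch below requests the npstats statistics (which require datanp) and switches on multicore only on non-Windows systems. It assumes that the fpc and parallel packages are available; the helper variables use_multicore and n_cores are hypothetical names introduced here, not part of the package.

```r
library(fpc)
library(parallel)

set.seed(20000)
face <- rFace(10, dMoNo = 2, dNoEy = 0, p = 2)

# mclapply-based parallelisation does not work on Windows,
# so fall back to serial execution there.
use_multicore <- .Platform$OS.type != "windows"

# detectCores() may return NA or 1; make sure at least one core is requested.
n_cores <- max(1L, detectCores() - 1L, na.rm = TRUE)

sim <- randomclustersim(dist(face), datanp = face, npstats = TRUE,
                        G = 2:3,
                        nnruns = 2, kmruns = 2, fnruns = 1, avenruns = 1,
                        nnk = 2,
                        multicore = use_multicore, cores = n_cores,
                        monitor = FALSE)

# With npstats=TRUE the data frames should also contain kdnorm and kdunif.
names(sim$nn[[2]])
```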

Author

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

References

Hennig, C. (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Data Analysis and Applications 1: Clustering and Regression, Modeling-estimating, Forecasting and Data Mining, Volume 2, Wiley, New York 1-24, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. Statistics and Computing, 30, 1523-1544, https://link.springer.com/article/10.1007/s11222-020-09958-2, https://arxiv.org/abs/2002.01822

Examples

  library(fpc)

  set.seed(20000)
  options(digits=3)
  # Small artificial dataset
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  # Few runs only, to keep the example fast
  rmx <- randomclustersim(dist(face),datanp=face,npstats=TRUE,G=2:3,
    nnruns=2,kmruns=2,fnruns=1,avenruns=1,nnk=2)
  # Not run:
  if (FALSE) {
    rmx$km # Produces slightly different but basically identical results on ATLAS
  }
  rmx$aven
  rmx$fn
  rmx$nn
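
One possible use of the simulated values is as a baseline for a clustering of actual interest. The sketch below is not the calibration procedure of the cited papers, only a rough illustration. It assumes that clustatsum can be called as clustatsum(datadist, clustering) and that its default for nnk (passed on to cqcluster.stats) matches the nnk=2 used in the simulation. The names dface, complete3, observed, sim and frac_worse are chosen here for illustration.

```r
library(fpc)

set.seed(20000)
face <- rFace(20, dMoNo = 2, dNoEy = 0, p = 2)
dface <- dist(face)

# A clustering of actual interest: complete linkage cut at 3 clusters.
complete3 <- cutree(hclust(dface), 3)
observed <- clustatsum(dface, complete3)

# Random clusterings with the same number of clusters as a baseline.
sim <- randomclustersim(dface, G = 3,
                        nnruns = 10, kmruns = 10, fnruns = 10, avenruns = 10,
                        nnk = 2, monitor = FALSE)

# Pool the asw values of all random clusterings with 3 clusters.
random_asw <- c(sim$nn[[3]]$asw, sim$fn[[3]]$asw,
                sim$aven[[3]]$asw, sim$km[[3]]$asw)

# Fraction of random clusterings with a lower asw than the observed one.
frac_worse <- mean(random_asw < observed$asw)
frac_worse
```

A value of frac_worse close to 1 would suggest that the observed clustering does better on this index than most of the random clusterings on the same data.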
