WeightedCluster (version 1.8-0)

bootclustrange: Cluster Quality Indices estimation by subsampling

Description

bootclustrange estimates the quality of the clustering based on subsamples of the data to avoid computational overload.

Usage

bootclustrange(object, seqdata, seqdist.args = list(method = "LCS"),
               R = 100, sample.size = 1000, parallel = FALSE,
               progressbar = FALSE, sampling = "clustering",
               strata = NULL)
# S3 method for bootclustrange
plot(x, stat = "noCH", legendpos = "bottomright",
     norm = "none", withlegend = TRUE, lwd = 1,
     col = NULL, ylab = "Indicators",
     xlab = "N clusters", conf.int = 0.95,
     ci.method = "perc", ci.alpha = 0.3,
     line = "median", ...)
# S3 method for bootclustrange
print(x, digits = 2, bootstat = c("mean"), ...)

Value

A clustrange object (see as.clustrange) containing the bootstrapped values.

Arguments

object

A seqclararange object or a data.frame with the clustering to be evaluated.

seqdata

State sequence object of class stslist. The sequence data to use. Use seqdef to create such an object.

seqdist.args

List of arguments passed to seqdist for computing the distances.

R

Numeric. The number of subsamples to use.

sample.size

Numeric. The size of each subsample. Values between 1000 and 10 000 are recommended.

parallel

Logical. Whether to initialize the parallel processing of the future package using the default multisession strategy. If FALSE (default), the current plan is used. If TRUE, a multisession plan is initialized using default values.

progressbar

Logical. Whether to initialize a progress bar using the future package. If FALSE (default), the current progress bar handlers are used. If TRUE, new global progress bar handlers are initialized.

sampling

Character. The sampling procedure to be used. One of "clustering" (default; the sampling is stratified by the clustering with the maximum number of clusters), "medoids" (the medoids are added to each subsample), "strata" (the sampling is stratified by the strata argument) or "random" (simple random sampling).
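The difference between random and stratified subsampling can be illustrated with a minimal base R sketch. This is not the package's implementation: the membership vector, the proportional allocation rule and all variable names are assumptions made for illustration only.

```r
set.seed(42)

# Hypothetical cluster memberships for 1000 cases, with one rare cluster.
membership <- sample(c("a", "b", "c"), 1000, replace = TRUE,
                     prob = c(0.70, 0.25, 0.05))
sample.size <- 100

# Simple random sampling: a rare cluster may end up with very few cases.
random_idx <- sample(length(membership), sample.size)

# Stratified sampling with proportional allocation (an assumed rule):
# each cluster contributes in proportion to its size, and at least one case.
strat_idx <- unlist(lapply(
  split(seq_along(membership), membership),
  function(ids) {
    k <- max(1, round(sample.size * length(ids) / length(membership)))
    ids[sample.int(length(ids), min(k, length(ids)))]
  }
), use.names = FALSE)

# Compare the cluster composition of the two subsamples.
table(membership[random_idx])
table(membership[strat_idx])
```

Because of rounding, the stratified subsample in this sketch may differ from sample.size by one case.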

strata

An optional stratification variable.

x

A bootclustrange object to be plotted or printed.

stat

Character. The list of statistics to plot, "noCH" to plot all statistics except "CH" and "CHsq", or "all" for all statistics. See as.clustrange for a list of possible values.

legendpos

Character. Legend position, see legend.

norm

Character. Normalization method of the statistics. One of "none" (no normalization), "range" (given as (value - min)/(max - min)), "zscore" (adjusted by mean and standard deviation) or "zscoremed" (adjusted by median and median of the difference to the median).
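The normalization rules described above can be written out as a short base R sketch. The helper name normalize is hypothetical, and the package code may differ in detail. In particular, taking the difference to the median in absolute value for "zscoremed" is an assumption.

```r
# Hypothetical helper mirroring the four documented values of norm.
normalize <- function(x, norm = c("none", "range", "zscore", "zscoremed")) {
  norm <- match.arg(norm)
  switch(norm,
    none      = x,
    range     = (x - min(x)) / (max(x) - min(x)),
    zscore    = (x - mean(x)) / sd(x),
    # Assumption: the difference to the median is taken in absolute value.
    zscoremed = (x - median(x)) / median(abs(x - median(x)))
  )
}

# Hypothetical values of one quality index for five cluster solutions.
x <- c(1, 2, 3, 4, 10)
normalize(x, "range")
normalize(x, "zscore")
normalize(x, "zscoremed")
```

Normalizing puts indices with different scales on a comparable footing, so that several can be read on one plot.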

withlegend

Logical. If FALSE, the legend is not plotted.

lwd

Numeric. Line width, see par.

col

A vector of line colors, see par. If NULL, a default set of colors is used.

xlab

x axis label.

ylab

y axis label.

conf.int

Confidence level used to build the confidence interval (default: 0.95).

ci.method

Method used to build the confidence interval (only if bootstrap has been used, see R above). One of "none" (do not plot confidence interval), "norm" (based on normal approximation) or "perc" (default, based on percentile).
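As a rough illustration of the two interval types, the sketch below computes a percentile interval and a normal-approximation interval from a vector of bootstrapped values. The simulated values and variable names are hypothetical, and the package may compute its intervals differently.

```r
set.seed(123)

# Hypothetical values of one quality index over 200 subsamples.
boot_values <- rnorm(200, mean = 0.45, sd = 0.03)

conf.int <- 0.95
alpha <- 1 - conf.int

# "perc": empirical quantiles of the bootstrapped values.
ci_perc <- quantile(boot_values, probs = c(alpha / 2, 1 - alpha / 2),
                    names = FALSE)

# "norm": mean plus or minus a normal quantile times the standard deviation.
ci_norm <- mean(boot_values) +
  c(-1, 1) * qnorm(1 - alpha / 2) * sd(boot_values)

ci_perc
ci_norm
```

The percentile interval makes no distributional assumption, which is useful when the bootstrapped values are skewed.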

ci.alpha

Alpha (transparency) color value used to plot the interval.

line

Which value should be plotted by the line? One of "mean" (average over all bootstraps) or "median" (default, median over all bootstraps).

digits

Number of digits to be printed.

bootstat

The summary statistic to use: "mean" or "median".

...

Additional parameters passed to/from methods.

Details

bootclustrange estimates the quality of the clustering based on subsamples of the data to avoid computational overload. It draws R subsamples of sample.size sequences each from seqdata, using the sampling procedure defined by the sampling argument. In each subsample, a distance matrix is computed for the selected sequences using the seqdist.args arguments, and the cluster quality indices are then estimated using as.clustrange.
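The procedure described above can be sketched in base R on toy data. This is only an illustration of the idea, not the package's code: the simulated sequences, the Hamming distance helper and the single pseudo-R2 style index are all assumptions standing in for seqdist and as.clustrange.

```r
set.seed(1)
n <- 500    # number of toy sequences
len <- 8    # sequence length

# Toy data: two groups of categorical sequences (stand-in for seqdata).
group <- rep(1:2, each = n / 2)
seqs <- t(sapply(group, function(g) {
  sample(c("A", "B"), len, replace = TRUE,
         prob = if (g == 1) c(0.9, 0.1) else c(0.1, 0.9))
}))
clustering <- group  # the clustering to be evaluated (hypothetical)

# Pairwise Hamming distances between the rows of a character matrix.
hamming <- function(m) {
  k <- nrow(m)
  d <- matrix(0, k, k)
  for (i in seq_len(k)) {
    d[i, ] <- rowSums(m != matrix(m[i, ], k, ncol(m), byrow = TRUE))
  }
  d
}

# Share of the total discrepancy explained by the partition.
pseudo_r2 <- function(d, cl) {
  ss <- function(dd) sum(dd) / (2 * nrow(dd))
  within <- sum(sapply(unique(cl), function(g) {
    ss(d[cl == g, cl == g, drop = FALSE])
  }))
  1 - within / ss(d)
}

R <- 20            # number of subsamples
sample.size <- 50  # size of each subsample

# The distance matrix is only computed within each subsample, never on all n.
boot <- replicate(R, {
  idx <- sample(n, sample.size)
  pseudo_r2(hamming(seqs[idx, , drop = FALSE]), clustering[idx])
})

summary(boot)
```

Each iteration works on a sample.size by sample.size distance matrix, which is what keeps memory use bounded when the full data set is large.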

The clustering can be specified either as a seqclararange object or a data.frame.
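A possible call sequence is sketched below, assuming the TraMineR and WeightedCluster packages are installed. The bootclustrange arguments follow the Usage section. The seqclararange arguments are not documented on this page and should be treated as assumptions to check against that function's help. The very small R and sample.size values only keep the sketch fast. They are far below the recommended range.

```r
# Runs only if both packages are available.
if (requireNamespace("TraMineR", quietly = TRUE) &&
    requireNamespace("WeightedCluster", quietly = TRUE)) {
  library(TraMineR)
  library(WeightedCluster)

  # Example data shipped with TraMineR; a small subset keeps this quick.
  data(biofam)
  biofam.seq <- seqdef(biofam[1:100, ], 10:25)

  # CLARA clustering for 2 and 3 clusters (assumed argument names, see lead-in).
  bfclara <- seqclararange(biofam.seq, R = 3, sample.size = 10, kvals = 2:3,
                           seqdist.args = list(method = "HAM"),
                           parallel = FALSE)

  # Cluster quality indices estimated on 3 subsamples of 10 sequences each.
  bcq <- bootclustrange(bfclara, biofam.seq,
                        seqdist.args = list(method = "HAM"),
                        R = 3, sample.size = 10, parallel = FALSE)

  print(bcq, digits = 2, bootstat = "mean")
  plot(bcq, stat = "noCH", ci.method = "perc", line = "median")
}
```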

References

Studer, M., R. Sadeghi and L. Tochon (2024). Sequence Analysis for Large Databases. LIVES Working Papers 104. doi:10.12682/lives.2296-1658.2024.104

See Also

See as.clustrange for the list of cluster quality indices that are computed, and seqclararange for an example of use.