This function provides a partition to summarize a partition distribution
using the SALSO greedy search method (Dahl, Johnson, and Müller, 2022). The
implementation currently supports the minimization of several partition
estimation criteria. For details on these criteria, see partition.loss.
salso(
x,
loss = VI(),
maxNClusters = 0,
nRuns = 16,
maxZealousAttempts = 10,
probSequentialAllocation = 0.5,
nCores = 0,
...
)
An integer vector giving the estimated partition, encoded using cluster labels.
A \(B\)-by-\(n\) matrix, where each of the \(B\) rows represents a
clustering of \(n\) items using cluster labels. For the \(b\)th clustering,
items \(i\) and \(j\) are in the same cluster if x[b,i] == x[b,j].
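For concreteness, the following sketch (an illustration added here, not one of the package's examples) builds a toy matrix in this format and checks the cluster-label relation:

library(salso)
# A toy B = 3 by n = 5 matrix of clusterings; the values are illustrative only.
x <- rbind(c(1, 1, 2, 2, 3),
           c(1, 2, 2, 2, 3),
           c(1, 1, 1, 2, 2))
x[1, 1] == x[1, 2]  # TRUE: items 1 and 2 share a cluster in clustering b = 1
salso(x, loss = VI(), nRuns = 1, nCores = 1)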
The loss function to use, as indicated by "binder", "omARI", "VI", "NVI",
"ID", "NID", or the result of calling a function with these names. Also
supported are "binder.psm", "VI.lb", "omARI.approx", or the result of
calling a function with these names, in which case x above can optionally
be a pairwise similarity matrix, i.e., an \(n\)-by-\(n\) symmetric matrix
whose \((i,j)\) element gives the (estimated) probability that items \(i\)
and \(j\) are in the same subset (i.e., cluster) of a partition (i.e.,
clustering). The loss functions "binder.psm", "VI.lb", and "omARI.approx"
are generally not recommended, and the current implementation requires that
maxZealousAttempts = 0 and probSequentialAllocation = 1.0.
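As a hedged sketch of the pairwise similarity route (assuming, as suggested by the "See Also" section, that psm() computes the pairwise similarity matrix and accepts an nCores argument):

library(salso)
data(iris.clusterings)
sim <- psm(iris.clusterings, nCores = 1)  # n-by-n pairwise similarity matrix
salso(sim, loss = "VI.lb", maxZealousAttempts = 0,
      probSequentialAllocation = 1.0, nCores = 1)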
The maximum number of clusters that can be considered by the optimization
algorithm, which has important implications for the interpretability of the
resulting clustering and can greatly influence the RAM needed for the
optimization algorithm. If the supplied value is zero and x is a matrix of
clusterings, the optimization is constrained by the maximum number of
clusters among the clusterings in x. If the supplied value is zero and x is
a pairwise similarity matrix, there is no constraint.
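For illustration (a usage sketch, not one of the package's examples), the search can be capped at, say, three clusters:

library(salso)
data(iris.clusterings)
salso(iris.clusterings, loss = VI(), maxNClusters = 3, nRuns = 1, nCores = 1)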
The number of runs to try, although the actual number may differ for the
following reasons: 1. the actual number is a multiple of the number of
cores specified by the nCores argument, and 2. the search is curtailed when
the seconds threshold is exceeded.
The maximum number of attempts for zealous updates, in which entire clusters are destroyed and items are sequentially reallocated. While zealous updates may be helpful in optimization, they also take more CPU time, which might be better spent on additional runs.
For the initial allocation, the probability of sequential allocation
instead of using sample(1:K, ncol(x), TRUE), where K is set according to
the maxNClusters argument.
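As a sketch of the non-sequential alternative described above (K = 6 is an arbitrary choice made here for illustration):

library(salso)
data(iris.clusterings)
K <- 6
sample(1:K, ncol(iris.clusterings), replace = TRUE)  # one random initial allocation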
The number of CPU cores to use, i.e., the number of simultaneous runs at any given time. A value of zero indicates that all cores on the system should be used.
Extra arguments not intended for the end user, including: 1. seconds:
Instead of performing all of the requested runs, curtail the search after
the specified expected number of seconds. Note that the function will
finish earlier if all the requested runs are completed. The specified
seconds does not account for the overhead involved in starting the search
and returning results. 2. maxScans: The maximum number of full reallocation
scans. The actual number of scans may be less than maxScans since the
method stops if the result does not change between scans. 3.
probSingletonsInitialization: When doing a sequential allocation to obtain
the initial allocation, the probability of placing the first maxNClusters
randomly-selected items in singleton subsets.
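As a hedged sketch (these arguments are documented as not intended for end users, so treat this only as an illustration of passing them through ...), a time budget could be supplied like this:

library(salso)
data(iris.clusterings)
salso(iris.clusterings, loss = VI(), nRuns = 16, nCores = 1, seconds = 5)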
D. B. Dahl, D. J. Johnson, and P. Müller (2022), Search Algorithms and Loss Functions for Bayesian Clustering, Journal of Computational and Graphical Statistics, 31(4), 1189-1201, doi:10.1080/10618600.2022.2069779.
partition.loss, psm, summary.salso.estimate, dlso
# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
draws <- iris.clusterings
salso(draws, loss=VI(), nRuns=1, nCores=1)
salso(draws, loss=VI(a=0.7), nRuns=1, nCores=1)
salso(draws, loss=binder(), nRuns=1, nCores=1)
salso(iris.clusterings, binder(a=NULL), nRuns=4, nCores=1)
salso(iris.clusterings, binder(a=list(nClusters=3)), nRuns=4, nCores=1)