This function provides a partition to summarize a partition distribution
using the SALSO greedy search method (Dahl, Johnson, and Müller, 2022). The
implementation currently supports the minimization of several partition
estimation criteria. For details on these criteria, see partition.loss.
salso(
x,
loss = VI(),
maxNClusters = 0,
nRuns = 16,
maxZealousAttempts = 10,
probSequentialAllocation = 0.5,
nCores = 0,
...
)
An integer vector giving the estimated partition, encoded using cluster labels.
A \(B\)-by-\(n\) matrix, where each of the \(B\) rows represents a
clustering of \(n\) items using cluster labels. For the \(b\)th clustering,
items \(i\) and \(j\) are in the same cluster if x[b,i] == x[b,j].
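For concreteness, the following sketch (an illustration added here, not one of the package's examples) builds a toy matrix in this format and checks the cluster-label relation:

library(salso)
# A toy B = 3 by n = 5 matrix of clusterings; the values are illustrative only.
x <- rbind(c(1, 1, 2, 2, 3),
           c(1, 2, 2, 2, 3),
           c(1, 1, 1, 2, 2))
x[1, 1] == x[1, 2]  # TRUE: items 1 and 2 share a cluster in clustering b = 1
salso(x, loss = VI(), nRuns = 1, nCores = 1)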
The loss function to use, as indicated by "binder", "omARI", "VI", "NVI",
"ID", "NID", or the result of calling a function with these names. Also
supported are "binder.psm", "VI.lb", "omARI.approx", or the result of
calling a function with these names, in which case x above can optionally
be a pairwise similarity matrix, i.e., an \(n\)-by-\(n\) symmetric matrix
whose \((i,j)\) element gives the (estimated) probability that items \(i\)
and \(j\) are in the same subset (i.e., cluster) of a partition (i.e.,
clustering). The loss functions "binder.psm", "VI.lb", and "omARI.approx"
are generally not recommended, and the current implementation requires that
maxZealousAttempts = 0 and probSequentialAllocation = 1.0.
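As a hedged sketch of the pairwise similarity route (assuming, as suggested by the "See Also" section, that psm() computes the pairwise similarity matrix and accepts an nCores argument):

library(salso)
data(iris.clusterings)
sim <- psm(iris.clusterings, nCores = 1)  # n-by-n pairwise similarity matrix
salso(sim, loss = "VI.lb", maxZealousAttempts = 0,
      probSequentialAllocation = 1.0, nCores = 1)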
The maximum number of clusters that can be considered by the optimization
algorithm, which has important implications for the interpretability of the
resulting clustering and can greatly influence the RAM needed for the
optimization algorithm. If the supplied value is zero and x is a matrix of
clusterings, the optimization is constrained by the maximum number of
clusters among the clusterings in x. If the supplied value is zero and x is
a pairwise similarity matrix, there is no constraint.
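For illustration (a usage sketch, not one of the package's examples), the search can be capped at, say, three clusters:

library(salso)
data(iris.clusterings)
salso(iris.clusterings, loss = VI(), maxNClusters = 3, nRuns = 1, nCores = 1)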
The number of runs to try, although the actual number may differ for the
following reasons: 1. the actual number is a multiple of the number of
cores specified by the nCores argument, and 2. the search is curtailed when
the seconds threshold is exceeded.
The maximum number of attempts for zealous updates, in which entire clusters are destroyed and items are sequentially reallocated. While zealous updates may be helpful in optimization, they also take more CPU time, which might be better spent on additional runs.
For the initial allocation, the probability of sequential allocation
instead of using sample(1:K, ncol(x), TRUE), where K is set according to
the maxNClusters argument.
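As a sketch of the non-sequential alternative described above (K = 6 is an arbitrary choice made here for illustration):

library(salso)
data(iris.clusterings)
K <- 6
sample(1:K, ncol(iris.clusterings), replace = TRUE)  # one random initial allocation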
The number of CPU cores to use, i.e., the number of simultaneous runs at any given time. A value of zero indicates that all cores on the system should be used.
Extra arguments not intended for the end user, including: 1. seconds:
Instead of performing all of the requested runs, curtail the search after
the specified expected number of seconds. Note that the function will
finish earlier if all the requested runs are completed. The specified
seconds does not account for the overhead involved in starting the search
and returning results. 2. maxScans: The maximum number of full reallocation
scans. The actual number of scans may be less than maxScans since the
method stops if the result does not change between scans. 3.
probSingletonsInitialization: When doing a sequential allocation to obtain
the initial allocation, the probability of placing the first maxNClusters
randomly-selected items in singleton subsets.
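As a hedged sketch (these arguments are documented as not intended for end users, so treat this only as an illustration of passing them through ...), a time budget could be supplied like this:

library(salso)
data(iris.clusterings)
salso(iris.clusterings, loss = VI(), nRuns = 16, nCores = 1, seconds = 5)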
D. B. Dahl, D. J. Johnson, and P. Müller (2022), Search Algorithms and Loss Functions for Bayesian Clustering, Journal of Computational and Graphical Statistics, 31(4), 1189-1201, doi:10.1080/10618600.2022.2069779.
partition.loss, psm, summary.salso.estimate, dlso
# For examples, use 'nCores=1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
draws <- iris.clusterings
salso(draws, loss=VI(), nRuns=1, nCores=1)
salso(draws, loss=VI(a=0.7), nRuns=1, nCores=1)
salso(draws, loss=binder(), nRuns=1, nCores=1)
salso(iris.clusterings, binder(a=NULL), nRuns=4, nCores=1)
salso(iris.clusterings, binder(a=list(nClusters=3)), nRuns=4, nCores=1)