This function estimates a partition of a subset of items that has high marginal probability, based on samples from a partition distribution, using the conditional high inclusion probability subset (CHIPS) partition greedy search method (Barrientos, Page, Dahl, Dunson, 2024).
chips(
  partitions,
  threshold = 0,
  nRuns = 64,
  intermediateResults = identical(threshold, 0),
  allCandidates = FALSE,
  andSALSO = !intermediateResults && !allCandidates,
  loss = VI(a = 1),
  maxNClusters = 0,
  initialPartition = integer(0),
  nCores = 0
)
A list containing:
chips_partition: If intermediateResults is FALSE, an integer vector giving the estimated subset partition, encoded using cluster labels with -1 indicating not allocated. If TRUE, an integer matrix with intermediate subset partitions in the rows.
n_items: Number of items in the estimated subset partition.
probability: Monte Carlo estimate of the probability of the subset partition.
auc: If intermediateResults is TRUE, this element is provided and gives the area under the probability curve as a function of the number of clusters after scaling to be between 0 and 1.
chips_and_salso_partition: If andSALSO is TRUE, this element is provided and is an integer vector giving the estimated partition of all items, based on CHiPS until the threshold is met and using SALSO to allocate the rest.
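To make the return encoding concrete, the following base R sketch uses a hand-made label vector in the same format as chips_partition (it is not actual output from chips) together with a toy matrix of sampled partitions. The helpers same_partition and subset_probability are hypothetical and not part of the package. They assume the Monte Carlo estimate is the fraction of sampled partitions that agree with the subset partition on the allocated items, which may differ from how the package computes it internally.

```r
# Hand-made vector in the format of `chips_partition`: -1 means "not allocated".
subset_partition <- c(1L, 1L, 2L, -1L, 2L, -1L)

# Items that are part of the subset partition, and how many there are.
allocated <- which(subset_partition != -1L)
n_items <- length(allocated)

# Toy samples (made up): each row is one clustering of the six items.
samples <- rbind(
  c(1L, 1L, 2L, 2L, 2L, 3L),
  c(2L, 2L, 1L, 3L, 1L, 1L),
  c(1L, 1L, 1L, 2L, 1L, 2L),
  c(1L, 1L, 2L, 1L, 2L, 2L)
)

# Hypothetical helper: two label vectors encode the same partition when
# their pairwise co-clustering matrices are identical.
same_partition <- function(a, b) {
  all(outer(a, a, "==") == outer(b, b, "=="))
}

# Hypothetical helper (assumption, not the package's code): fraction of
# samples whose restriction to the allocated items equals the subset partition.
subset_probability <- function(samples, subset_partition) {
  keep <- subset_partition != -1L
  target <- subset_partition[keep]
  mean(apply(samples, 1, function(row) same_partition(row[keep], target)))
}

probability <- subset_probability(samples, subset_partition)
```

Because only label equality matters, the second sample (labels 2, 2, 1, 1 on the allocated items) counts as agreeing with the subset partition even though its label values differ.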
partitions: A \(B\)-by-\(n\) matrix, where each of the \(B\) rows represents a clustering of \(n\) items using cluster labels. For the \(b\)th clustering, items \(i\) and \(j\) are in the same cluster if partitions[b, i] == partitions[b, j].
threshold: The minimum marginal probability for the subpartition. Values closer to 1.0 will yield a partition of fewer items and values closer to 0.0 will yield a partition of more items.
nRuns: The number of runs to try, where the best result is returned.
intermediateResults: Should intermediate subset partitions be returned?
allCandidates: Should all the final subset partitions from multiple runs be returned?
andSALSO: Should the resulting incomplete partition be completed using SALSO?
loss: When andSALSO = TRUE, the loss function to use, as indicated by "binder", "VI", or the result of calling a function with these names (which permits unequal costs).
maxNClusters: The maximum number of clusters that can be considered by SALSO, which has important implications for the interpretability of the resulting clustering and can greatly influence the RAM needed for the optimization algorithm. If the supplied value is zero, the optimization is constrained by the maximum number of clusters among the clusterings in partitions.
initialPartition: A vector of length \(n\) containing cluster labels for items that are initially clustered or \(-1\) for items that are not initially clustered. As a special case, a vector of length 0 is equivalent to a vector of length \(n\) with \(-1\) for all values.
nCores: The number of CPU cores to use, i.e., the number of simultaneous runs at any given time. A value of zero indicates to use all cores on the system.
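The input encodings described above can be illustrated with a small base R sketch. The matrix and vectors are made up for illustration, and the helpers together and expand_initial are hypothetical. Only the conventions come from the descriptions above: rows are clusterings, equal labels mean the same cluster, and -1 means not initially clustered.

```r
# B = 3 sampled clusterings (rows) of n = 4 items (columns).
partitions <- rbind(
  c(1L, 1L, 2L, 2L),
  c(1L, 1L, 1L, 2L),
  c(3L, 3L, 7L, 7L)  # label values are arbitrary; only equality matters
)

# Hypothetical helper: are items i and j in the same cluster in clustering b?
together <- function(b, i, j) partitions[b, i] == partitions[b, j]

# Fraction of the sampled clusterings in which items 1 and 2 share a cluster.
share_12 <- mean(partitions[, 1] == partitions[, 2])

# An initial partition in the described format: items 1 and 2 start in the
# same cluster, items 3 and 4 are not initially clustered.
initial_partition <- c(1L, 1L, -1L, -1L)

# Hypothetical helper showing the documented special case: a length-0
# vector is treated as a length-n vector filled with -1.
expand_initial <- function(ip, n) {
  if (length(ip) == 0) rep(-1L, n) else ip
}
```

For instance, together(1, 1, 2) is TRUE and together(2, 3, 4) is FALSE for the toy matrix above.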
# For examples, use 'nCores = 1' per CRAN rules, but in practice omit this.
data(iris.clusterings)
draws <- iris.clusterings
all <- chips(draws, nRuns = 1, nCores = 1)
plot(all$n_items, all$probability)
x <- chips(draws, threshold = 0.5, nCores = 1)
table(x$chips_partition)
which(x$chips_partition != -1)
x