chips: CHIPS Partition Greedy Search

Description

This function provides a partition of a subset of items that has high marginal probability, based on samples from a partition distribution, using the conditional high inclusion probability subset (CHIPS) partition greedy search method (Barrientos, Page, Dahl, Dunson, 2024).

Usage

chips(
  partitions,
  threshold = 0,
  nRuns = 64,
  intermediateResults = identical(threshold, 0),
  allCandidates = FALSE,
  andSALSO = !intermediateResults && !allCandidates,
  loss = VI(a = 1),
  maxNClusters = 0,
  initialPartition = integer(0),
  nCores = 0
)

Value

A list containing:

chips_partition: If intermediateResults is FALSE, an integer vector giving the estimated subset partition, encoded using cluster labels with -1 indicating not allocated. If TRUE, an integer matrix with intermediate subset partitions in the rows.
n_items: Number of items in the estimated subset partition.
probability: Monte Carlo estimate of the probability of the subset partition.
auc: If intermediateResults is TRUE, this element is provided and gives the area under the probability curve as a function of the number of clusters after scaling to be between 0 and 1.
chips_and_salso_partition: If andSALSO is TRUE, this element is provided and is an integer vector giving the estimated partition of all items, using CHIPS until the threshold is met and SALSO to allocate the remaining items.
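
To make the n_items and probability elements concrete, here is a minimal base R sketch. It assumes that the Monte Carlo estimate is the fraction of sampled partitions that cluster the allocated items exactly as the subset partition does; the helper subset_partition_probability and the toy data are hypothetical illustrations and are not part of the package.

```r
# Hypothetical helper (not part of salso): fraction of sampled partitions
# that agree with the subset partition on the allocated items.
subset_partition_probability <- function(partitions, subset_partition) {
  allocated <- which(subset_partition != -1)
  if (length(allocated) == 0) return(1)  # an empty subset agrees with every draw
  # Pairwise co-clustering pattern implied by the subset partition.
  target <- outer(subset_partition[allocated], subset_partition[allocated], "==")
  agrees <- apply(partitions[, allocated, drop = FALSE], 1, function(labels) {
    # Compare co-clustering patterns, so label values themselves do not matter.
    all(outer(labels, labels, "==") == target)
  })
  mean(agrees)
}

# Toy samples: 4 draws (rows) of a partition of 4 items (columns).
toy_draws <- rbind(
  c(1, 1, 2, 2),
  c(1, 1, 2, 3),
  c(2, 2, 1, 1),
  c(1, 2, 2, 2)
)

# Items 1 and 2 together, item 4 alone, item 3 not allocated (-1).
toy_subset <- c(1, 1, -1, 2)

toy_n_items <- sum(toy_subset != -1)
toy_probability <- subset_partition_probability(toy_draws, toy_subset)
```

Here three of the four draws place items 1 and 2 together and item 4 apart from them, so toy_probability is 0.75 and toy_n_items is 3.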

Arguments

partitions: A \(B\)-by-\(n\) matrix, where each of the \(B\) rows represents a clustering of \(n\) items using cluster labels. For the \(b\)th clustering, items \(i\) and \(j\) are in the same cluster if partitions[b, i] == partitions[b, j].
threshold: The minimum marginal probability required of the subset partition. Values closer to 1.0 yield a partition of fewer items, and values closer to 0.0 yield a partition of more items.
nRuns: The number of runs to try, where the best result is returned.
intermediateResults: Should intermediate subset partitions be returned?
allCandidates: Should all the final subset partitions from multiple runs be returned?
andSALSO: Should the resulting incomplete partition be completed using SALSO?
loss: When andSALSO = TRUE, the loss function to use, as indicated by "binder", "VI", or the result of calling a function with these names (which permits unequal costs).
maxNClusters: The maximum number of clusters that can be considered by SALSO, which has important implications for the interpretability of the resulting clustering and can greatly influence the RAM needed for the optimization algorithm. If the supplied value is zero, the optimization is constrained by the maximum number of clusters among the clusterings in partitions.
initialPartition: A vector of length \(n\) containing cluster labels for items that are initially clustered or \(-1\) for items that are not initially clustered. As a special case, a vector of length 0 is equivalent to a vector of length \(n\) with \(-1\) for all values.
nCores: The number of CPU cores to use, i.e., the number of simultaneous runs at any given time. A value of zero uses all cores on the system.
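
The encodings expected by partitions and initialPartition can be illustrated with a small base R sketch. The toy matrix and the variable names below are made up for illustration only.

```r
# A 3-by-5 matrix: B = 3 sampled clusterings (rows) of n = 5 items (columns).
toy_partitions <- rbind(
  c(1, 1, 2, 2, 3),
  c(1, 1, 1, 2, 2),
  c(2, 2, 1, 1, 1)
)

# In the first clustering, items 1 and 2 share a label, so they are co-clustered.
together_in_first <- toy_partitions[1, 1] == toy_partitions[1, 2]

# Labels are only meaningful within a row; what matters is co-clustering.
# Fraction of clusterings in which items 3 and 4 are in the same cluster:
freq_3_4 <- mean(toy_partitions[, 3] == toy_partitions[, 4])

# initialPartition with nothing clustered yet: all -1 ...
n <- ncol(toy_partitions)
initial_none <- rep(-1L, n)
# ... which the documentation treats as equivalent to integer(0).

# initialPartition with items 1 and 2 already placed together in cluster 1.
initial_some <- c(1L, 1L, -1L, -1L, -1L)
```

A vector such as initial_some would be passed as the initialPartition argument to start the search from a partly specified subset partition.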

Examples

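# For examples, use 'nCores = 1' per CRAN rules, but in practice omit this.
library(salso)
data(iris.clusterings)
draws <- iris.clusterings

# With the default threshold = 0, intermediate results are returned,
# so n_items and probability can be plotted against each other.
path <- chips(draws, nRuns = 1, nCores = 1)
plot(path$n_items, path$probability)

# Require the subset partition to have probability of at least 0.5.
x <- chips(draws, threshold = 0.5, nCores = 1)
table(x$chips_partition)          # cluster sizes; -1 counts unallocated items
which(x$chips_partition != -1)    # indices of the allocated items
x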
