salso: Sequentially-Allocated Latent Structure Optimization

Description

This function provides a point estimate for a partition distribution using the sequentially-allocated latent structure optimization (SALSO) method. The method seeks to minimize the expectation of the Binder loss or the lower bound of the expectation of the variation of information loss. The SALSO method was presented at the workshop "Bayesian Nonparametric Inference: Dependence Structures and their Applications" in Oaxaca, Mexico on December 6, 2017. See <https://www.birs.ca/events/2017/5-day-workshops/17w5060/schedule>.

Usage

salso(
  psm,
  loss = c("VI.lb", "binder")[1],
  maxSize = 0,
  batchSize = 100,
  seconds = Inf,
  maxScans = 10,
  probExplorationProbAtZero = 0.5,
  probExplorationShape = 0.5,
  probExplorationRate = 50,
  parallel = TRUE
)

Arguments

psm

A pairwise similarity matrix, i.e., an n-by-n symmetric matrix whose (i,j) element gives the (estimated) probability that items i and j are in the same subset (i.e., cluster) of a partition (i.e., clustering).
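
As an illustration of what such a matrix contains, the following base R sketch estimates it from a small, made-up set of sampled partitions (rows are samples, columns are items). The toy data and the variable names are assumptions for illustration only; the package's own psm function, used in the Examples section, is the intended way to compute this.

```r
# Hypothetical toy data: 4 sampled partitions of 4 items, one partition per row.
clusterings <- rbind(
  c(1, 1, 2, 2),
  c(1, 1, 1, 2),
  c(1, 2, 2, 2),
  c(1, 1, 2, 2)
)

n <- ncol(clusterings)
similarity <- matrix(0, n, n)

# For each sampled partition, add 1 to every pair of items sharing a label.
for (s in seq_len(nrow(clusterings))) {
  labels <- clusterings[s, ]
  similarity <- similarity + outer(labels, labels, "==")
}

# Dividing by the number of samples gives the estimated co-clustering probabilities.
similarity <- similarity / nrow(clusterings)
print(similarity)
```

The result is symmetric with ones on the diagonal, since every item is always clustered with itself.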

loss

Either "VI.lb" or "binder", to indicate the desired loss function.
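
To make the two criteria concrete, here is a minimal base R sketch that evaluates them for a candidate partition, given a pairwise similarity matrix. It assumes the usual equal-cost form of the Binder loss (Binder, 1978) and the Jensen-type lower bound on the expected variation of information described by Wade and Ghahramani (2018). The function names expectedBinder and viLowerBound are hypothetical, and the scaling may differ from the expectedLoss value that salso reports.

```r
# Expected Binder loss (equal costs), summed over pairs i < j:
# |1(labels_i == labels_j) - psm_ij|.
expectedBinder <- function(labels, psm) {
  sameCluster <- outer(labels, labels, "==")
  upper <- upper.tri(psm)
  sum(abs(sameCluster[upper] - psm[upper]))
}

# Lower bound on the expected variation of information (in bits), assuming
# the form: mean over i of
#   log2(#{j in cluster of i}) + log2(sum_j psm_ij)
#     - 2 * log2(sum_j psm_ij * 1(j in cluster of i)).
viLowerBound <- function(labels, psm) {
  n <- length(labels)
  sameCluster <- outer(labels, labels, "==")
  sum(
    log2(rowSums(sameCluster)) +
      log2(rowSums(psm)) -
      2 * log2(rowSums(sameCluster * psm))
  ) / n
}

# Toy similarity matrix with two perfectly separated blocks: {1, 2} and {3, 4}.
toyPsm <- matrix(0, 4, 4)
toyPsm[1:2, 1:2] <- 1
toyPsm[3:4, 3:4] <- 1

expectedBinder(c(1, 1, 2, 2), toyPsm)  # partition matching the blocks
expectedBinder(c(1, 1, 1, 1), toyPsm)  # everything merged into one cluster
viLowerBound(c(1, 1, 2, 2), toyPsm)
viLowerBound(c(1, 1, 1, 1), toyPsm)
```

Under both criteria the partition that matches the blocks scores zero, while merging all items is penalized.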

maxSize

The maximum number of subsets (i.e., clusters). The optimization is constrained to produce solutions whose number of subsets is no more than the supplied value. If zero, the size is not constrained.

batchSize

The number of permutations to consider per batch (although the actual number of permutations per batch is a multiple of the number of cores when parallel=TRUE). Batches are performed sequentially until the most recent batch does not lead to a better result. Therefore, at least two batches are performed (unless the seconds threshold is exceeded).

seconds

A time threshold in seconds after which the function will be curtailed (with a warning) instead of performing another batch of permutations. Note that the function could take considerably longer because the threshold is only checked after each batch is completed.
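
The stopping rule described for batchSize and seconds can be sketched as the loop below. This is only an outline of the control flow under stated assumptions: runBatches and its runBatch argument (a function that performs one batch and returns the best loss found in it) are hypothetical names, not part of the package.

```r
# Keep running batches until a batch fails to improve on the best loss so far,
# checking the time threshold only after each batch has completed.
runBatches <- function(runBatch, seconds = Inf) {
  start <- proc.time()[["elapsed"]]
  best <- Inf
  nBatches <- 0L
  curtailed <- FALSE
  repeat {
    batchLoss <- runBatch()
    nBatches <- nBatches + 1L
    improved <- batchLoss < best
    best <- min(best, batchLoss)
    if (!improved) break  # the most recent batch did not help: stop
    if (proc.time()[["elapsed"]] - start > seconds) {
      curtailed <- TRUE
      warning("time threshold exceeded; search curtailed")
      break
    }
  }
  list(best = best, nBatches = nBatches, curtailed = curtailed)
}

# Hypothetical stand-in for a batch: returns pre-set losses one at a time.
makeBatch <- function(losses) {
  i <- 0L
  function() {
    i <<- i + 1L
    losses[i]
  }
}

result <- runBatches(makeBatch(c(5, 3, 3, 1)))
```

Because the first batch always improves on the initial value of Inf, a second batch is always attempted unless the time threshold has already been exceeded, which is why at least two batches are normally performed.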

maxScans

The maximum number of reallocation scans after the initial allocation. The actual number of scans may be less than maxScans since the method stops if the result does not change between scans.
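
The idea of an initial sequential allocation followed by reallocation scans can be illustrated with the simplified greedy sketch below. It is not the package's implementation: it uses only the Binder criterion, always takes the best micro-optimization (no exploration), handles a single permutation, and ignores maxSize. The names binderCost, canonical, and greedyAllocate are hypothetical.

```r
# Cost (over pairs involving item i) of giving item i the label `label`,
# counting only the other items that have already been allocated.
binderCost <- function(i, label, labels, psm) {
  others <- which(!is.na(labels) & seq_along(labels) != i)
  if (length(others) == 0) return(0)
  sum(abs((labels[others] == label) - psm[i, others]))
}

# Relabel clusters as 1, 2, ... in order of first appearance.
canonical <- function(labels) match(labels, unique(labels))

greedyAllocate <- function(psm, permutation = seq_len(nrow(psm)), maxScans = 10) {
  n <- nrow(psm)
  labels <- rep(NA_integer_, n)

  # Best label for item i among the existing clusters and one new cluster.
  bestLabel <- function(i) {
    existing <- unique(labels[!is.na(labels) & seq_len(n) != i])
    candidates <- c(existing, max(c(0L, existing)) + 1L)
    costs <- vapply(
      candidates,
      function(k) binderCost(i, k, labels, psm),
      numeric(1)
    )
    candidates[which.min(costs)]
  }

  # Initial allocation: place items one at a time in permutation order.
  for (i in permutation) labels[i] <- bestLabel(i)

  # Reallocation scans: revisit every item; stop early if nothing changes.
  nScans <- 0L
  while (nScans < maxScans) {
    previous <- canonical(labels)
    for (i in permutation) labels[i] <- bestLabel(i)
    nScans <- nScans + 1L
    if (identical(canonical(labels), previous)) break
  }
  list(estimate = canonical(labels), nScans = nScans)
}

# Toy similarity matrix with two perfectly separated blocks: {1, 2, 3} and {4, 5}.
blockPsm <- matrix(0, 5, 5)
blockPsm[1:3, 1:3] <- 1
blockPsm[4:5, 4:5] <- 1
fit <- greedyAllocate(blockPsm)
```

On this toy input the initial allocation already recovers the two blocks, so the first scan changes nothing and the loop stops well before maxScans is reached.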

probExplorationProbAtZero

The probability of the point mass at zero for the spike-and-slab distribution of the probability of exploration, i.e., the probability of picking the second best micro-optimization (instead of the best). The probability of exploration is randomly sampled for (and constant within) each permutation.

probExplorationShape

The shape of the gamma distribution for the slab in the spike-and-slab distribution of the probability of exploration.

probExplorationRate

The rate of the gamma distribution for the slab in the spike-and-slab distribution of the probability of exploration.
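
The three probExploration arguments together describe a spike-and-slab distribution, which can be sketched in base R as follows. The function name sampleProbExploration is hypothetical, and capping the gamma draw at 1 is an assumption made here so that the result is always a valid probability; how the package treats draws above 1 is not stated in this documentation.

```r
# Draw one probability of exploration:
# with probability probAtZero return exactly 0 (the spike),
# otherwise draw from a gamma distribution (the slab).
sampleProbExploration <- function(probAtZero = 0.5, shape = 0.5, rate = 50) {
  if (runif(1) < probAtZero) {
    return(0)
  }
  # Assumption: cap at 1 so the value can be used as a probability.
  min(1, rgamma(1, shape = shape, rate = rate))
}

set.seed(1)
draws <- replicate(2000, sampleProbExploration())
mean(draws == 0)  # share of permutations with no exploration at all
summary(draws[draws > 0])
```

With the default shape of 0.5 and rate of 50, the slab has mean 0.01, so even permutations that do explore pick the second best micro-optimization only occasionally.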

parallel

Should the search use all CPU cores?

Value

A list of the following elements:

estimate: An integer vector giving a partition encoded using cluster labels.
loss: A character vector equal to the loss argument.
expectedLoss: A numeric vector of length one giving the expected loss.
nScans: An integer vector giving the number of scans used to arrive at the supplied estimate.
probExploration: The probability of picking the second best micro-optimization (instead of the best) for the permutation yielding the supplied estimate.
nPermutations: An integer giving the number of permutations actually performed.
batchSize: An integer giving the number of permutations per batch.
curtailed: A logical indicating whether the search was cut short because the time exceeded the threshold.

References

D. A. Binder (1978), Bayesian cluster analysis, Biometrika 65, 31-38.

D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge University Press.

J. W. Lau and P. J. Green (2007), Bayesian model based clustering procedures, Journal of Computational and Graphical Statistics 16, 526-558.

D. B. Dahl and M. A. Newton (2007), Multiple Hypothesis Testing by Clustering Treatment Effects, Journal of the American Statistical Association, 102, 517-526.

A. Fritsch and K. Ickstadt (2009), An improved criterion for clustering based on the posterior similarity matrix, Bayesian Analysis, 4, 367-391.

S. Wade and Z. Ghahramani (2018), Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis, 13:2, 559-626.

Examples


# Use 'parallel=FALSE' per CRAN rules for examples but, in practice, omit this.
probs <- psm(iris.clusterings, parallel=FALSE)
salso(probs, parallel=FALSE)
