salso: Sequentially-Allocated Latent Structure Optimization

Description

This function provides a point estimate for a partition distribution using the sequentially-allocated latent structure optimization (SALSO) method. The method seeks to minimize the expectation of the Binder loss or the lower bound of the expectation of the variation of information loss. The SALSO method was presented at the workshop "Bayesian Nonparametric Inference: Dependence Structures and their Applications" in Oaxaca, Mexico on December 6, 2017. See <https://www.birs.ca/events/2017/5-day-workshops/17w5060/schedule>.

Usage

salso(
  psm,
  loss = c("VI.lb", "binder")[1],
  maxSize = 0,
  maxScans = 5,
  nPermutations = 5000,
  probExploration = 0.005,
  seconds = 10,
  parallel = TRUE
)

Arguments

psm

A pairwise similarity matrix, i.e., an n-by-n symmetric matrix whose (i,j) element gives the (estimated) probability that items i and j are in the same subset (i.e., cluster) of a partition (i.e., clustering).
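As a rough illustration of what such a matrix contains, the sketch below estimates one in base R from a small matrix of sampled clusterings. The helper name pairwise_similarity and the sample labels are made up for this example and are not part of the package.

```r
# Hypothetical helper (not part of the package): estimate a pairwise
# similarity matrix from a matrix of sampled clusterings, where each row is
# one clustering and each column is one item.
pairwise_similarity <- function(clusterings) {
  n <- ncol(clusterings)
  result <- matrix(0, nrow = n, ncol = n)
  for (s in seq_len(nrow(clusterings))) {
    labels <- clusterings[s, ]
    # outer() is TRUE where items i and j share a label in this sample.
    result <- result + outer(labels, labels, "==")
  }
  # Average over samples to get estimated co-clustering probabilities.
  result / nrow(clusterings)
}

# Three made-up clusterings of four items.
samples <- rbind(
  c(1, 1, 2, 2),
  c(1, 1, 1, 2),
  c(1, 2, 2, 2)
)
probs <- pairwise_similarity(samples)
probs
```

The result is symmetric with ones on the diagonal, and each off-diagonal entry is the fraction of samples in which the two items share a label.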

loss

Either "VI.lb" or "binder", to indicate the desired loss function.
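To make the Binder option more concrete, the following base R sketch computes the expected Binder loss of a candidate partition from a pairwise similarity matrix, assuming equal misclassification costs and no normalization. The helper expected_binder and the 3-by-3 matrix are hypothetical, and the scaling used internally by the package may differ.

```r
# Hypothetical helper: expected Binder loss (equal costs, unnormalized) of a
# candidate partition, given a pairwise similarity matrix.
expected_binder <- function(labels, similarity) {
  # TRUE where the candidate partition places items i and j together.
  together <- outer(labels, labels, "==")
  # Each unordered pair of items is counted once.
  upper <- upper.tri(similarity)
  sum(abs(together[upper] - similarity[upper]))
}

# Made-up similarity matrix for three items.
similarity <- matrix(c(1.0, 0.9, 0.1,
                       0.9, 1.0, 0.2,
                       0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)

loss_a <- expected_binder(c(1, 1, 2), similarity)  # items 1 and 2 together
loss_b <- expected_binder(c(1, 2, 2), similarity)  # items 2 and 3 together
c(loss_a, loss_b)
```

The partition that groups the two items with high co-clustering probability has the smaller expected loss, which is the kind of candidate the search prefers.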

maxSize

Either zero or a positive integer. If a positive integer, the optimization is constrained to produce solutions whose number of subsets (i.e., clusters) is no more than the supplied value. If zero, the size is not constrained.

maxScans

The maximum number of reallocation scans after the initial allocation. The actual number of scans may be less than maxScans since the method stops if the result does not change between scans.

nPermutations

The desired number of permutations to consider when searching for the minimizer.

probExploration

The expected probability of picking the second best micro-optimization (instead of the best). For a given permutation, the probability is sampled from a beta distribution with shape parameters 1 and 1/probExploration.
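The beta distribution described here can be explored directly in base R. This is only a sketch of the sampling step as stated above, not the package's internal code; the seed and the number of draws are arbitrary.

```r
probExploration <- 0.005

# Draw many per-permutation exploration probabilities from the stated
# beta distribution (shape parameters 1 and 1/probExploration).
set.seed(1)
draws <- rbeta(10000, shape1 = 1, shape2 = 1 / probExploration)

# Mean of a beta(a, b) distribution is a / (a + b).
theoretical_mean <- 1 / (1 + 1 / probExploration)

c(empirical = mean(draws), theoretical = theoretical_mean)
```

With these shape parameters the mean is probExploration / (1 + probExploration), which is close to probExploration when the value is small.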

seconds

A time threshold in seconds after which the function will return early (with a warning) instead of finishing all the desired permutations. The function may run considerably longer than this threshold, however, because the threshold is only checked after each permutation is completed.
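The consequence of checking the threshold only between permutations can be seen in a generic base R loop. The function run_with_budget and its arguments are hypothetical and only mimic the behavior described above; they are not taken from the package.

```r
# Hypothetical sketch: run up to n_tasks tasks, checking the time budget
# only after each task finishes, so a slow task can overrun the budget.
run_with_budget <- function(n_tasks, seconds, task) {
  start <- proc.time()[["elapsed"]]
  done <- 0
  for (i in seq_len(n_tasks)) {
    task(i)
    done <- i
    if (proc.time()[["elapsed"]] - start > seconds) {
      warning("Time budget exceeded; returning early.")
      break
    }
  }
  done
}

# With a generous budget, every task runs.
completed_all <- run_with_budget(3, seconds = Inf, task = function(i) sqrt(i))

# With an already exhausted budget, one task still runs before the check.
completed_one <- suppressWarnings(
  run_with_budget(3, seconds = -1, task = function(i) sqrt(i))
)
c(completed_all, completed_one)
```

Because the check happens after the work, at least one task always completes, and the total run time can exceed the budget by the duration of the task in progress.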

parallel

Should the search use all CPU cores?

Value

A list of the following elements:

estimate: An integer vector giving a partition encoded using cluster labels.
loss: A character vector equal to the loss argument.
expectedLoss: A numeric vector of length one giving the expected loss.
nScans: An integer vector giving the number of scans used to arrive at the supplied estimate.
nPermutations: An integer vector giving the number of permutations actually performed.

References

D. A. Binder (1978), Bayesian cluster analysis, Biometrika 65, 31-38.

D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge University Press.

J. W. Lau and P. J. Green (2007), Bayesian model based clustering procedures, Journal of Computational and Graphical Statistics 16, 526-558.

D. B. Dahl, M. A. Newton (2007), Multiple Hypothesis Testing by Clustering Treatment Effects, Journal of the American Statistical Association, 102, 517-526.

A. Fritsch and K. Ickstadt (2009), An improved criterion for clustering based on the posterior similarity matrix, Bayesian Analysis, 4, 367-391.

S. Wade and Z. Ghahramani (2018), Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis, 13:2, 559-626.

Examples

# NOT RUN {
probs <- psm(iris.clusterings, parallel=FALSE)
salso(probs, nPermutations=50, parallel=FALSE)

# }