This function provides a point estimate for a partition distribution using the sequentially-allocated latent structure optimization (SALSO) method. The method seeks to minimize the expectation of the Binder loss or the lower bound of the expectation of the variation of information loss. The SALSO method was presented at the workshop "Bayesian Nonparametric Inference: Dependence Structures and their Applications" in Oaxaca, Mexico on December 6, 2017. See <https://www.birs.ca/events/2017/5-day-workshops/17w5060/schedule>.
salso(
  psm,
  loss = c("VI.lb", "binder")[1],
  maxSize = 0,
  maxScans = 5,
  nPermutations = 5000,
  probExploration = 0.005,
  seconds = 10,
  parallel = TRUE
)
A pairwise similarity matrix, i.e., an n-by-n symmetric matrix whose (i,j) element gives the (estimated) probability that items i and j are in the same subset (i.e., cluster) of a partition (i.e., clustering).
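To make the definition concrete, the following base-R sketch estimates such a matrix from a small matrix of sampled clusterings (one sampled partition per row, one item per column). The helper name `pairwise_similarity` and the toy data are made up for illustration; they are not part of the package, whose own `psm` function (used in the example at the end of this page) is the intended tool.

```r
# Hypothetical helper (not part of the package): estimate the pairwise
# similarity matrix from sampled partitions stored one per row.
pairwise_similarity <- function(clusterings) {
  n <- ncol(clusterings)
  out <- matrix(0, n, n)
  for (s in seq_len(nrow(clusterings))) {
    labels <- clusterings[s, ]
    # outer() marks every pair (i, j) that shares a label in this sample.
    out <- out + outer(labels, labels, "==")
  }
  # Relative frequency with which each pair is clustered together.
  out / nrow(clusterings)
}

# Toy data (made up): 4 sampled partitions of 3 items.
draws <- rbind(c(1, 1, 2),
               c(1, 1, 1),
               c(1, 2, 2),
               c(1, 1, 2))
probs <- pairwise_similarity(draws)
```

In this toy example, items 1 and 2 share a cluster in three of the four samples, so the (1,2) element is 0.75, and the diagonal is always 1.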
Either "VI.lb" or "binder", to indicate the desired loss function.
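As an illustration of what is being minimized under "binder", the sketch below computes the expected Binder loss of a candidate partition from a pairwise similarity matrix, assuming unit misclassification costs and counting each unordered pair once. The helper name `expected_binder` and the toy matrix are assumptions made for this example; the package may scale or normalize the reported loss differently, and the "VI.lb" criterion is not shown here.

```r
# Hypothetical helper (not part of the package): expected Binder loss of
# `labels` under `psm`, summed over unordered pairs i < j with unit costs.
expected_binder <- function(labels, psm) {
  same <- outer(labels, labels, "==")   # TRUE where items share a cluster
  upper <- upper.tri(psm)               # count each pair once
  # A pair placed together costs 1 - psm; a pair kept apart costs psm.
  sum(abs(same[upper] - psm[upper]))
}

# Toy similarity matrix (made up) for 3 items.
probs <- matrix(c(1.00, 0.75, 0.25,
                  0.75, 1.00, 0.50,
                  0.25, 0.50, 1.00), nrow = 3, byrow = TRUE)

loss_pair <- expected_binder(c(1, 1, 2), probs)  # items 1 and 2 together
loss_one  <- expected_binder(c(1, 1, 1), probs)  # everything in one cluster
```

Here the partition that groups only items 1 and 2 has the smaller expected loss (1.0 versus 1.5), so it would be preferred under this criterion.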
Either zero or a positive integer. If a positive integer, the optimization is constrained to produce solutions whose number of subsets (i.e., clusters) is no more than the supplied value. If zero, the size is not constrained.
The maximum number of reallocation scans after the initial allocation. The actual number of scans may be less than maxScans since the method stops if the result does not change between scans.
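The idea of an initial sequential allocation followed by a bounded number of reallocation scans can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: it uses the Binder criterion, a single fixed item order instead of many random permutations, no exploration step, and hypothetical names (`best_label`, `sketch_scans`, `max_scans`).

```r
# Hypothetical sketch (not the package's code). Under Binder loss, joining a
# cluster lowers the expected loss when sum(psm[i, members] - 0.5) > 0.
best_label <- function(i, labels, psm) {
  others <- setdiff(which(!is.na(labels)), i)   # items already allocated
  candidates <- unique(labels[others])
  # Gain of joining each existing cluster, relative to opening a new one.
  gains <- vapply(candidates, function(k) {
    sum(psm[i, others[labels[others] == k]] - 0.5)
  }, numeric(1))
  if (length(gains) > 0 && max(gains) > 0) {
    candidates[which.max(gains)]
  } else {
    max(c(0, labels[others])) + 1               # open a new cluster
  }
}

sketch_scans <- function(psm, order = seq_len(nrow(psm)), max_scans = 5) {
  canonical <- function(x) match(x, unique(x))  # relabel by first appearance
  labels <- rep(NA_real_, nrow(psm))
  for (i in order) labels[i] <- best_label(i, labels, psm)  # initial allocation
  labels <- canonical(labels)
  scans <- 0
  while (scans < max_scans) {
    before <- labels
    for (i in order) labels[i] <- best_label(i, labels, psm)  # reallocate
    labels <- canonical(labels)
    scans <- scans + 1
    if (identical(labels, before)) break        # unchanged: stop early
  }
  list(labels = labels, scans = scans)
}

# Toy similarity matrix (made up): two well-separated pairs of items.
probs <- matrix(0.1, 4, 4)
probs[1:2, 1:2] <- 0.9
probs[3:4, 3:4] <- 0.9
diag(probs) <- 1
res <- sketch_scans(probs)
```

With this toy matrix the initial allocation already separates the two pairs, so the first scan changes nothing and the loop stops well before `max_scans` is reached.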
The desired number of permutations to consider when searching for the minimizer.
The expected probability of picking the second best micro-optimization (instead of the best). For a given permutation, the probability is sampled from a beta distribution with shape parameters 1 and 1/probExploration.
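The sampling step described above can be reproduced with the base-R function `rbeta`. The sketch below draws one exploration probability per (hypothetical) permutation and compares the average with the theoretical mean of a beta distribution with shapes 1 and 1/probExploration, which is probExploration / (1 + probExploration) and therefore very close to probExploration for small values. The seed and the number of draws are arbitrary choices for illustration.

```r
probExploration <- 0.005
set.seed(1)  # arbitrary seed, only so the illustration is reproducible

# One exploration probability per permutation (10000 hypothetical permutations).
p <- rbeta(10000, shape1 = 1, shape2 = 1 / probExploration)

# Mean of a Beta(a, b) distribution is a / (a + b); here a = 1, b = 1/probExploration.
theoretical_mean <- 1 / (1 + 1 / probExploration)

# Monte Carlo average of the sampled probabilities, for comparison.
sample_mean <- mean(p)
```

Most sampled probabilities are tiny, so the best micro-optimization is chosen nearly always, with occasional permutations that explore more.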
A time threshold in seconds after which the function will return early (with a warning) instead of finishing all the desired permutations. Note that the function could take considerably longer, however, because this threshold is only checked after each permutation is completed.
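The consequence of checking the threshold only between permutations can be seen in the following sketch. It is an assumption-laden illustration rather than the package's code: `run_with_deadline` and `one_permutation` are hypothetical names, and `Sys.sleep` stands in for the work done in one permutation.

```r
# Hypothetical sketch (not the package's code): the clock is consulted only
# after a permutation finishes, so at least one permutation always completes.
run_with_deadline <- function(n_permutations, seconds, one_permutation) {
  start <- Sys.time()
  done <- 0
  for (k in seq_len(n_permutations)) {
    one_permutation(k)                      # cannot be interrupted midway
    done <- done + 1
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed > seconds && done < n_permutations) {
      warning("Time threshold reached; returning early.")
      break
    }
  }
  done                                      # permutations actually performed
}

# Each stand-in permutation takes about 0.02 seconds; the threshold is 0.05.
done <- suppressWarnings(
  run_with_deadline(1000, seconds = 0.05, function(k) Sys.sleep(0.02))
)
```

Because the check happens after the work, the total run time can exceed the threshold by up to the duration of one permutation, which is the behavior the note above warns about.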
Should the search use all CPU cores?
A list of the following elements:
An integer vector giving a partition encoded using cluster labels.
A character vector equal to the loss argument.
A numeric vector of length one giving the expected loss.
An integer vector giving the number of scans used to arrive at the supplied estimate.
An integer vector giving the number of permutations actually performed.
D. A. Binder (1978), Bayesian cluster analysis, Biometrika 65, 31-38.
D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, Marina Vannucci (Eds.), Cambridge University Press.
J. W. Lau and P. J. Green (2007), Bayesian model based clustering procedures, Journal of Computational and Graphical Statistics 16, 526-558.
D. B. Dahl and M. A. Newton (2007), Multiple Hypothesis Testing by Clustering Treatment Effects, Journal of the American Statistical Association, 102, 517-526.
A. Fritsch and K. Ickstadt (2009), An improved criterion for clustering based on the posterior similarity matrix, Bayesian Analysis, 4, 367-391.
S. Wade and Z. Ghahramani (2018), Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis, 13:2, 559-626.
# NOT RUN {
probs <- psm(iris.clusterings, parallel=FALSE)
salso(probs, nPermutations=50, parallel=FALSE)
# }