profBinary: Binary Product Partition Models

Description

This function finds the most probable cluster partition in a binary product partition model (PPM). The Dirichlet process mixture of binary models is the default PPM.

Usage

profBinary(formula, data, clust, param, method="agglomerative", maxiter=1000, crit=1e-6, verbose=FALSE, sampler=FALSE)

Arguments

formula

a one-sided formula specifying a set of binary response variables.

data

a dataframe where formula is evaluated

clust

optional vector of factors (or coercible to factors) indicating initial clustering among observations.

param

optional list containing the any of the named elements alpha, a0, and b0 corresponding to the prior parameters of the beta-binary Dirichlet process mixture. The prior parameters of the beta-binary Dirichlet process mixture should all be scalars.

method

character string indicating the optimization method to be used. Meaningful values for this string are "stochastic", "gibbs", "agglomerative", and "none".

The "stochastic" method is an iterative stochastic search utilizing `explode' and `merge' operations on the clusters of a partition. At the explode step, a randomly selected subset of observations are redistributed uniformly at random to an existing or new cluster. Each of the exploded observations are then merged with an existing cluster in a sequentially optimal fashion. Optimization involves computing a moving average of the relative change in the marginal posterior distribution over the possible clusters after each iteration. The optimization stopping criterion is the minumim value this quantity can take before stopping the optimization cycle. If the optimization cycle reaches the maximum allowable iterations before meeting the stopping criterion, a warning is issued.
The "gibbs" method implements the Polya urn Gibbs sampler. This method draws samples from the posterior distribution over the cluster partition in a sequential Gibbs fashion. The sample value with greatest posterior mass is returned. See MacEachern (1994) for details.
The "agglomerative" method initially places each observation into seperate clusters. At each iteration, two of the remaining clusters are merged, where the merged clusters are chosen such that the resulting increase in the posterior mass function is maximized. This is repeated until only one cluster remains. The MAP estimate is the cluster partition, among those considered, which maximizes the posterior mass function over the possible cluster partitions. See Ward (1963) for additional details.
The "fast" method is a modified version of Sequential Update and Greedy Search (SUGS). This SUGS algorithm assigns observations to clusters sequentially. Initally, the first observation forms a singleton cluster. Subsequent observations are assigned to an existing cluster, or form a new singleton cluster by optimizing the associated posterior probabilities, conditional on previous cluster assignments. See Wang and Dunson (2010) for additional details.
The "none" method is typically used in conjunction with the clust option to specify an initial cluster partition. If "none" is specified without clust, a simple algorithm is used to initialize the cluster partition. Otherwise, the cluster partition is initialized using the clust argument. The posterior statistics are then computed for initialized clusters.

maxiter

integer value specifying the maximum number of iterations for the optimization algorithm.

crit

numeric scalar constituting a stopping criterion for the "stochastic" and "gibbs" optimization methods.

verbose

logical value indicating whether the routine should be verbose in printing.

sampler

for the "gibbs" method, return the last sampled value instead of the MAP estimate

Value

y: the numeric matrix of observations, where rows with missing observations (NA) are removed
param: the list of prior parameters
clust: a numeric vector of integers indicating cluster membership for each non-missing observation
a: a list of numeric vectors containing the posterior vector $a$ for each cluster
b: a list of numeric vectors containing the posterior vector $b$ for each cluster
logp: the unnormalized log value of the marginal posterior mass function for the cluster partition evaluated at clust
model: a model frame, resulting from a call to model.frame

Details

This function fits a Dirichlet process mixture of binary models (DPMBM) using the profile method. This method will cluster binary observations vectors (rows of y) into clusters. The cluster partition is estimated by maximizing the marginal posterior distribution over all possible cluster partitions. Each cluster has an associated binary model. The binary model assigns Bernoulli probabilities independently to each binary valued outcome, corresponding to the columns of y. The prior parameters a0 and b0 assign a beta prior distribution to each outcome probability. Conditional on the estimated cluster partition, each outcome probability is beta distributed a posteriori. The function profBinary returns the associated posterior parameters of the beta destribution for each cluster and outcome probability.

Missing observations (NA) are removed automatically and a warning is issued. The return value contains the reduced observation matrix.

References

Matthew S. Shotwell (2013). profdpm: An R Package for MAP Estimation in a Class of Conjugate Product Partition Models. Journal of Statistical Software, 53(8), 1-18. URL http://www.jstatsoft.org/v53/i08/.

Ward, J. H. (1963) Heirarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58:236-244 MacEachern, S. N. (1994) Estimating Normal Means with Conjugate Style Dirichlet Process Prior. Communications in Statistics B 23:727-741

Examples

Run this code

library(profdpm)
set.seed(42)

# simulate two clusters of multivariate binary data
p <- seq(0.9,0.1,length.out=3)
y1 <- matrix(rbinom(333, 1, p), 111, 3, TRUE)
y2 <- matrix(rbinom(333, 1, rev(p)), 111, 3, TRUE)
dat <- as.data.frame(rbind(y1, y2))

# fit the PPM
fitb <- profBinary(~0+., data=dat)

# plot the data ordered by cluster
image(t(as.matrix(fitb$model)[order(fitb$clust),]),
      xaxt="n", yaxt="n", col=0:1)
axis(3, labels=paste("V", 1:3, sep=""), at=0:2/2)

# plot the data ordered and colored by cluster
image(t(as.matrix(fitb$model) * fitb$clust)[, order(fitb$clust)],
      xaxt="n", yaxt="n", col=0:length(unique(fitb$clust)))
axis(3, labels=paste("V", 1:3, sep=""), at=0:2/2)

Run the code above in your browser using DataLab