
profdpm (version 3.3)

profLinear: Linear Product Partition Models

Description

This function finds the most probable cluster partition in a linear product partition model (PPM). The Dirichlet process mixture of linear models is the default PPM.

Usage

profLinear(formula, data, group, clust, param, method="agglomerative", maxiter=1000, crit=1e-6, verbose=FALSE, sampler=FALSE)

Arguments

formula
a formula specifying the outcome and covariates of the linear model.
data
a data frame in which formula is evaluated.
group
optional vector of factors (or coercible to factors) indicating grouping among observations. Observations that are grouped will be clustered together. This is useful if several values form a single longitudinal observation.
clust
optional vector of factors (or coercible to factors) indicating initial clustering among observations. Grouped observations (see group) must have the same clust value.
param
optional list containing any of the named elements alpha, a0, b0, m0, and s0, corresponding to the prior parameters of the normal-gamma Dirichlet process mixture. The prior parameters should all be scalars, except m0, which should be a vector of length equal to the number of columns in the design matrix (a brief sketch of constructing such a list is given after the argument descriptions).
method
character string indicating the optimization method to be used. Meaningful values for this string are "stochastic", "gibbs", "agglomerative", "fast", and "none".
  • The "stochastic" method is an iterative stochastic search utilizing `explode' and `merge' operations on the clusters of a partition. At the explode step, a randomly selected subset of observations are redistributed uniformly at random to an existing or new cluster. Each of the exploded observations are then merged with an existing cluster in a sequentially optimal fashion. Optimization involves computing a moving average of the relative change in the marginal posterior distribution over the possible clusters after each iteration. The optimization stopping criterion is the minumim value this quantity can take before stopping the optimization cycle. If the optimization cycle reaches the maximum allowable iterations before meeting the stopping criterion, a warning is issued.

  • The "gibbs" method implements the Polya urn Gibbs sampler. This method draws samples from the posterior distribution over the cluster partition in a sequential Gibbs fashion. The sample value with greatest posterior mass is returned. See MacEachern(1994) for details.

  • The "agglomerative" method initially places each observation into seperate clusters. At each iteration, two of the remaining clusters are merged, where the merged clusters are chosen such that the resulting increase in the posterior mass function is maximized. This is repeated until only one cluster remains. The MAP estimate is the cluster partition, among those considered, which maximizes the posterior mass function over the possible cluster partitions. See Ward (1963) for additional details.

  • The "fast" method is a modified version of Sequential Update and Greedy Search (SUGS). This SUGS algorithm assigns observations to clusters sequentially. Initally, the first observation forms a singleton cluster. Subsequent observations are assigned to an existing cluster, or form a new singleton cluster by optimizing the associated posterior probabilities, conditional on previous cluster assignments. See Wang and Dunson (2010) for additional details.

  • The "none" method is typically used in conjunction with the clust option to specify an initial cluster partition. If "none" is specified without clust, a simple algorithm is used to initialize the cluster partition. Otherwise, the cluster partition is initialized using the clust argument. The posterior statistics are then computed for initialized clusters.

maxiter
integer value specifying the maximum number of iterations for the optimization algorithm.
crit
numeric scalar constituting a stopping criterion for the "stochastic" and "gibbs" optimization methods.
verbose
logical value indicating whether the routine should print progress information.
sampler
for the "gibbs" method, return the last sampled value instead of the MAP estimate

Value

An instance of the class profLinear containing the following objects
y
the numeric outcome vector, where missing observations (NA) are removed
x
the numeric design matrix, where missing covariates (NA) are removed
group
the grouping vector, where missing group values (NA) are removed
param
the list of prior parameters
clust
a numeric vector of integers indicating cluster membership for each non-missing observation
a
a numeric vector containing the posterior parameter $a$ for each cluster
b
a numeric vector containing the posterior parameter $b$ for each cluster
m
a list of numeric vectors containing the posterior vector $m$ for each cluster
s
a list of numeric matrices containing the posterior matrix $S$ for each cluster
logp
the unnormalized log value of the marginal posterior mass function for the cluster partition evaluated at clust
model
a model frame, resulting from a call to model.frame
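
For example, given a fitted object fit such as the one constructed in the Examples section below, these components can be inspected directly:

table(fit$clust)   # cluster sizes
fit$logp           # unnormalized log posterior mass of the returned partition
fit$m[[1]]         # posterior mean coefficient vector for the first cluster
fit$a[1]           # posterior gamma shape parameter for the first cluster
fit$b[1]           # posterior gamma scale parameter for the first cluster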

Details

This function fits a Dirichlet process mixture of linear models (DPMLM) using the profile method. This method groups the observations into clusters, which are determined by maximizing the marginal posterior distribution over the space of possible cluster partitions. Each cluster has an associated linear model. Notationally, the linear model for cluster $ k $ has the form $$ y = \gamma^{\prime} x + \epsilon, $$ where $ y $ and $ x $ are the observation vector and covariate matrix for a particular cluster, $ \epsilon $ has a multivariate normal distribution with mean zero and precision matrix $ \tau I $, and $ \gamma $ is the vector of linear coefficients. In the DPMLM, conditional on the clustering, $ \gamma $ and $ \tau $ have a joint normal-gamma posterior distribution of the form $$ p( \gamma, \tau | y, x ) = N( \gamma | m, \tau S ) G( \tau | a, b ), $$ where $ N() $ is the multivariate normal density function with mean vector $ m $ and precision matrix $ \tau S $, and $ G() $ is the gamma density function with shape and scale parameters $ a $ and $ b $. In addition to the cluster indicators, the posterior quantities $ S $, $ m $, $ a $, and $ b $ are provided for each cluster in the return value.

Missing observations (NA) are removed automatically and a warning is issued.
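
As a sketch of how these posterior quantities might be summarized for a single cluster of a fitted object fit (such as the one in the Examples below), and assuming the shape/scale parametrization stated above so that the posterior mean of $ \tau $ is $ a b $:

k <- 1                          # summarize the first cluster
coef_k <- fit$m[[k]]            # posterior mean of the coefficient vector gamma
tau_k  <- fit$a[k] * fit$b[k]   # posterior mean of the error precision tau (shape * scale)
sd_k   <- 1 / sqrt(tau_k)       # implied estimate of the residual standard deviation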

References

Matthew S. Shotwell (2013). profdpm: An R Package for MAP Estimation in a Class of Conjugate Product Partition Models. Journal of Statistical Software, 53(8), 1-18. URL http://www.jstatsoft.org/v53/i08/.

Ward, J. H. (1963) Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58:236-244.

MacEachern, S. N. (1994) Estimating Normal Means with a Conjugate Style Dirichlet Process Prior. Communications in Statistics B 23:727-741.

See Also

pci

Examples

library(profdpm)
set.seed(42)

# set up some data
# linear model 0
x0  <- rnorm(50, 0, 3)
y0  <- x0 + rnorm(50, 0, 1) 

# linear model 1
x1  <- rnorm(50, 0, 3)
y1  <- 10 - x1 + rnorm(50, 0, 1)

# combine the covariate and outcome into a single data frame
# (the intercept column is added by the model formula)
dat <- data.frame(x=c(x0, x1), y=c(y0,y1))

# indicate grouping within each linear model
grp <- c(rep(seq(0,4),10), rep(seq(5,9),10))

# fit the DPMLM
fit <- profLinear(y ~ x, data=dat, group=grp)

# plot the resulting fit(s)
plot(dat$x, dat$y, xlab='x', ylab='y')
for(i in 1:length(fit$m)) {
  abline( a=fit$m[[i]][1], b=fit$m[[i]][2] )
}
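
# The following lines are an added illustration (not part of the original
# example): color observations by estimated cluster membership and compare
# the clustering to the two generating models. No data are missing here,
# so the rows of dat align with fit$clust.
plot(dat$x, dat$y, col = as.integer(factor(fit$clust)), xlab = 'x', ylab = 'y')
truth <- rep(c(0, 1), each = 50)
table(fit$clust, truth)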
