LaplacesDemon (version 16.1.0)

PMC: Population Monte Carlo

Description

The PMC function updates a model with Population Monte Carlo. Given a model specification, data, and initial values, PMC maximizes the logarithm of the unnormalized joint posterior density and provides samples of the marginal posterior distributions, deviance, and other monitored variables.

Usage

PMC(Model, Data, Initial.Values, Covar=NULL, Iterations=10, Thinning=1,
     alpha=NULL, M=1, N=1000, nu=9, CPUs=1, Type="PSOCK")

Arguments

Model

This is a model specification function. For more information, see LaplacesDemon.

Initial.Values

This is either a vector of initial values, one for each of \(K\) parameters, or, in the case of a mixture of \(M\) components, an \(M \times K\) matrix of initial values. If all initial values are zero in this vector, or in the first row of the matrix, then LaplaceApproximation is used to optimize initial values, in which case all mixture components receive the same initial values and covariance matrix from the object of class laplace. Parameters must be continuous.

Data

This is a list of data. For more information, see LaplacesDemon.

Covar

This is a \(K \times K\) covariance matrix for \(K\) parameters, or for multiple mixture components, this is a \(K \times K \times M\) array of \(M\) covariance matrices, where \(M\) is the number of mixture components. Covar defaults to NULL, in which case a scaled identity matrix (with the same scale as in LaplacesDemon) is applied to all mixture components.
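
As an illustrative sketch (the names K, M, and CovLA below are hypothetical, not objects created by PMC), such an array may be built by replicating a single covariance matrix, for example one obtained from LaplaceApproximation, across the mixture components:

# Hedged sketch: build a K x K x M covariance array for the Covar argument
# by recycling one K x K matrix across M mixture components.
K <- 5; M <- 3
CovLA <- diag(K)                       # placeholder K x K covariance matrix
CovarArray <- array(CovLA, dim=c(K, K, M))
dim(CovarArray)                        # K x K x M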

Iterations

This is the number of iterations during which PMC will update the model. Updating the model for only one iteration is the same as applying non-adaptive importance sampling.

Thinning

This is the number by which the posterior is thinned. To have 1,000 posterior samples with M=3 mixture components and N=10000 samples each, Thinning=30. For more information, see the Thin function.
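
The arithmetic behind that example can be checked directly in base R (this is only arithmetic, not a call to PMC):

# Thinned posterior samples retained = M * N / Thinning
M <- 3; N <- 10000; Thinning <- 30
(M * N) / Thinning                     # 1000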

alpha

This is a vector of length \(M\), the number of mixture components. \(\alpha\) is the probability of each mixture component. The default value is NULL, which assigns an equal probability to each component.

M

This is the number \(M\) of multivariate t distribution mixture components.

N

This is the number \(N\) of samples per mixture component. The required number of samples increases with the number \(K\) of parameters. These samples are also called walkers or particles.

nu

This is the degrees of freedom parameter \(\nu\) for the multivariate t distribution for each mixture component. If a multivariate normal distribution is preferred, then set \(\nu > 1e4\).

CPUs

This argument is required for parallel processing, and indicates the number of central processing units (CPUs) of the computer or cluster. For example, when a user has a quad-core computer, CPUs=4.

Type

This argument defaults to "PSOCK" and uses the Simple Network of Workstations (SNOW) for parallelization. Alternatively, Type="MPI" may be specified to use Message Passing Interface (MPI) for parallelization.

Value

The returned object is an object of class pmc with the following components:

alpha

This is an \(M \times T\) matrix of the probabilities of mixture components, where \(M\) is the number of mixture components and \(T\) is the number of iterations.

Call

This is the matched call of PMC.

Covar

This stores the \(K \times K \times T \times M\) proposal covariance matrix in an array, where \(K\) is the dimension or number of parameters or initial values, \(T\) is the number of iterations, and \(M\) is the number of mixture components. If the model is updated in the future, then the latest covariance matrix for each mixture component can be extracted and used to start the next update where the last update left off.
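
As a hedged sketch (assuming a fitted object named Fit from an earlier run with M=1, such as the one in the Examples section), the covariance and means from the final iteration may be extracted and supplied to a subsequent call:

# Hedged sketch: continue updating from where a previous pmc fit left off.
T.iter <- dim(Fit$Covar)[3]                 # number of stored iterations
Covar.last <- Fit$Covar[ , , T.iter, ]      # K x K matrix when M=1
Mu.last <- Fit$Mu[T.iter, , ]               # latest means, length K when M=1
Fit2 <- PMC(Model, MyData, Mu.last, Covar=Covar.last, Iterations=5,
     Thinning=1, alpha=NULL, M=1, N=100, CPUs=1)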

Deviance

This is a vector of the deviance of the model, with a length equal to the number of thinned samples that were retained. Deviance is useful for considering model fit, and is equal to the sum of the log-likelihood for all rows in the data set, which is then multiplied by negative two.

DIC

This is a vector of three values: Dbar, pD, and DIC. Dbar is the mean deviance, pD is a measure of model complexity indicating the effective number of parameters, and DIC is the Deviance Information Criterion, which is a model fit statistic that is the sum of Dbar and pD. DIC is calculated over the thinned samples. Note that pD is calculated as var(Deviance)/2 as in Gelman et al. (2004).
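
The relationship between these three values can be reproduced from the Deviance vector of a fitted object (a hedged sketch; Fit refers to an object such as the one in the Examples section):

# Hedged sketch: reproduce Dbar, pD, and DIC from the thinned deviance samples.
Dbar <- mean(Fit$Deviance)             # mean deviance
pD   <- var(Fit$Deviance) / 2          # effective number of parameters
DIC  <- Dbar + pD                      # Deviance Information Criterion
c(Dbar=Dbar, pD=pD, DIC=DIC)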

ESSN

This is a vector of length \(T\) that contains the normalized effective sample size (ESSN) per iteration across \(T\) iterations. ESSN is used as a convergence diagnostic. ESSN is normalized between zero and one, and can be interpreted as the proportion of samples with non-zero weight. Higher is better.
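
A common definition of the normalized effective sample size, computed from a vector of normalized importance weights (for example, one column of the returned W matrix), is sketched below; the function name ESSN.fun is hypothetical:

# Hedged sketch: normalized effective sample size from normalized weights w.
ESSN.fun <- function(w) 1 / (length(w) * sum(w^2))
ESSN.fun(rep(1/100, 100))              # equal weights give the maximum, 1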

Initial.Values

This is the vector or matrix of Initial.Values.

Iterations

This reports the number of Iterations for updating.

LML

This is an approximation of the logarithm of the marginal likelihood of the data (see the LML function for more information). LML is estimated with nonparametric self-normalized importance sampling (NSIS), given LL and the marginal posterior samples of the parameters. LML is useful for comparing multiple models with the BayesFactor function.

M

This reports the number of mixture components.

Minutes

This indicates the number of minutes that PMC was running, including the initial checks as well as the time it took to perform final sampling and create summaries.

Model

This contains the model specification Model.

N

This is the number of un-thinned samples per mixture component.

nu

This is the degrees of freedom parameter \(\nu\) for each multivariate t distribution in each mixture component.

Mu

This is a \(T \times K \times M\) array of means for the importance sampling distribution across \(T\) iterations, \(K\) parameters, and \(M\) mixture components.

Monitor

This is an \(S \times J\) matrix of thinned samples of monitored variables, where \(S\) is the number of thinned samples and \(J\) is the number of monitored variables.

Parameters

This reports the number \(K\) of parameters.

Perplexity

This is a vector of length \(T\) that contains the normalized perplexity per iteration across \(T\) iterations, and is used as a convergence diagnostic. Perplexity is an approximation of the negative of the Kullback-Leibler divergence (see KLD) between the target and the importance function. Perplexity is normalized between zero and one, and a higher normalized perplexity relates to less divergence, so higher is better. A normalized perplexity that is close to one indicates good agreement between the target density and the importance function. This is based on the Shannon entropy of the normalized importance weights, which is used frequently to measure the quality of importance samples.
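
The quantity described above corresponds to the exponentiated Shannon entropy of the normalized weights, divided by the number of samples. A hedged sketch (the function name and the toy weights are hypothetical):

# Hedged sketch: normalized perplexity from normalized importance weights w.
perplexity <- function(w) {
     H <- -sum(ifelse(w > 0, w * log(w), 0))   # Shannon entropy of the weights
     exp(H) / length(w)                        # normalized to (0, 1]
     }
perplexity(rep(1/100, 100))                    # equal weights give 1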

Posterior1

This is an \(N \times K \times T \times M\) array of un-thinned posterior samples across \(N\) samples, \(K\) parameters, \(T\) iterations, and \(M\) mixture components.

Posterior2

This is an \(S \times K\) matrix of thinned posterior samples, where \(S\) is the number of thinned samples and \(K\) is the number of parameters.

Summary

This is a matrix that summarizes the marginal posterior distributions of the parameters, deviance, and monitored variables from the thinned samples. The following summary statistics are included: mean, standard deviation, MCSE (Monte Carlo Standard Error), ESS (the effective sample size, which accounts for autocorrelation), and the 2.5%, 50%, and 97.5% quantiles. MCSE is essentially a standard deviation around the marginal posterior mean that is due to the uncertainty associated with using Monte Carlo sampling; it is estimated here with the default IMPS method. The acceptable size of the MCSE depends on the acceptable uncertainty around the marginal posterior mean, and the desired size of ESS depends on the user's goal.

Thinned.Samples

This is the number of thinned samples in Posterior2.

Thinning

This is the amount of thinning requested by the user.

W

This is an \(N \times T\) matrix of normalized importance weights, where \(N\) is the number of un-thinned samples per mixture component and \(T\) is the number of iterations. Computationally, the algorithm uses the logarithm of the weights.
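
Working with log-weights usually involves normalizing with the log-sum-exp trick to avoid numerical underflow; a minimal sketch with hypothetical unnormalized log-weights lw:

# Hedged sketch: normalize log-weights without underflow.
lw <- c(-1000, -1001, -1002)                   # hypothetical log-weights
w  <- exp(lw - max(lw)) / sum(exp(lw - max(lw)))
sum(w)                                         # 1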

Details

The PMC function uses the adaptive importance sampling algorithm of Wraith et al. (2009), also called Mixture PMC or M-PMC (Cappe et al., 2008). Iterative adaptive importance sampling was introduced in the 1980s. Modern PMC was introduced (Cappe et al., 2004), and extended to multivariate Gaussian or t-distributed mixtures (Cappe et al., 2008). This version uses a multivariate t distribution for each mixture component, and also allows a multivariate normal distribution when the degrees of freedom, \(\nu > 1e4\). At each iteration, a mixture distribution is sampled with importance sampling, and the samples (or populations) are adapted to improve the importance sampling. Adaptation is a variant of EM (Expectation-Maximization). The sample is self-normalized, and is an example of self-normalized importance sampling (SNIS), or self-importance sampling. The vector \(\alpha\) contains the probability of each mixture component. These, as well as multivariate t distribution mixture parameters (except \(\nu\)), are adapted each iteration.
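
As a toy illustration of the self-normalized importance sampling step only (not the full M-PMC mixture adaptation), the following one-dimensional base R sketch draws from a t proposal, weights the draws against an unnormalized target, and forms a posterior-mean estimate; every object in it is hypothetical:

# Hedged, one-dimensional sketch of self-normalized importance sampling (SNIS).
log.target <- function(theta) dnorm(theta, mean=2, sd=1, log=TRUE)  # toy target
n     <- 1000
theta <- 2 + rt(n, df=9)                       # draws from a t proposal
lw    <- log.target(theta) - dt(theta - 2, df=9, log=TRUE)
w     <- exp(lw - max(lw)); w <- w / sum(w)    # self-normalized weights
sum(w * theta)                                 # SNIS estimate of the mean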

Advantages of PMC over MCMC include:

  • It is difficult to assess convergence of MCMC chains, and this is not necessary in PMC (Wraith et al., 2009).

  • MCMC chains have autocorrelation that effectively reduces posterior samples. PMC produces independent samples that are not reduced with autocorrelation.

  • PMC has been reported to produce samples with less variance than MCMC.

  • It is difficult to parallelize MCMC. Posterior samples from parallel chains can be pooled when all chains have converged, but until this occurs, parallelization is unhelpful. PMC, on the other hand, can parallelize the independent Monte Carlo samples during each iteration and reduce run-time as the number of processors increases. Here, parallelization is controlled with the CPUs and Type arguments.

  • The multivariate mixture in PMC can represent a multimodal posterior, where MCMC with parallel chains may be used to identify a multimodal posterior, but probably will not yield combined samples that proportionally represent it.

Disadvantages of PMC, compared to MCMC, include:

  • In PMC, the required number of samples at each iteration increases quickly with respect to an increase in parameters. MCMC is more suitable for models with large numbers of parameters, and therefore, MCMC is more generalizable.

  • PMC is more sensitive to initial values than MCMC, especially as the number of parameters increases.

  • PMC is more sensitive to the initial covariance matrix (or matrices for mixture components) than adaptive MCMC. PMC requires more information about the target distributions before updating. The covariance matrix from a converged iterative quadrature algorithm, Laplace Approximation, or Variational Bayes may be required (see IterativeQuadrature, LaplaceApproximation, or VariationalBayes for more information).

Since PMC requires better initial information than iterative quadrature, Laplace Approximation, MCMC, and Variational Bayes, it is not recommended to begin updating a model that has little prior information with PMC, especially when the model has more than a few parameters. Instead, iterative quadrature, Laplace Approximation, MCMC, or Variational Bayes should be used. However, once convergence is found or assumed, it is recommended to attempt to update the model with PMC, given the latest parameters and covariance matrix from iterative quadrature, Laplace Approximation, MCMC, or Variational Bayes. Used in this way, PMC may improve the model fit obtained with MCMC and should reduce the variance of the marginal posterior distributions, which is desirable for predictive modeling.

Convergence is assessed by observing two outputs: normalized effective sample size (ESSN) and normalized perplexity (Perplexity). These are described below. PMC is considered to have converged when these diagnostics stabilize (Wraith et al., 2009), or when the normalized perplexity becomes sufficiently close to 1 (Cappe et al., 2008). If they do not stabilize, then it is suggested to begin PMC again with a larger number N of samples, and possibly with different initial values and covariance matrix or matrices. IterativeQuadrature, LaplaceApproximation, or VariationalBayes may be helpful to provide better starting values for PMC.
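
A hedged sketch of inspecting these diagnostics from a fitted object (Fit as in the Examples section):

# Hedged sketch: examine the convergence diagnostics of a pmc fit named Fit.
cbind(ESSN=Fit$ESSN, Perplexity=Fit$Perplexity)   # one row per iteration
plot(Fit$Perplexity, type="l", ylim=c(0, 1),
     xlab="Iteration", ylab="Normalized Perplexity")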

If a message appears that warns about "bad weights", then PMC is attempting to work with an iteration in which importance weights are problematic. If this occurs in the first iteration, then all importance weights are set to \(1/N\). If this occurs in other iterations, then the information from the previous iteration is used instead and different draws are made from that importance distribution. This may allow PMC to eventually adapt successfully to the target. If not, the user is advised to begin again with a larger number \(N\) of samples, and possibly different initial values and covariance matrix or matrices, as above. PMC can experience difficulty when it begins with poor initial conditions.

The user may combine samples from previous iterations with samples from the latest iteration for inference, if the algorithm converged before the last iteration. Currently, a function is not provided for combining previous samples.
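
One hedged way to pool samples, shown for a single mixture component (M=1) and a fitted object named Fit, is to resample each retained iteration's un-thinned samples with its importance weights and stack the results; the choice of iterations to keep is hypothetical:

# Hedged sketch (M=1): pool the last two iterations by importance resampling.
keep <- (dim(Fit$Posterior1)[3] - 1):dim(Fit$Posterior1)[3]
pooled <- do.call(rbind, lapply(keep, function(t) {
     idx <- sample(1:nrow(Fit$W), size=nrow(Fit$W), replace=TRUE, prob=Fit$W[,t])
     Fit$Posterior1[idx, , t, 1]
     }))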

References

Cappe, O., Douc, R., Guillin, A., Marin, J.M., and Robert, C. (2008). "Adaptive Importance Sampling in General Mixture Classes". Statistics and Computing, 18, p. 587--600.

Cappe, O., Guillin, A., Marin, J.M., and Robert, C. (2004). "Population Monte Carlo". Journal of Computational and Graphical Statistics, 13, p. 907--929.

Gelman, A., Carlin, J., Stern, H., and Rubin, D. (2004). "Bayesian Data Analysis, Texts in Statistical Science, 2nd ed.". Chapman and Hall, London.

Wraith, D., Kilbinger, M., Benabed, K., Cappe, O., Cardoso, J.F., Fort, G., Prunet, S., and Robert, C.P. (2009). "Estimation of Cosmological Parameters Using Adaptive Importance Sampling". Physical Review D, 80(2), p. 023507.

See Also

BayesFactor, IterativeQuadrature, LaplaceApproximation, LML, PMC.RAM, Thin, and VariationalBayes.

Examples

# NOT RUN {
# The accompanying Examples vignette is a compendium of examples.
####################  Load the LaplacesDemon Library  #####################
library(LaplacesDemon)

##############################  Demon Data  ###############################
data(demonsnacks)
y <- log(demonsnacks$Calories)
X <- cbind(1, as.matrix(log(demonsnacks[,c(1,4,10)]+1)))
J <- ncol(X)
for (j in 2:J) X[,j] <- CenterScale(X[,j])

#########################  Data List Preparation  #########################
mon.names <- "LP"
parm.names <- as.parm.names(list(beta=rep(0,J), sigma=0))
pos.beta <- grep("beta", parm.names)
pos.sigma <- grep("sigma", parm.names)
PGF <- function(Data) {
     beta <- rnorm(Data$J)
     sigma <- runif(1)
     return(c(beta, sigma))
     }
MyData <- list(J=J, PGF=PGF, X=X, mon.names=mon.names,
     parm.names=parm.names, pos.beta=pos.beta, pos.sigma=pos.sigma, y=y)

##########################  Model Specification  ##########################
Model <- function(parm, Data)
     {
     ### Parameters
     beta <- parm[Data$pos.beta]
     sigma <- interval(parm[Data$pos.sigma], 1e-100, Inf)
     parm[Data$pos.sigma] <- sigma
     ### Log-Prior
     beta.prior <- sum(dnormv(beta, 0, 1000, log=TRUE))
     sigma.prior <- dhalfcauchy(sigma, 25, log=TRUE)
     ### Log-Likelihood
     mu <- tcrossprod(Data$X, t(beta))
     LL <- sum(dnorm(Data$y, mu, sigma, log=TRUE))
     ### Log-Posterior
     LP <- LL + beta.prior + sigma.prior
     Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
          yhat=rnorm(length(mu), mu, sigma), parm=parm)
     return(Modelout)
     }

set.seed(666)

############################  Initial Values  #############################
Initial.Values <- GIV(Model, MyData, PGF=TRUE)

########################  Population Monte Carlo  #########################
Fit <- PMC(Model, MyData, Initial.Values, Covar=NULL, Iterations=5,
     Thinning=1, alpha=NULL, M=1, N=100, CPUs=1)
Fit
print(Fit)
PosteriorChecks(Fit)
caterpillar.plot(Fit, Parms="beta")
plot(Fit, BurnIn=0, MyData, PDF=FALSE)
Pred <- predict(Fit, Model, MyData, CPUs=1)
summary(Pred, Discrep="Chi-Square")
plot(Pred, Style="Covariates", Data=MyData)
plot(Pred, Style="Density", Rows=1:9)
plot(Pred, Style="ECDF")
plot(Pred, Style="Fitted")
plot(Pred, Style="Jarque-Bera")
plot(Pred, Style="Predictive Quantiles")
plot(Pred, Style="Residual Density")
plot(Pred, Style="Residuals")
Levene.Test(Pred)
Importance(Fit, Model, MyData, Discrep="Chi-Square")

#End
# }
