MCMC_BB: Runs the Markov chain Monte Carlo with the overdispersion (Beta-Binomial) model

Description

This function initiates the Markov chain Monte Carlo (MCMC) for the beta-binomial BEDASSLE model. The beta-binomial model allows populations to diverge from the model's expectations based on their location and their neighbors.

Usage

MCMC_BB(counts, sample_sizes, D, E, k, loci, delta, aD_stp, aE_stp, a2_stp, phi_stp,
thetas_stp, mu_stp, ngen, printfreq, savefreq, samplefreq, directory = NULL, prefix = "", 
continue = FALSE, continuing.params = NULL)

Arguments

counts: A matrix of allelic count data, for which nrow = the number of populations and ncol = the number of bi-allelic loci sampled. Each cell gives the number of times allele `1' is observed in each population. The choice of which allele is allele `1' is arbitrary, but must be consistent across all populations at a locus.
sample_sizes: A matrix of sample sizes, for which nrow = the number of populations and ncol = the number of bi-allelic loci sampled (i.e. - the dimensions of sample.sizes must match those of counts). Each cell gives the number of chromosomes successfully genotyped at each locus in each population.
D: Pairwise geographic distance (\(D_{i,j}\)). This can be two-dimensional Euclidean distance, or great-circle distance, or, in fact, any positive definite matrix (deriving, for instance, from a resistance distance). However, note that the algorithm silently restricts the prior on the alpha parameters, and specifically the alpha_2 parameter, to the part of parameter space that results in valid covariance matrices; in the case of two-dimensional Euclidean distances, this will not happen, since any value of alpha_2 between 0 and 2 is valid (see Guillot et al.'s "Valid covariance models for the analysis of geographical genetic variation" for more detail on this).
E: Pairwise ecological distance(s) (\(E_{i,j}\)), which may be continuous (e.g. - difference in elevation) or binary (same or opposite side of some hypothesized barrier to gene flow). Users may specify one or more ecological distance matrices. If more than one is specified, they should be formatted as a list.
k: The number of populations in the analysis. This should be equal to nrow(counts).
loci: The number of loci in the analysis. This should be equal to ncol(counts)
delta: The size of the "delta shift" on the off-diagonal elements of the parametric covariance matrix, used to ensure its positive-definiteness (even, for example, when there are separate populations sampled at the same geographic/ecological coordinates). This value must be large enough that the covariance matrix is positive-definite, but, if possible, should be smaller than the smallest off- diagonal distance elements, lest it have an undue impact on inference. If the user is concerned that the delta shift is too large relative to the pairwise distance elements in D and E, she should run subsequent analyses, varying the size of delta, to see if it has an impact on model inference.
aD_stp: The scale of the tuning parameter on aD (alphaD). The scale of the tuning parameter is the standard deviation of the normal distribution from which small perturbations are made to those parameters updated via a random-walk sampler. A larger value of the scale of the tuning parameter will lead to, on average, larger proposed moves and lower acceptance rates (for more on acceptance rates, see plot_acceptance_rate).
aE_stp: The scale of the tuning parameter on aE (alphaE). If there are multiple ecological distances included in the analysis, there will be multiple alphaE parameters (one for each matrix in the list of E). These may be updated all with the same scale of a tuning parameter, or they can each get their own, in which case aE_stp should be a vector of length equal to the number of ecological distance variables.
a2_stp: The scale of the tuning parameter on a2 (alpha_2).
phi_stp: The scale of the tuning parameter on the phi parameters.
thetas_stp: The scale of the tuning parameter on the theta parameters.
mu_stp: The scale of the tuning parameter on mu.
ngen: The number of generations over which to run the MCMC (one parameter is updated at random per generation, with mu, theta, and phi all counting, for the purposes of updates, as one parameter).
printfreq: The frequency with which MCMC progress is printed to the screen. If printfreq =1000, an update with the MCMC generation number and the posterior probability at that generation will print to the screen every 1000 generations.
savefreq: The frequency with which the MCMC saves its output as an R object (savefreq = 50,000 means that MCMC output is saved every 50,000 generations). If ngen is large, this saving process may be computationally expensive, and so should not be performed too frequently. However, users may wish to evalute MCMC performance while the chain is still running, or may be forced to truncate runs early, and should therefore specify a savefreq that is less than ngen. We recommend a savefreq of between 1/10th and 1/20th of ngen.
samplefreq: The thinning of the MCMC chain (samplefreq = 1000 means that the parameter values saved in the MCMC output are sampled once every 1000 generations). A higher samplefreq will decrease parameter autocorrelation time. However, there is still information in autocorrelated draws from the joint posterior, so the samplefreq should be viewed merely as a computational convenience, to decrease the size of the MCMC output objects.
directory: If specified, this points to a directory into which output will be saved.
prefix: If specified, this prefix will be added to all output file names.
continue: If TRUE, this will initiate the MCMC chain from the last parameter values of a previous analysis. This option can be used to effectively increase the ngen of an initial run. If FALSE, the MCMC will be initiated from random parameter values.
continuing.params: The list of parameter values used to initiate the MCMC if continue = TRUE. If the user wants to continue an analysis on a dataset, these should be the parameter values from the last generation of the previous analysis. This list may be generated using the function make.continuing.params.

Author

Gideon Bradburd

Details

This function saves an MCMC output object at intervals specified by savefreq. This object may be ported into R working memory using the base function load.

As with any MCMC method, it is very important here to perform MCMC diagnosis and evaluate chain mixing. I have provided a number of MCMC diagnosis graphing functions for user convenience in visually assessing MCMC output. These include plot_all_trace;plot_all_marginals; plot_all_joint_marginals; and plot_all_acceptance_rates. To evaluate model adequacy, users should use posterior.predictive.sample and plot_posterior_predictive_sample. These MCMC diagnosis/model adequacy functions all call the standard MCMC output R object that the BEDASSLE MCMC generates as their principal argument.

If users wish to start another MCMC run from where the current run left off, they should use make.continuing.params, and initiate the new run with option continue = TRUE and the continuing.params list from the previous run specified.

References

Bradburd, G.S., Ralph, P.L., and Coop, G.M. Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution 2013.

Examples

Run this code

#With the HGDP dataset and mcmc operators

	data(HGDP.bedassle.data)
	data(mcmc.operators)

#The beta-binomial likelihood function may generate "NaNs produced" warnings, 
#so temporarily disable warnings.
	op <- options("warn")
	options(warn = -1)

#Call the Markov chain Monte Carlo for the overdispersion model	
if (FALSE) {
	MCMC_BB(
		counts = HGDP.bedassle.data$allele.counts,
		sample_sizes = HGDP.bedassle.data$sample.sizes,
		D = HGDP.bedassle.data$GeoDistance,
		E = HGDP.bedassle.data$EcoDistance,
		k = HGDP.bedassle.data$number.of.populations,
		loci = HGDP.bedassle.data$number.of.loci,
		delta = mcmc.operators$delta,
		aD_stp = mcmc.operators$aD_stp,
		aE_stp = mcmc.operators$aE_stp,
		a2_stp = mcmc.operators$a2_stp,
		phi_stp = mcmc.operators$phi_stp,
		thetas_stp = mcmc.operators$thetas_stp,
		mu_stp = mcmc.operators$mu_stp,
		ngen = mcmc.operators$ngen,
		printfreq = mcmc.operators$printfreq,
		savefreq = mcmc.operators$savefreq,
		samplefreq = mcmc.operators$samplefreq,
		directory = NULL,
		prefix = mcmc.operators$prefix,
		continue = FALSE,
		continuing.params = NULL
	)
}
#Re-enable warnings
	options(op)

Run the code above in your browser using DataLab