ptycho
generates samples for one design matrix and response
matrix while ptycho.all
runs in batch an object generated by
createData
.
ptycho(X, y, initStates, groups = NULL, tau.min = 0.01, tau.max = 10, tau.sd = (tau.max - tau.min)/4, doGPrior = TRUE, doDetPrior = FALSE, prob.varadd = 0.5, isOmegaFixed = FALSE, omega = NULL, omega.grp = NULL, probs.grp = NULL, rho.alpha = 10, rho.lambda = rho.alpha, only.means = FALSE, nburn = 0, nthin = 1, nSavePerChain, parallel.chains=FALSE, random.seed=NULL)
ptycho.all(data, across=c("none","traits","sites"), doGrpIndicator, dir.out, nreplicates=NULL, parallel.replicates=FALSE, doSetSeed=TRUE, ncolumns=NULL, ...)
indic.var
TRUE
, then covariate $j$ is initially in the model
for response $k$.
tau
indic.grp
indic.var
; NULL
to use priors that do
not incorporate a second-level indicator variable
var2group
group2var
sizes
Otherwise, NULL
.
tau
tau
initStates[[1]]]$indic.grp
is NULL
, the
probability that the Metropolis-Hastings proposal changes one entry of
indic.var
from FALSE
to TRUE
. Otherwise, the
probability of this event given that the proposal does not change
indic.grp
.omega
is knownisOmegaFixed
is TRUE
, a $p$-by-$q$
matrix containing the known probabilities. Otherwise, a matrix containing
the parameters for the Beta prior distribution on omega
. Such a
matrix has columns A and B; the number of rows should be:
initStates[[1]]$indic.grp
is NULL
,
length(groups$group2var)
if that is nonzero, or
If omega
is NULL
and isOmegaFixed
is FALSE
,
defaults to uniform priors.
isOmegaFixed
is TRUE
, the known probability
that entries in indic.grp
are TRUE
. Otherwise, a vector with
names A and B containing the parameters for the Beta prior
distribution on omega.grp
. If NULL
, defaults to uniform
priors. Unused if initStates[[1]]$indic.grp
is NULL
.NULL
, defaults to c(0.25,0.5,0.25)
.
Unused if initStates[[1]]$indic.grp
is NULL
.only.means
is TRUE
or a vector.only.means
is FALSE
, number of MCMC
samples to return from each chain, which means a total of
nthin * nSavePerChain + nburn
samples are drawn per chain. If
only.means
is TRUE
, then nSavePerChain + nburn
samples
are drawn, and only the averages of the last nSavePerChain
samples
are returned. Unused if only.means
is not a logical.NULL
, the
random seed is not set. See Details.createData
indic.grp
save
samples or meansNULL
, all will be runTRUE
, call set.seed(n.repl)
before running
samples.across
is none or
sites, each of the first ncolumns
of repl$y
will
be used in turn, running all columns by default. Ignored if across
is sites.ptycho
ptycho.all
are written to files by save
.
For priors that use only one response, the output for replicate $r$ and
column $c$ will be written to rpldir.out
. For priors that use multiple responses,
ptycho
is called only once for each replicate, and the file name will
be rplsmpl
and is the value of a call to ptycho
. The format of these
objects depends upon the argument only.means
. In all cases, however,
it has attribute params
set to a list containing most of the arguments
in the call to ptycho
.If only.means
is FALSE
, then ptycho
returns an
mcmc.list
whose length is the same as the length of
initStates
. Each entry in this list is an mcmc
object
with nSavePerChain
rows and a column for each entry of indic.var
and indic.grp
plus a column for tau
.Otherwise, ptycho
returns an object of class ptycho
, which is
actually a matrix. The matrix has a column for each sampled indicator
variable, for tau
and its square (so that its variance can be
computed), and for the chain and iteration numbers. If only.means
is
TRUE
, then each row contains the means of the samples in one chain and
there will be length(initStates) * nSavePerChain
rows.
If
only.means
is a vector, then there will be
length(initStates) * length(only.means)
rows.
ptycho.all
is a wrapper of ptycho
to simplify running
the simulation experiments over many replicates. These functions determine
which priors to use as follows:
ptycho.all
, argument across
is none and
doGrpIndicator
is FALSE
.
For ptycho
, argument y
has one column, groups
is
NULL
, and indic.grp
is NULL
or missing in each entry of
initStates
.
ptycho.all
, argument across
is traits and
doGrpIndicator
is TRUE
.
For ptycho
, argument y
has $p > 1$ columns, groups
is NULL
, and indic.grp
is a logical vector of length $p$
in each entry of initStates
.
ptycho.all
, argument across
is sites and
doGrpIndicator
is TRUE
.
For ptycho
, argument y
has one column, groups
specifies
how to combine information, and indic.grp
in each entry of
initStates
is a logical vector of the same length as
groups$group2var
.
ptycho.all
, argument across
is traits and
doGrpIndicator
is FALSE
.
For ptycho
, argument y
has $p > 1$ columns, groups
is NULL
, and indic.grp
is NULL
in each entry of
initStates
.
This prior does not properly correct for multiple hypothesis testing and is
only included because it is needed to reproduce results in Stell and Sabatti
(2015).
Combining information across both phenotypes and variants is planned for a
future release. These functions perform some checks for compatibility of
X
, y
, groups
, and initStates
; but invalid input
could lead to unpredictable behavior. Singular $X$ can result in an error;
even strongly correlated covariates can cause difficulties as described by Stell
and Sabatti (2015).
The simplest way to run the simulations in Stell and Sabatti (2015) is, for example,
data <- createPubData("pleiotropy") ptycho.all(data=data, across="traits", doGrpIndicator=TRUE, dir.out="/path/to/output/dir/", only.means=50000*(1:10), nburn=10000) ptycho.all(data=data, across="sites", doGrpIndicator=TRUE, dir.out="/path/to/another/dir/", groups=createGroupsSim(G=10, ncol(data$X)), only.means=50000*(1:10), nburn=10000)With these calls, the replicates run sequentially and so do the chains of the MCMC sampler; the results will be reproducible because the random seed is set for each replicate.
Parallelization is implemented via the foreach package. The user must not only have it installed but also an appropriate parallel backend, which must be registered. To run chains in parallel using the doMC, for example,
data(ptychoIn) G <- 2; p <- ncol(ptychoIn$X) groups <- createGroupsSim(G, p) state <- list(list(indic.grp=rep(FALSE,G), indic.var=matrix(FALSE,nrow=p,ncol=1), tau=1), list(indic.grp=rep(TRUE,G), indic.var=matrix(TRUE,nrow=p,ncol=1), tau=1)) require(doMC) registerDoMC(length(state)) ptychoOut <- ptycho(X=ptychoIn$X, y=ptychoIn$replicates[[1]]$y, groups=groups, initStates=state, only.means=100*seq_len(5), parallel.chains=TRUE)The results would not be reproducible, however, even if one set the random seed before calling
ptycho
. For reproducible results, pass the random seed
in the call to ptycho
, which requires that the doRNG package is
also installed. Running the chains in parallel when calling ptycho.all
also requires the option parallel.chains=TRUE
, which uses doRNG
unless doSetSeed=FALSE
. By default, one of the chains starts with all
variants in the model, so that chain takes much longer to run than do the other
chains. Consequently, when running multiple replicates via ptycho.all
,
much greater time savings can be achieved by running the replicates in parallel
with, for example,
data <- createPubData("pleiotropy") require(doMC) registerDoMC(8) ptycho.all(data=data, across="traits", doGrpIndicator=TRUE, dir.out="/path/to/output/dir/", only.means=50000*(1:10), nburn=10000)In this case, the default behavior of reproducible results does not require doRNG because the seed is set after each parallel worker is created.
We conclude this description with a discussion of the running time of the MCMC
sampler. Our actual data has 5335 subjects, 764 variants and three traits. An
mcmc.list
containing 50,000 samples for each of four chains can take
about 5~GB. Running chains in parallel, it takes less than an hour (on a Linux
computer with 2.6 GHz processors) to perform 510,000 samples per chain. The run
time depends primarily on the number of entries that are TRUE
in the
sampled indic.var
matrices; increasing this will increase run times. A
chain that initially has all entries of indic.var
set to TRUE
will
take longer than one where the model is initially empty. Priors that inflate
the posterior expectation of indic.var[j,k]
(such as combining
information across responses without using indic.grp
) will also take
longer.
createData
for simulating input data. checkConvergence
and PosteriorStatistics for analyzing
output of ptycho
.
Data describes tinysim
in example below as well as an object
created with ptycho
.
data(tinysim)
# Use replicate 4.
X <- tinysim$X; p <- ncol(X); nr <- 4
# COMBINE INFORMATION ACROSS RESPONSES
Y <- tinysim$replicates[[nr]]$y; q <- ncol(Y)
# Run 2 chains.
state <- list(list(indic.grp=rep(FALSE,p),
indic.var=matrix(FALSE,nrow=p,ncol=q), tau=1),
list(indic.grp=rep(TRUE,p),
indic.var=matrix(TRUE,nrow=p,ncol=q), tau=1))
# In each chain, discard first 10 burn-in samples, then generate
# 100 samples and save running means after every 20 samples.
smpl.ph <- ptycho(X=X, y=Y, initStates=state, only.means=20*(1:5),
nburn=10)
# COMBINE INFORMATION ACROSS VARIANTS
# Use two groups of variants.
G <- 2; groups <- createGroupsSim(G, p)
# Run 2 chains.
state <- list(list(indic.grp=rep(FALSE,G),
indic.var=matrix(FALSE,nrow=p,ncol=1), tau=1),
list(indic.grp=rep(TRUE,G),
indic.var=matrix(TRUE,nrow=p,ncol=1), tau=1))
# Use response 3.
y <- tinysim$replicates[[nr]]$y[,3,drop=FALSE]
smpl.var <- ptycho(X=X, y=y, groups=groups, initStates=state,
only.means=c(20*(1:5)), nburn=10, nthin=1)
Run the code above in your browser using DataLab