createData: Simulate Data

Description

Create a data object suitable for ptycho.all.

Usage

createData(X, y, omega = NULL, beta = NULL)
createDataBayesModel(mode = c("exchange","pleiotropy","gene"), n, p, q,
                     nreps, tau.min, tau.max, G)
createPubData(mode = c("tinysim","ptychoIn",
                       "exchange","pleiotropy","gene",
                       "actualGeno","actualPheno","corTest",
                       "fixedOmega","uniformEffects"),
              X=NULL, y=NULL, var.detail=NULL, variants=NULL)

Value

List containing:

X

Design matrix

q

Number of columns in each response

noise.sd

Standard deviation of the simulated noise; NULL if input y is numeric

omega

Input omega

beta

Input beta

replicates

List of length y$nreps (length 1 if y is numeric), each entry of which is a list with the following components:

omega: Matrix containing probabilities of association between covariates and responses; row names are colnames(X) and column names are colnames(y); NULL if input y is numeric

indic.var

Matrix containing ones for associations and zeros otherwise; row and column names are same as for omega; NULL if input y is numeric

beta

Matrix of effect sizes; row and column names are same as for omega; NULL if input y is numeric

y

Response matrix

eta2

Matrix with row names equal to colnames(X) and column names equal to colnames(y)

For createDataBayesModel with mode that uses a second level of indicator variables, each entry in the replicate list also has components omega.grp and indic.grp containing the intermediate steps of drawing the second-level indicator variable before drawing omega. If the argument beta to createData is “createBetaNormal” (which it is when called by createDataBayesModel), then each replicate will also have a component tau giving the value drawn by a call to runif(1, tau.min, tau.max).

Arguments

X

Design matrix or alist specifying how to generate such a matrix. If a list, the first entry is a function name and the second is a list of arguments to the function. In createPubData, X is ignored unless mode is “actualGeno”, “actualPheno”, or corTest.

y

Numeric vector or matrix or list with the following components:

nreps: Number of replicates to simulate

q

Number of responses to generate for each replicate

sd

Standard deviation of the simulated noise

In createPubData, y is ignored unless mode is “actualPheno”.

omega

Numeric vector or matrix or list specifying how to generate a list with the component omega; see Details for its meaning. If this is a list, the first entry is a function name and the second is a list of arguments to the function, which will be prepended by the number of rows in output X and the number of columns in output y. Only used if y is a list.

beta

List specifying how to generate a matrix of effect sizes. The first entry of the list is a function name and the second is a list of arguments to the function, which will be prepended by a matrix specifying the variables selected and y$sd. Only used if y is a list.

Number of observations to simulate

Number of covariates to simulate

Number of responses to simulate for each replicate

nreps

Number of replicates to simulate

mode

String specifying type of dataset to create:

tinysim: Simulated data included with this package; equivalent to mode pleiotropy except that the dataset is tiny, with n=100, p=10, q=5, and nreps=10

ptychoIn

Simulated data included with this package; equivalent to mode gene except that the dataset is tiny, with n=3000, p=10, q=1, and nreps=1

exchange

Create orthogonal X and exchangeable variants; n=5000, p=50, q=5, and nreps=100

pleiotropy

Create orthogonal X, and several variants have nonzero effects on multiple responses; n=5000, p=50, q=5, and nreps=100

gene

Create orthogonal X, and each group of variants typically has either several or no variants that effect a response; n=5000, p=50, q=5, and nreps=100

actualGeno

Simulate responses for input X

corTest

Simulate q=2 responses for input X. There will be 10 replicates with the first variant in argument variants causal for both responses, 10 with the second variant causal, and 20 with variant i causal for response i. No other variant will be causal.

actualPheno

Put input X and y into data object

fixedOmega

Create orthogonal X, and each variant has a certain probability of a nonzero effect size

uniformEffects

Same as mode fixedOmega except that effect sizes are uniformly rather than normally distributed

For createDataBayesModel, mode must be one of “exchange”, “pleiotropy”, or “gene”.

tau.min, tau.max

Endpoints of uniform distribution from which to draw tau

Number of groups of covariates; unused if mode is not “gene”

var.detail

Data frame with row names same as column names of X; must have columns “MAF” and “GENE”. Ignored unless mode is “actualGeno”.

variants

Character vector containing names of two columns of X; ignored unless mode is “corTest”.

Author

Laurel Stell and Chiara Sabatti
Maintainer: Laurel Stell <lstell@stanford.edu>

Details

We describe createData and then describe its wrappers createDataBayesModel and createPubData.

Although createData can form the data object required by ptycho.all when X and y are input, it primarily exists to simplify simulating data from $Y=X\beta+\epsilon$, where $\epsilon$ is normal with mean zero and specified standard deviation and $\beta$ is sparse with entries simulated as specified.

The function generates a specified number of replicates, all of which use the same design matrix $X$. If this matrix is not input, then its argument must specify a function call to generate it. In either case, suppose $X$ has $n$ rows and $p$ columns.

If the input y is numeric, then it will be used for the lone replicate. If it is a matrix, it must have $n$ rows; let $q$ be its number of columns. If input y is a numeric vector, it must have $n$ entries and will be cast as a matrix with $q=1$ column. Otherwise, input y is a list specifying, along with the arguments omega and beta, how to simulate the response(s). Because it is useful in analysis of the estimation of the marginal posterior distribution, the returned object always contains, regardless of how X and y are specified, a matrix eta2 with $(j,k)$ entry equal to $\mathbf{x}_j^T \mathbf{y}_k / (n \mathbf{y}_k^T \mathbf{y}_k)$

If y is to be simulated, the first step is to choose the probability that each covariate is associated with each reponse as specified by the input argument omega. If this argument is a matrix, it must have size $p$-by-$q$. If it is not a matrix but is numeric, it will be passed to matrix to create a matrix of the correct size. Otherwise, the matrix for each replicate will be generated by calling the function whose name is given by omega[[1]] with argument list (p, q, omega[[2]]). This function must return a list with component omega set to a $p$-by-$q$ matrix; the list may also contain additional components. The package contains several functions whose names start with “createOmega” that might guide users in writing their own functions.

The next step is to draw a $p$-by-$q$ matrix indic.var whose $(j,k)$ entry is equal to one with probability omega[j,k] and zero otherwise. This matrix will be drawn until all column sums are positive.

For each entry in indic.var that is equal to one, the effect size must be drawn. This is done by calling the function whose name is given by beta[[1]] with argument list (indic.var, y$sd, beta[[2]]). This function must return a list with component beta set to a $p$-by-$q$ matrix; the list may also contain additional components. If indic.var[j,k] is zero, then beta[j,k] should be zero. The package contains functions whose names start with “createBeta” that might guide users in writing their own functions.

Finally, an $n$-by-$q$ matrix of noise is drawn from $N(0,\sigma^2)$, where $\sigma$ is the input noise.sd, and added to $X\beta$ to obtain y. The column names of each response matrix generated will be y1, y2, and so forth.

The function createPubData generates the data sets used in Stell and Sabatti (2015). For mode equal to “exchange”, “pleiotropy”, or “geno”, it calls createData via createDataBayesModel; otherwise, it calls createData directly. These functions also serve as additional examples of the use of createData. For reproducibility, createPubData first sets the random seed to 1234, except that it is set to 4 when mode equals “ptychoIn” and it does not set it when mode equals “corTest”.

In createDataBayesModel, if mode is “exchange”, then one $\omega \sim \mbox{Beta}(12,48)$ is drawn independently for each trait. If mode is “pleiotropy”, then one probability of association for a trait is drawn from Beta(16,55) for each data set, that probability is used to draw indic.grp for each variant, and then the probability of nonzero indic.var[j,k] is drawn from Beta(48,12) for each nonzero indic.grp[j]. Finally, if mode is “gene”, the process is analogous to pleiotropy except that each trait is simulated independently.

References

Stell, L. and Sabatti, C. (2015) Genetic variant selection: learning across traits and sites, arXiv:1504.00946.

Examples

Run this code

### EXAMPLE 1
data(tinysim)
# Data generated with mode equal to pleiotropy, so indic.grp exists and
# has an entry for each column in X.
colnames(tinysim$X)
tinysim$replicates[[5]]$indic.grp
# X4, X6, and X9 are associated with some responses.
tinysim$replicates[[5]]$indic.var

### EXAMPLE 2
# Generate miniature data set with information shared across covariates.
set.seed(1234)
tiny1 <- createDataBayesModel(mode="gene", n=100, p=10, q=5, nreps=10,
                              tau.min=0.045, tau.max=0.063, G=2)
# A covariate can only have indic.var=1 if the group it belongs to has
# indic.grp=1.  For example,indic.grp[1,4]=0 implies
# indic.var[groups$group2var[1],4]=0.
tiny1$replicates[[1]]$indic.grp
tiny1$omega[[2]]$groups$group2var[1]
tiny1$replicates[[1]]$indic.var

### EXAMPLE 3
# Alternatively, call createData directly
groups <- createGroupsSim(G=2, p=10)
omegaargs <- list(indic.grp.shape1=16, indic.grp.shape2=55,
                  shape1=48, shape2=12, groups=groups)
betaargs <- list(tau.min=0.045, tau.max=0.063)
set.seed(1234)
tiny2 <- createData(X=list("createOrthogonalX", list(n=100, p=10)),
                    y=list(nreps=10, q=5, sd=1),
                    omega=list("createOmegaCrossVars", omegaargs),
                    beta=list("createBetaNormal", betaargs))
identical(tiny1, tiny2)
### SEE THE CODE FOR createPubData FOR MORE EXAMPLES.

Run the code above in your browser using DataLab