getSample: Generates Simulated Data

Description

Generates a run of simulated observations of the form $(W,X,Y)$ to investigate the "effect" of $X$ on $Y$ taking $W$ into account.

Usage

getSample(n, O, lambda0, p = rep(1/3, 3), omega = rep(1/2, 3),  Sigma1 = matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2), sigma2 = omega[2]/5,  Sigma3 = Sigma1, f = identity, verbose = FALSE)

Arguments

An integer, the number of observations to be generated.

A 3x3 numeric matrix or data.frame. Rows are 3 baseline observations used as "class centers" for the simulation. Columns are:

$W$, baseline covariate (e.g. DNA methylation), or "confounder" in a causal model.
$X$, continuous exposure variable (e.g. DNA copy number), or "cause" in a causal model , with a reference value $x_0$ equal to O[2, "X"].
$Y$, outcome variable (e.g. gene expression level), or "effect" in a causal model.

lambda0

A function that encodes the relationship between $W$ and $Y$ in observations with levels of $X$ equal to the reference value $x_0$.

A vector of length 3 whose entries sum to 1. Entry p[k] is the probability that each observation belong to class k with class center O[k, ]. Defaults to rep(1/3, 3).

omega

A vector of length 3 with positive entries. Entry omega[k] is the standard deviation of $W$ in class k (on a logit scale). Defaults to rep(1/2, 3).

Sigma1

A 2x2 covariance matrix of the random vector $(X,Y)$ in class k=1, assumed to be bivariate Gaussian with mean

O[1, c("X",
"Y")]

. Defaults to matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2).

sigma2

A positive numeric, the variance of the random variable $Y$ in class k=2, assumed to be univariate Gaussian with mean

O[2,
"Y"]

. Defaults to omega[2]/5.

Sigma3

A 2x2 covariance matrix of the random vector $(X,Y)$ in class k=3, assumed to be bivariate Gaussian with mean

O[3, c("X",
"Y")]

. Defaults to Sigma1.

A function involved in the definition of the parameter of interest $\psi$, which must satisfy $f(0)=0$ (see Details). Defaults to identity.

verbose

Prescribes the amount of information output by the function. Defaults to FALSE.

Value

obs: A matrix of n observations. The c(W,X,Y) colums of obs have the same interpretation as the columns of the input argument O.
psi: A numeric, approximated value of the true parameter $\psi$ obtained using the known $\theta$ and the joint empirical distribution of $(X,W)$ based on the same run of observations as in obs. The larger the sample size, the more accurate the approximation.
g: A function, the known positive conditional probability $P(X=x_0|W)$.
mu: A function, the known conditional expectation $E_P(X|W)$.
muAux: A function, the known conditional expectation $E_P(X|X\neq x_0, W)$.
theta: A function, the known conditional expectation $E_P(Y|X,W)$.
theta0: A function, the known conditional expectation $E_P(Y|X=x_0,W)$.
sigma2: A positive numeric, the known expectation $E_P(f(X-x_0)^2)$.
effIC: A function, the known efficient influence curve of the functional $\Psi$ at $P$, assuming that the reference value $x_0=0$.
varIC: A positive numeric, approximated value of the variance of the efficient influence curve of the functional $\Psi$ at $P$ and evaluated at $O$, obtained using the same run of observations as in obs. The larger the sample size, the more accurate the approximation.

Details

The parameter of interest is defined as $\psi=\Psi(P)$ with $$\Psi(P) = \frac{E_P[f(X-x_0) * (\theta(X,W) - \theta(x_0,W))]}{E_P[f(X-x_0)^2]},$$ with $P$ the distribution of the random vector $(W,X,Y)$, $\theta(X,W) = E_P[Y|X,W]$, $x_0$ the reference value for $X$, and $f$ a user-supplied function such that $f(0)=0$ (e.g., $f=identity$, the default value). The value $\psi$ is obtained using the known $\theta$ and the joint empirical distribution of $(X,W)$ based on the same run of observations as in obs. Seeing $W, X, Y$ as DNA methylation, DNA copy number and gene expression, respectively, the simulation scheme implements the following constraints:

There are two or three copy number classes: normal regions (k=2), and regions of copy number gains and/or losses (k=1 and/or k=3).
In normal regions, gene expression levels are negatively correlated with DNA methylation.
In regions of copy number alteration, copy number and expression are positively correlated.

References

Chambaz, A., Neuvial, P., & van der Laan, M. J. (2012). Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic journal of statistics, 6, 1059--1099.

Examples

Run this code

## Parameters for the simulation (case 'f=identity')
O <- cbind(W=c(0.05218652, 0.01113460),
           X=c(2.722713, 9.362432),
           Y=c(-0.4569579, 1.2470822))
O <- rbind(NA, O)
lambda0 <- function(W) {-W}
p <- c(0, 1/2, 1/2)
omega <- c(0, 3, 3)
S <- matrix(c(10, 1, 1, 0.5), 2 ,2)

## Simulating a data set of 200 i.i.d. observations
sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
str(sim)

obs <- sim$obs
head(obs)
pairs(obs)

## Adding (dummy) baseline covariates
V <- matrix(runif(3*nrow(obs)), ncol=3)
colnames(V) <- paste("V", 1:3, sep="")
obsV <- cbind(V, obs)
head(obsV)

## True psi and confidence intervals (case 'f=identity')      
sim01 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
truePsi1 <- sim01$psi

confInt01 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(sim01$obs))
confInt1 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(obs))
msg <- "\nCase f=identity:\n"
msg <- c(msg, "\ttrue psi is: ", paste(signif(truePsi1, 3)), "\n")
msg <- c(msg, "\t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt01, 3)), "\n")
msg <- c(msg, "\toptimal 95%-confidence interval is: ", 
         paste(signif(confInt1, 3)), "\n")
cat(msg)

## True psi and confidence intervals (case 'f=atan')
f2 <- function(x) {1*atan(x/1)}
sim02 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S, f=f2);
truePsi2 <- sim02$psi;

confInt02 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(sim02$obs))
confInt2 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(obs))

msg <- "\nCase f=atan:\n"
msg <- c(msg, "\ttrue psi is: ", paste(signif(truePsi2, 3)), "\n")
msg <- c(msg, "\t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt02, 3)), "\n")
msg <- c(msg, "\toptimal 95%-confidence interval is: ", 
         paste(signif(confInt2, 3)), "\n")
cat(msg)

Run the code above in your browser using DataLab