Learn R Programming

tmle.npvi (version 0.10.0)

getSample: Generates Simulated Data

Description

Generates a run of simulated observations of the form $(W,X,Y)$ to investigate the "effect" of $X$ on $Y$ taking $W$ into account.

Usage

getSample(n, O, lambda0, p = rep(1/3, 3), omega = rep(1/2, 3), Sigma1 = matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2), sigma2 = omega[2]/5, Sigma3 = Sigma1, f = identity, verbose = FALSE)

Arguments

n
An integer, the number of observations to be generated.
O
A 3x3 numeric matrix or data.frame. Rows are 3 baseline observations used as "class centers" for the simulation. Columns are:
  • $W$, baseline covariate (e.g. DNA methylation), or "confounder" in a causal model.
  • $X$, continuous exposure variable (e.g. DNA copy number), or "cause" in a causal model , with a reference value $x_0$ equal to O[2, "X"].
  • $Y$, outcome variable (e.g. gene expression level), or "effect" in a causal model.
lambda0
A function that encodes the relationship between $W$ and $Y$ in observations with levels of $X$ equal to the reference value $x_0$.
p
A vector of length 3 whose entries sum to 1. Entry p[k] is the probability that each observation belong to class k with class center O[k, ]. Defaults to rep(1/3, 3).
omega
A vector of length 3 with positive entries. Entry omega[k] is the standard deviation of $W$ in class k (on a logit scale). Defaults to rep(1/2, 3).
Sigma1
A 2x2 covariance matrix of the random vector $(X,Y)$ in class k=1, assumed to be bivariate Gaussian with mean O[1, c("X", "Y")]. Defaults to matrix(c(1, 1/sqrt(2), 1/sqrt(2), 1), 2, 2).
sigma2
A positive numeric, the variance of the random variable $Y$ in class k=2, assumed to be univariate Gaussian with mean O[2, "Y"]. Defaults to omega[2]/5.
Sigma3
A 2x2 covariance matrix of the random vector $(X,Y)$ in class k=3, assumed to be bivariate Gaussian with mean O[3, c("X", "Y")]. Defaults to Sigma1.
f
A function involved in the definition of the parameter of interest $\psi$, which must satisfy $f(0)=0$ (see Details). Defaults to identity.
verbose
Prescribes the amount of information output by the function. Defaults to FALSE.

Value

obs
A matrix of n observations. The c(W,X,Y) colums of obs have the same interpretation as the columns of the input argument O.
psi
A numeric, approximated value of the true parameter $\psi$ obtained using the known $\theta$ and the joint empirical distribution of $(X,W)$ based on the same run of observations as in obs. The larger the sample size, the more accurate the approximation.
g
A function, the known positive conditional probability $P(X=x_0|W)$.
mu
A function, the known conditional expectation $E_P(X|W)$.
muAux
A function, the known conditional expectation $E_P(X|X\neq x_0, W)$.
theta
A function, the known conditional expectation $E_P(Y|X,W)$.
theta0
A function, the known conditional expectation $E_P(Y|X=x_0,W)$.
sigma2
A positive numeric, the known expectation $E_P(f(X-x_0)^2)$.
effIC
A function, the known efficient influence curve of the functional $\Psi$ at $P$, assuming that the reference value $x_0=0$.
varIC
A positive numeric, approximated value of the variance of the efficient influence curve of the functional $\Psi$ at $P$ and evaluated at $O$, obtained using the same run of observations as in obs. The larger the sample size, the more accurate the approximation.

Details

The parameter of interest is defined as $\psi=\Psi(P)$ with $$\Psi(P) = \frac{E_P[f(X-x_0) * (\theta(X,W) - \theta(x_0,W))]}{E_P[f(X-x_0)^2]},$$ with $P$ the distribution of the random vector $(W,X,Y)$, $\theta(X,W) = E_P[Y|X,W]$, $x_0$ the reference value for $X$, and $f$ a user-supplied function such that $f(0)=0$ (e.g., $f=identity$, the default value). The value $\psi$ is obtained using the known $\theta$ and the joint empirical distribution of $(X,W)$ based on the same run of observations as in obs. Seeing $W, X, Y$ as DNA methylation, DNA copy number and gene expression, respectively, the simulation scheme implements the following constraints:
  • There are two or three copy number classes: normal regions (k=2), and regions of copy number gains and/or losses (k=1 and/or k=3).
  • In normal regions, gene expression levels are negatively correlated with DNA methylation.
  • In regions of copy number alteration, copy number and expression are positively correlated.

References

Chambaz, A., Neuvial, P., & van der Laan, M. J. (2012). Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic journal of statistics, 6, 1059--1099.

Examples

Run this code
## Parameters for the simulation (case 'f=identity')
O <- cbind(W=c(0.05218652, 0.01113460),
           X=c(2.722713, 9.362432),
           Y=c(-0.4569579, 1.2470822))
O <- rbind(NA, O)
lambda0 <- function(W) {-W}
p <- c(0, 1/2, 1/2)
omega <- c(0, 3, 3)
S <- matrix(c(10, 1, 1, 0.5), 2 ,2)

## Simulating a data set of 200 i.i.d. observations
sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
str(sim)

obs <- sim$obs
head(obs)
pairs(obs)

## Adding (dummy) baseline covariates
V <- matrix(runif(3*nrow(obs)), ncol=3)
colnames(V) <- paste("V", 1:3, sep="")
obsV <- cbind(V, obs)
head(obsV)

## True psi and confidence intervals (case 'f=identity')      
sim01 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
truePsi1 <- sim01$psi

confInt01 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(sim01$obs))
confInt1 <- truePsi1+c(-1, 1)*qnorm(.975)*sqrt(sim01$varIC/nrow(obs))
msg <- "\nCase f=identity:\n"
msg <- c(msg, "\ttrue psi is: ", paste(signif(truePsi1, 3)), "\n")
msg <- c(msg, "\t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt01, 3)), "\n")
msg <- c(msg, "\toptimal 95%-confidence interval is: ", 
         paste(signif(confInt1, 3)), "\n")
cat(msg)

## True psi and confidence intervals (case 'f=atan')
f2 <- function(x) {1*atan(x/1)}
sim02 <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S, f=f2);
truePsi2 <- sim02$psi;

confInt02 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(sim02$obs))
confInt2 <- truePsi2+c(-1, 1)*qnorm(.975)*sqrt(sim02$varIC/nrow(obs))

msg <- "\nCase f=atan:\n"
msg <- c(msg, "\ttrue psi is: ", paste(signif(truePsi2, 3)), "\n")
msg <- c(msg, "\t95%-confidence interval for the approximation is: ", 
         paste(signif(confInt02, 3)), "\n")
msg <- c(msg, "\toptimal 95%-confidence interval is: ", 
         paste(signif(confInt2, 3)), "\n")
cat(msg)

Run the code above in your browser using DataLab