simulateInterventions: Simulate data of a causal (possibly cyclic) model under interventions.

Description

Simulate data of a causal (possibly cyclic) model under interventions.

Usage

simulateInterventions(
  n,
  p,
  df,
  rhoNoise,
  snrPar,
  sparse,
  doInterv,
  numberInt,
  strengthInt,
  cyclic,
  strengthCycle,
  modelMis = FALSE,
  modelMisPar = 1,
  seed = 1
)

Arguments

n

Number of observations.

p

Number of variables.

df

Degrees of freedom in the t-distribution of the noise and the interventions.

rhoNoise

Correlation between noise terms to model hidden variables. Set to 0 for independent noise.

snrPar

Signal-to-noise parameter: steers what proportion of the variance stems from the signal and what proportion from the noise. The SNR is given by $SNR = (1-snrPar)/snrPar$; see details. Only holds when cyclic = FALSE.

sparse

Probability that an entry $(i,j)$ of the adjacency matrix is non-zero, i.e. that the corresponding edge is present; see details.

doInterv

Set to TRUE if interventions should be do-interventions; otherwise noise interventions (also called shift interventions) are generated.

numberInt

Total number of settings.

strengthInt

Regulates the strength of the interventions, see details.

cyclic

Set to TRUE if the resulting graph should contain a cycle.

strengthCycle

Steers strength of feedback, see details.

modelMis

Set to TRUE to add a model misspecification that applies tanh(modelMisPar*x)/modelMisPar marginally to each variable after X has been generated from the causal model.

modelMisPar

Parameter steering the strength of the model misspecification.

seed

Random seed.

Value

A list with the following elements:

X $n \times p$-dimensional data matrix
environment Indicator of the experiment or the intervention type an observation belongs to. A numeric vector of length $n$.
interventions A list of length $n$. Indicates location of interventions for each data point.
whereInt A list of length numberInt. Indicates location of interventions in each setting.
noise
configs A list with the generated adjacency matrix (trueA) as well as all input arguments.
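
A call and a first look at the returned list might be sketched as follows. This is an illustration under assumptions: the package name CompareCausalNetworks does not appear on this page and is assumed here, and the argument values are arbitrary choices, not recommended settings.

```r
# Assumption: simulateInterventions() is exported by the CompareCausalNetworks
# package. The guard skips the example if that package is not installed.
if (requireNamespace("CompareCausalNetworks", quietly = TRUE)) {
  sim <- CompareCausalNetworks::simulateInterventions(
    n = 500, p = 5, df = 3, rhoNoise = 0, snrPar = 0.5, sparse = 0.3,
    doInterv = FALSE, numberInt = 4, strengthInt = 1,
    cyclic = FALSE, strengthCycle = 0.5, seed = 1
  )
  print(dim(sim$X))              # data matrix, documented as n x p
  print(table(sim$environment))  # number of observations per setting
  print(sim$whereInt)            # intervened variables in each setting
  print(sim$configs$trueA)       # generated adjacency matrix
}
```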

Details

First, the adjacency matrix $A$ is generated as follows. Assume the variables with indices $\{1, \ldots, p\}$ are causally ordered. For each pair of nodes $i$ and $j$ where $i$ precedes $j$ in the causal ordering, we draw a sample from Bin(1, sparse) to determine whether to add an edge from node $i$ to node $j$. After having sampled the non-zero entries of $A$ in this fashion, we sample the coefficients from Unif(-1,1). As described below, the edge weights are later rescaled to achieve a specified signal-to-noise ratio. We exclude the possibility of $A = 0$, i.e. we resample until $A$ contains at least one non-zero entry.
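
The sampling of $A$ described above can be sketched in base R. This is not the package's code: the variable names are illustrative, and the storage convention (row $j$ holds the coefficients of the parents of $X_j$, so that edges $i \to j$ with $i < j$ sit in the lower triangle) is an assumption chosen to match $X = (I-A)^{-1}\epsilon$.

```r
set.seed(1)
p <- 5
sparse <- 0.3

repeat {
  A <- matrix(0, p, p)
  lower <- lower.tri(A)                  # positions (j, i) with i < j
  nPairs <- sum(lower)
  edges <- rbinom(nPairs, size = 1, prob = sparse)  # edge present or not
  weights <- runif(nPairs, min = -1, max = 1)       # raw edge coefficients
  A[lower] <- edges * weights
  if (any(A != 0)) break                 # resample if the graph is empty
}
A
```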

Second, the interventions are generated as follows. numberInt denotes the total number of (interventional and observational) settings that are generated. For each variable, we sample uniformly at random with replacement one setting in which this variable is intervened on. In other words, each variable is intervened on in exactly one setting. Hence there may be settings in which no interventions take place; these correspond to the observational case. Similarly, there may be settings where interventions are performed on multiple variables at once. After defining the settings, we sample (uniformly at random with replacement) which setting each data point belongs to, so each setting receives approximately the same number of samples. In one generated data set, the interventions are all of the same type, i.e. they are either all shift interventions (when doInterv = FALSE) or all do-interventions (when doInterv = TRUE). In both cases, an intervention on $X_j$ is modelled by generating $Z_j$ as $Z_j \sim$ strengthInt $\cdot\, t$(df). If strengthInt = 0, all interventional settings correspond to purely observational data.
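
The assignment of variables and data points to settings can be sketched as follows. The names settingOfVar, env and Z are hypothetical helpers for this illustration; whereInt mirrors the element of the same name in the returned list, but the code is not taken from the package.

```r
set.seed(1)
n <- 200; p <- 5; numberInt <- 3
strengthInt <- 1; df <- 3

# Each variable is intervened on in exactly one randomly chosen setting.
settingOfVar <- sample(numberInt, p, replace = TRUE)
whereInt <- lapply(seq_len(numberInt), function(s) which(settingOfVar == s))

# Each data point is assigned to one setting uniformly at random.
env <- sample(numberInt, n, replace = TRUE)

# Intervention values: non-zero only for variables intervened on
# in the setting the data point belongs to.
Z <- matrix(0, n, p)
for (i in seq_len(n)) {
  vars <- whereInt[[env[i]]]
  Z[i, vars] <- strengthInt * rt(length(vars), df = df)
}
table(env)
```

A setting whose entry in whereInt is empty plays the role of the observational setting.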

Third, the noise terms $\epsilon$ are generated by first sampling from $N(0,\Sigma)$ where $\Sigma_{i,i} = 1$ and $\Sigma_{i,j} =$ rhoNoise for $i \neq j$. To steer the signal-to-noise ratio, we set the variance of the noise terms of all nodes except source nodes to snrPar, where $0 <$ snrPar $\le 1$. Stepping through the variables in causal order, for each variable $X_j$ that has parents, we uniformly rescale the edge weights $\beta_{j,k}$ for $k = 1, \ldots, p$ in the structural equation of variable $X_j$ such that the variance of the sum $\sum_{k=1}^p \beta_{j,k} X_k + \epsilon_j$ is approximately 1 in the observational setting. In other words, the parameter snrPar steers what proportion of the variance stems from the signal given by $\sum_{k=1}^p \beta_{j,k} X_k$ and what proportion stems from the noise $\epsilon_j$. The signal-to-noise ratio can then be computed as SNR = (1-snrPar)/snrPar.
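
The rescaling for a single variable $X_j$ can be illustrated with a small simulation. The sketch assumes two independent parents with unit variance, independent Gaussian noise (rhoNoise = 0) and uses the empirical variance; how the package computes the scaling factor is not stated here, so this is one possible way only.

```r
set.seed(1)
n <- 10000
snrPar <- 0.2

xParents <- matrix(rnorm(n * 2), n, 2)   # two parents of X_j, unit variance
beta <- c(0.7, -0.4)                     # raw edge weights from Unif(-1, 1)

# Rescale all weights by a common factor so that Var(signal) = 1 - snrPar.
signalRaw <- drop(xParents %*% beta)
betaScaled <- beta * sqrt((1 - snrPar) / var(signalRaw))

signal <- drop(xParents %*% betaScaled)
eps <- rnorm(n, sd = sqrt(snrPar))       # noise with variance snrPar
xj <- signal + eps

var(xj)                  # approximately 1
var(signal) / var(eps)   # approximately (1 - snrPar) / snrPar = 4
```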

Fourth, a cycle is added to the causal graph if cyclic = TRUE. In that case, we sample two nodes $i$ and $j$ such that adding an edge between them creates a cycle in the causal graph. We then compute the largest possible coefficient for this edge such that the cycle product is smaller than 1. Subsequently, we sample the sign of the coefficient and set the magnitude by scaling the largest possible coefficient by strengthCycle, where $0 <$ strengthCycle $< 1$.
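
For intuition, the choice of the feedback coefficient can be sketched for the simplest case. The sketch assumes that exactly one directed path leads from $i$ to $j$, so that the cycle product is the product of the path weights and the new coefficient; the path weights are made-up values and the general case in the package may be handled differently.

```r
set.seed(1)
strengthCycle <- 0.5

# Hypothetical weights along the single directed path i -> k -> j.
pathWeights <- c(0.8, -0.5)
pathProduct <- prod(pathWeights)

# Supremum of coefficient magnitudes for the edge j -> i that keep
# the absolute cycle product below 1.
maxCoef <- 1 / abs(pathProduct)

# Sample the sign and scale the magnitude by strengthCycle.
coefCycle <- sample(c(-1, 1), 1) * strengthCycle * maxCoef

abs(coefCycle * pathProduct)   # equals strengthCycle, hence below 1
```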

Fifth, we rescale the noise variables to obtain a $t$-distribution with df degrees of freedom. $X$ is then generated as $X = (I-A)^{-1}\epsilon$ in the observational case; under a shift intervention, $X$ is generated as $X = (I-A)^{-1}(\epsilon + Z)$ where the coordinates of $Z$ are only non-zero for the variables that are intervened on. Under a do-intervention on $X_j$, $\beta_{j,k}$ for $k = 1, \ldots, p$ are set to 0 to yield $A'$ and $\epsilon_j$ is set to $Z_j$ to yield $\epsilon_j'$. We then obtain $X$ as $X = (I-A')^{-1}\epsilon'$.
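
The three cases can be worked through for a single observation in a three-variable chain. The numbers are made up, and the sketch assumes that row $j$ of $A$ holds the coefficients $\beta_{j,k}$ of the structural equation of $X_j$.

```r
p <- 3
A <- matrix(0, p, p)
A[2, 1] <- 0.8    # X_2 = 0.8 * X_1 + eps_2
A[3, 2] <- -0.5   # X_3 = -0.5 * X_2 + eps_3
I <- diag(p)

eps <- c(1, 0.5, -1)   # noise for one observation
Z <- c(0, 2, 0)        # intervention on X_2 only

# Observational case: X = (I - A)^{-1} eps
xObs <- solve(I - A, eps)          # 1, 1.3, -1.65

# Shift intervention: X = (I - A)^{-1} (eps + Z)
xShift <- solve(I - A, eps + Z)    # 1, 3.3, -2.65

# Do-intervention on X_2: remove incoming edges, replace eps_2 by Z_2
Ado <- A
Ado[2, ] <- 0
epsDo <- eps
epsDo[2] <- Z[2]
xDo <- solve(I - Ado, epsDo)       # 1, 2, -2
```

Under the shift intervention $X_2$ still depends on $X_1$, whereas under the do-intervention it is set to $Z_2$ regardless of its parents; in both cases the change propagates to $X_3$.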

Lastly, if modelMis = TRUE, a model misspecification is added to the data by marginally transforming all variables as tanh(modelMisPar*x)/modelMisPar.
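
The effect of this transformation can be seen in a short sketch. The helper name misspecify is hypothetical; only the formula is taken from the description above.

```r
# Marginal transformation applied to each variable when modelMis = TRUE.
misspecify <- function(x, modelMisPar = 1) {
  tanh(modelMisPar * x) / modelMisPar
}

x <- c(-3, -1, 0, 0.1, 1, 3)
misspecify(x, modelMisPar = 0.01)  # close to x: almost no misspecification
misspecify(x, modelMisPar = 1)     # large values are compressed towards +/- 1
misspecify(x, modelMisPar = 2)     # output is bounded by 1 / modelMisPar = 0.5
```

Small values of modelMisPar leave the data nearly unchanged, while larger values compress it more strongly, since the output is bounded in absolute value by 1/modelMisPar.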