Learn R Programming

simcausal

The simcausal R package is a tool for specification and simulation of complex longitudinal data structures that are based on structural equation models (SEMs). The emphasis is on the types of simulations frequently encountered in causal inference problems, such as, observational data with time-dependent confounding, selection bias, and random monitoring processes. The interface allows for quick expression of dependencies between a large number of time-varying nodes.

Installation

To install the CRAN release version of simcausal:

install.packages('simcausal')

To install the development version (requires the devtools package):

devtools::install_github('osofr/simcausal', build_vignettes = FALSE)

Documentation

Once the package is installed, see the vignette, consult the internal package documentation and examples.

  • To see the vignette in R:
vignette("simcausal_vignette", package="simcausal")
  • To see all available package documentation:
?simcausal
help(package = 'simcausal')
  • To see the latest updates for the currently installed version of the package:
news(package = "simcausal")

Brief overview

Below is an example simulating data with 4 covariates specified by 4 structural equations (nodes). New equations are added by using successive calls to + node() function and data are simulated by calling sim function:

library("simcausal")
D <- DAG.empty() +
  node("CVD", distr="rcat.b1", probs = c(0.5, 0.25, 0.25)) +
  node("A1C", distr="rnorm", mean = 5 + (CVD > 1)*10 + (CVD > 2)*5) +
  node("TI", distr="rbern", prob = plogis(-0.5 - 0.3*CVD + 0.2*A1C)) +
  node("Y", distr="rbern", prob = plogis(-3 + 1.2*TI + 0.1*CVD + 0.3*A1C))
D <- set.DAG(D)
dat <- sim(D,n=200)

To display the above SEM object as a directed acyclic graph:

plotDAG(D)

To allow the above nodes A1C, TI and Y to change over time, for time points t = 0,...,7, and keeping CVD the same, simply add t argument to node function and use the square bracket [...] vector indexing to reference time-varying nodes inside the node function expressions:

library("simcausal")
D <- DAG.empty() +
  node("CVD", distr="rcat.b1", probs = c(0.5, 0.25, 0.25)) +
  node("A1C", t=0, distr="rnorm", mean=5 + (CVD > 1)*10 + (CVD > 2)*5) +
  node("TI", t=0, distr="rbern", prob=plogis(-5 - 0.3*CVD + 0.5*A1C[t])) +

  node("A1C", t=1:7, distr="rnorm", mean=-TI[t-1]*10 + 5 + (CVD > 1)*10 + (CVD > 2)*5) +
  node("TI", t=1:7, distr="rbern", prob=plogis(-5 - 0.3*CVD + 0.5*A1C[t] + 1.5*TI[t-1])) +
  node("Y", t=0:7, distr="rbern", prob=plogis(-6 - 1.2*TI[t] + 0.1*CVD + 0.3*A1C[t]), EFU=TRUE)
D <- set.DAG(D)
dat.long <- sim(D,n=200)

The + action function allows defining counterfactual data under various interventions (e.g., static, dynamic, deterministic, or stochastic), which can be then simulated by calling sim function. In particular, the interventions may represent exposures to treatment regimens, the occurrence or non-occurrence of right-censoring events, or of clinical monitoring events.

In addition, the functions set.targetE, set.targetMSM and eval.target provide tools for defining and computing a few selected features of the distribution of the counterfactual data that represent common causal quantities of interest, such as, treatment-specific means, the average treatment effects and coefficients from working marginal structural models.

Using networks in SEMs

Function network provies support for networks simulations, in particular it enables defining and simulating SEM for dependent data. For example, a network sampling function like rnet.gnm (provided by the package, see ?rnet.gnm) can be used to specify and simulate dependent data from a network-based SEM. Start defining a SEM that uses the this network, with a +network syntax and providing "rnet.gnm" as a "netfun" argument to network function:

library("simcausal")
library("magrittr")
D <- DAG.empty() + network("ER.net", netfun = "rnet.gnm", m_pn = 50)

First define two IDD nodes W1 (categorical) and W2 (Bernoulli):

D <- D +
  node("W1", distr = "rcat.b1", probs = c(0.0494, 0.1823, 0.2806, 0.2680, 0.1651, 0.0546)) +
  node("W2", distr = "rbern", prob = plogis(-0.2 + W1/3))

New nodes (structural equations) can now be specified conditional on the past node values of observations connected to each unit i (friends of i). The friends are defined by the network matrix that is returned by theĀ above network generator rnet.gnm. Double square bracket syntax "[[...]]" allows referencing the node values of connected friends. Two special variables, "Kmax" and "nF" can be used along-side indexing "[[...]]". Kmax defines the maximal number of friends (maximal friend index) for all observation. When kth friend referenced in "Var[[k]]" doesn't exist, the default is to set that value to "NA". Adding the argument "replaceNAw0=TRUE" to node function changes such values from NA to 0. nF is another special variable, which is a vector of length n and each nF[i] is equal to the current number of friends for unit i. Any kind of summary function that can be applied to multiple time-varying nodes can be similarly applied to network-indexed nodes. For additional details, see the package documentation for the network function (?network) and the package vignette on conducting network simulations.

Define network variable "netW1" as the W1 values of the first friend and define binary exposure "A" so that probability of success for each unit 'i' for A is a logit-linear function of:

  1. W1[i],
  2. Sum of W1 values among all friends of i,
  3. Mean value of W2 among all friends of i.
dat.net <- {
  D + node("netW1.F1", distr = "rconst", const = W1[[1]]) +
  node("A", distr = "rbern",
              prob = plogis(2 + -0.5 * W1 +
                            -0.1 * sum(W1[[1:Kmax]]) +
                            -0.7 * ifelse(nF > 0, sum(W2[[1:Kmax]])/nF, 0)),
              replaceNAw0 = TRUE)} %>%
set.DAG() %>%
sim(n=1000)

The simulated data frame returned by sim() also contains the simulated network object, saved as a separate attribute. The network is saved as an R6 object of class NetIndClass, under attribute called "netind_cl". The field "NetInd" contains the network matrix, the field "Kmax" contains the maximum number of friends (number of columns in NetInd) and the field "nF" contains the vector for total number of friends for each observation (see ?NetIndClass for more information).

(Kmax <- attributes(dat.net)$netind_cl$Kmax)
NetInd_mat <- attributes(dat.net)$netind_cl$NetInd
head(NetInd_mat)
nF <- attributes(dat.net)$netind_cl$nF
head(nF)

Citation

To cite simcausal in publications, please use:

Sofrygin O, van der Laan MJ, Neugebauer R (2015). simcausal: Simulating Longitudinal Data with Causal Inference Applications. R package version 0.5.

Funding

The development of this package was partially funded through internal operational funds provided by the Kaiser Permanente Center for Effectiveness & Safety Research (CESR). This work was also partially supported through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1403-12506) and an NIH grant (R01 AI074345-07).

Copyright

This software is distributed under the GPL-2 license.

Copy Link

Version

Install

install.packages('simcausal')

Monthly Downloads

494

Version

0.5.7

License

GPL-2

Issues

Pull Requests

Stars

Forks

Maintainer

Last Published

October 19th, 2024

Functions in simcausal (0.5.7)

igraph.to.sparseAdjMat

Convert igraph Network Object into Sparse Adjacency Matrix
eval.target

Evaluate the True Value of the Causal Target Parameter
network

Define a Network Generator
doLTCF

Missing Variable Imputation with Last Time Point Value Carried Forward (LTCF)
node

Create Node Object(s)
distr.list

List All Custom Distribution Functions in simcausal.
plotDAG

Plot DAG
net.list

List All Custom Network Generator Functions in simcausal.
plotSurvEst

(EXPERIMENTAL) Plot Discrete Survival Function(s)
rconst

Constant (Degenerate) Distribution (Returns its Own Argument const)
parents

Show Node Parents Given DAG Object
rnet.gnm

Call igraph::sample_gnm to Generate Random Graph Object According to the G(n,m) Erdos-Renyi Model
rnet.SmWorld

Call igraph::sample_smallworld to Generate Random Graph Object from the Watts-Strogatz Small-World Model
rcategor.int

Random Sample from Base 1 (rcat.b1) or Base 0 (rcat.b0) Categorical (Integer) Distribution
rcat.factor

Random Sample for a Categorical Factor
rbern

Random Sample from Bernoulli Distribution
print.DAG.node

Print DAG.node Object
print.DAG

Print DAG Object
rnet.gnp

Call igraph::sample_gnp to Generate Random Graph Object According to the G(n,p) Erdos-Renyi Model
set.DAG

Create and Lock DAG Object
print.DAG.action

Print Action Object
simobs

Simulate Observed Data
simfull

Simulate Full Data (From Action DAG(s))
simcausal

Simulating Longitudinal Data with Causal Inference Applications
sim

Simulate Observed or Full Data from DAG Object
set.targetMSM

Define Causal Parameters with a Working Marginal Structural Model (MSM)
set.targetE

Define Non-Parametric Causal Parameters
vecfun.reset

Reset Custom Vectorized Function List
sparseAdjMat.to.NetInd

Convert Network from Sparse Adjacency Matrix into Network IDs Matrix
sparseAdjMat.to.igraph

Convert Network from Sparse Adjacency Matrix into igraph Object
rdistr.template

Template for Writing Custom Distribution Functions
vecfun.print

Print Names of Custom Vectorized Functions
vecfun.remove

Remove Custom Vectorized Functions
vecfun.add

Add Custom Vectorized Functions
vecfun.all.print

Print Names of All Vectorized Functions
NetInd.to.sparseAdjMat

Convert Network IDs Matrix into Sparse Adjacency Matrix
N

Subsetting/Indexing DAG Nodes
A

Subsetting/Indexing Actions Defined for DAG Object
DAG.empty

Initialize an empty DAG object
DF.to.longDT

Faster Conversion of Data from Wide to Long Format Using dcast.data.table
Define_sVar

Class for defining and evaluating user-specified summary measures (exprs_list)
add.action

Define and Add Actions (Interventions)
DF.to.long

Convert Data from Wide to Long Format Using reshape
add.nodes

Adding Node(s) to DAG
NetIndClass

R6 class for creating and storing a friend matrix (network IDs) for network data