network: Define a Network Generator

Description

Define a network generator by providing a function (via the argument netfun) that simulates a network of connected friends for observations i in 1:n. This network then serves as a backbone for defining and simulating from structural equation models for dependent data. In particular, the network allows new nodes to be defined as functions of the previously simulated node values of i's friends, across all observations i.

Let F_i denote the set of friends of observation i (observations in F_i are assumed to be "connected" to i), and refer to the union of these sets F_i as a "network" on n observations, denoted by F. A user-supplied network generating function netfun should simulate such a network F by returning a matrix of n rows, where each row i defines a friend set F_i; that is, row i should be a vector of observations in 1:n that are connected to i (friends of i), with the remainder filled by NAs. Each friend set F_i can contain up to Kmax unique indices j from 1:n, excluding i itself. F_i may also be empty (row i has only NAs), meaning that i has no friends.

The functionality is illustrated in the examples below. For additional information see Details. To learn how to use the node function for defining a node as a function of the friend node values, see Syntax and Network Summary Measures.

Usage

network(name, netfun, ..., params = list())

Value

A list containing the network object(s) of type DAG.net. This list is used when data are simulated with the sim function.

Arguments

name: Character string specifying the name of the current network. It may be used to add a new network that replaces an existing one (resampling the previous network).
netfun: Character name of the user-defined network generating function, which can be any R function that returns a matrix of friend IDs of dimension c(n, Kmax). The function must accept a named argument n that specifies the total sample size of the network. The matrix of network IDs should have n rows and Kmax columns, where each row i contains a vector of unique IDs in 1:n that are i's friends (observations that can influence i's node distribution), excluding i itself. Arguments to netfun can be passed either as named arguments to the network function itself or as a named list of parameters params. These network arguments can themselves be functions of the previously defined node names, allowing the network sampling itself to depend on the previously simulated node values, as shown in Example 2.
...: Named arguments specifying distribution parameters that are accepted by the network sampling function in netfun. These parameters can be R expressions that are themselves formulas of the past node names.
params: A list of additional named parameters to be passed on to the netfun function. The parameters have to be either constants or character strings of R expressions of the past node names.
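
The contract for netfun described above can be sketched in base R. The function name gen_random_net and its argument K are hypothetical and chosen only for illustration; the only requirements taken from this page are the named argument n and the returned n-row matrix of friend IDs padded with NAs.

```r
# Hypothetical network generator (not part of simcausal).
# Returns an n x K integer matrix: row i holds up to K unique friend IDs
# from 1:n, never i itself, with unused slots filled by NA.
gen_random_net <- function(n, K = 3L, ...) {
  NetInd <- matrix(NA_integer_, nrow = n, ncol = K)
  for (i in seq_len(n)) {
    candidates <- setdiff(seq_len(n), i)   # i cannot be its own friend
    # Draw the number of friends uniformly from 0..min(K, n - 1)
    n_friends <- sample.int(min(K, length(candidates)) + 1L, 1L) - 1L
    if (n_friends > 0L) {
      # sample.int() on positions avoids the sample(x) pitfall when length(x) == 1
      picked <- candidates[sample.int(length(candidates), n_friends)]
      NetInd[i, seq_len(n_friends)] <- sort(picked)
    }
  }
  NetInd
}

set.seed(1)
net <- gen_random_net(n = 6, K = 3L)
dim(net)   # 6 rows, 3 columns
```

Mirroring Example 2 below, such a function would presumably be registered by name, with its extra argument passed through network, e.g. netfun = "gen_random_net", K = 3.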

Syntax

The network function call that defines the network of friends can be added to a growing DAG object by using '+' syntax, much like a new node is added to a DAG. Subsequently defined nodes (node function calls) can employ the double square bracket subsetting syntax to reference previously simulated node values for specific friends in F_i simultaneously across all observations i. For example, VarName[[net_indx]] can be used inside the node formula to reference the node VarName values of i's friends in F_i[net_indx], simultaneously across all i in 1:n.

The friend subsetting index net_indx can be any non-negative integer vector that takes values from 0 to Kmax. The value 0 refers to the VarName node values of observation i itself (equivalent to just using VarName in the node formula). A net_indx value of 1 refers to the VarName values for observations in F_i[1], across all i in 1:n (that is, the value of VarName for i's first friend F_i[1] if that friend exists, and NA otherwise), and so on, up to a net_indx value of Kmax, which references the VarName values of the last friend, as defined by observations in F_i[Kmax] across all i. Note that net_indx can be a vector (e.g., net_indx=c(1:Kmax)), in which case the result of the query VarName[[c(1:Kmax)]] is a matrix of Kmax columns and n rows.

By default, VarName[[j]] evaluates to missing (NA) when observation i does not have a friend under F_i[j] (i.e., in the jth spot of i's friend set). This default behavior can, however, be changed to return 0 instead of NA by passing an additional argument replaceNAw0 = TRUE to the corresponding node function.
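
The indexing semantics above can be mimicked by hand in base R on a toy network. This is only an emulation of what VarName[[net_indx]] is described to return, not the package's internal code; the object names NetInd, W, friends_W and friends_W0 are made up for the illustration.

```r
# Toy network: 4 observations, Kmax = 2; NA marks an empty friend slot.
NetInd <- rbind(c(2L, 3L),
                c(1L, NA),
                c(NA, NA),
                c(1L, 2L))
W <- c(10, 20, 30, 40)   # previously simulated node values, one per observation

# Analogue of W[[1]]: the value of W for each i's first friend (NA if none).
first_friend_W <- W[NetInd[, 1]]
first_friend_W            # 20 10 NA 10

# Analogue of W[[1:Kmax]]: an n x Kmax matrix of friend values.
friends_W <- apply(NetInd, 2, function(ids) W[ids])
friends_W

# Analogue of replaceNAw0 = TRUE: missing friends contribute 0 instead of NA.
friends_W0 <- friends_W
friends_W0[is.na(friends_W0)] <- 0
friends_W0
```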

Network Summary Measures

One can also define summary measures of the network covariates by specifying a node formula that applies an R function to the result of VarName[[net_indx]]. The rules for defining and applying such summary measures are identical to the rules for defining summary measures for time-varying nodes VarName[t_indx]. For example, use sum(VarName[[net_indx]]) to define a summary measure as a sum of VarName values of friends in F_i[net_indx], across all observations i in 1:n. Similarly, use mean(VarName[[net_indx]]) to define a summary measure as a mean of VarName values of friends in F_i[net_indx], across all i. For more details on defining such summary functions see the simcausal vignette.
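
As a rough base-R picture of what such summary measures compute, the sketch below applies row-wise summaries to a hand-built matrix of friend values. It is an emulation under stated assumptions: the toy objects are invented, and averaging only over existing friends is one possible definition of the mean, which may differ from how the package treats empty friend slots.

```r
# Toy network (Kmax = 2) and a previously simulated node W.
NetInd <- rbind(c(2L, 3L), c(1L, NA), c(NA, NA), c(1L, 2L))
W <- c(10, 20, 30, 40)

# n x Kmax matrix of friend values, with missing friends replaced by 0.
friends_W <- apply(NetInd, 2, function(ids) W[ids])
friends_W0 <- replace(friends_W, is.na(friends_W), 0)

# Analogue of sum(W[[1:Kmax]]): one total per observation.
sum_friends_W <- rowSums(friends_W0)       # 50 10 0 30

# Number of friends per observation, and a mean over existing friends only
# (an illustrative choice; NA when i has no friends).
n_friends <- rowSums(!is.na(NetInd))       # 2 1 0 2
mean_friends_W <- ifelse(n_friends > 0, sum_friends_W / n_friends, NA_real_)
mean_friends_W                             # 25 10 NA 15
```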

Details

Without the network of friends, the DAG objects constructed by calling the node function can only specify structural equation models for independent and identically distributed data. That is, if no network is specified, for each observation i a node can be defined conditionally only on i's own previously simulated node values. As a result, any two observations simulated under such a data-generating model are always independent and identically distributed. Defining a network F allows one to define a new structural equation model where a node for each observation i can depend on its own simulated past, but also on the previously simulated node values of i's friends (F_i). This is accomplished by allowing the data generating distribution for each observation i's node to be defined conditionally on the past node values of i's friends (observations in F_i). The network of friends can be used in subsequent calls to the node function, where new nodes (random variables) defined by the node function can depend on the node values of i's friends (observations in the set F_i). During simulation it is assumed that observations in F_i can simultaneously influence i.
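
To make the contrast concrete, the following hand-rolled base-R simulation (it does not use the package) draws a baseline node W independently for each observation and then a node A whose probability depends on the W values of each observation's friends. The sample size, the maximum number of friends K and the coefficients are arbitrary assumptions.

```r
set.seed(2024)
n <- 200
K <- 3   # illustrative maximum number of friends

# Toy friend matrix: row i holds 0..K friend IDs (never i), padded with NA.
NetInd <- t(vapply(seq_len(n), function(i) {
  ids <- sample(setdiff(seq_len(n), i), size = sample(0:K, 1))
  c(ids, rep(NA_integer_, K - length(ids)))
}, integer(K)))

# Baseline node: independent and identically distributed across observations.
W <- rbinom(n, size = 1, prob = 0.5)

# W values of each i's friends; missing friends contribute 0.
friends_W <- matrix(W[NetInd], nrow = n)
friends_W[is.na(friends_W)] <- 0

# A[i] depends on the baseline values of i's friends, so observations that
# share friends are no longer independent of one another.
A <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * rowSums(friends_W)))
table(A)
```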

Note that the current version of the package does not allow combining time-varying node indexing Var[t] and network node indexing Var[[net_indx]] for the same data generating distribution.

Each argument for the input network can be an evaluable R expression. All formulas are captured by delayed evaluation and are evaluated during the simulation. Formulas can refer to standard or user-specified R functions that must only apply to the values of previously defined nodes (i.e. node(s) that were called prior to network() function call).
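
Delayed evaluation in general can be illustrated with base R's quote and eval. This shows the mechanism only; how the package captures and evaluates formulas internally is not documented here, and the names prob_expr and sim_data are invented for the example.

```r
# Capture a formula without evaluating it; W1 does not exist at this point.
prob_expr <- quote(plogis(-0.2 + W1 / 3))

# Later, once node values have been simulated, evaluate the captured
# expression with the data as the evaluation environment.
sim_data <- data.frame(W1 = c(0, 3, 6))
prob_vals <- eval(prob_expr, envir = sim_data)
prob_vals
```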

Examples

library(simcausal)  # provides DAG.empty(), network(), node(), set.DAG(), sim()

#--------------------------------------------------------------------------------------------------
# EXAMPLE 1. USING igraph R PACKAGE TO SIMULATE NETWORKS
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
# Example of a network sampler, to be provided as the "netfun" argument to network(, netfun=);
# Generates a random graph according to the G(n,m) Erdos-Renyi model using the igraph package;
# Returns an (n,Kmax) matrix of net IDs (friends) by row;
# Row i contains the IDs (row numbers) of i's friends;
# i's friends are assumed connected to i and can influence i in equations defined by node();
# When i has fewer than Kmax friends, the remaining entries of row i are filled with NAs;
# Argument m_pn (> 0):
# the total number of edges in the network as a fraction (or multiplier) of n (sample size)
#--------------------------------------------------------------------------------------------------
gen.ER <- function(n, m_pn, ...) {
  m <- as.integer(m_pn*n)
  if (n<=10) m <- 20
  igraph.ER <- igraph::sample_gnm(n = n, m = m, directed = TRUE)
  sparse_AdjMat <- igraph.to.sparseAdjMat(igraph.ER)
  NetInd_out <- sparseAdjMat.to.NetInd(sparse_AdjMat)
  return(NetInd_out$NetInd_k)
}

D <- DAG.empty()
# Sample ER model network using igraph::sample_gnm with m_pn argument:
D <- D + network("ER.net", netfun = "gen.ER", m_pn = 50)
# W1 - categorical (6 categories, 1-6):
D <- D +
  node("W1", distr = "rcat.b1",
        probs = c(0.0494, 0.1823, 0.2806, 0.2680, 0.1651, 0.0546)) +
# W2 - binary infection status, positively correlated with W1:
  node("W2", distr = "rbern", prob = plogis(-0.2 + W1/3)) +
# W3 - binary confounder:
  node("W3", distr = "rbern", prob = 0.6)
# A[i] is a function of W1[i] and the totals of i's friends' values of W1, W2 and W3:
D <- D + node("A", distr = "rbern",
              prob = plogis(2 + -0.5 * W1 +
                            -0.1 * sum(W1[[1:Kmax]]) +
                            -0.4 * sum(W2[[1:Kmax]]) +
                            -0.7 * sum(W3[[1:Kmax]])),
              replaceNAw0 = TRUE)
# Y[i] is a function of netW3 (friends of i W3 values) and the total N of i's friends
# who are infected AND untreated:
D <- D + node("Y", distr = "rbern",
              prob = plogis(-1 + 2 * sum(W2[[1:Kmax]] * (1 - A[[1:Kmax]])) +
                            -2 * sum(W3[[1:Kmax]])
                            ),
              replaceNAw0 = TRUE)
# Can add N untreated friends to the above outcome Y equation: sum(1 - A[[1:Kmax]]):
D <- D + node("Y", distr = "rbern",
              prob = plogis(-1 + 1.5 * sum(W2[[1:Kmax]] * (1 - A[[1:Kmax]])) +
                            -2 * sum(W3[[1:Kmax]]) +
                            0.25 * sum(1 - A[[1:Kmax]])
                            ),
              replaceNAw0 = TRUE)
# Can add N infected friends at baseline to the above outcome Y equation: sum(W2[[1:Kmax]]):
D <- D + node("Y", distr = "rbern",
              prob = plogis(-1 + 1 * sum(W2[[1:Kmax]] * (1 - A[[1:Kmax]])) +
                            -2 * sum(W3[[1:Kmax]]) +
                            0.25 * sum(1 - A[[1:Kmax]]) +
                            0.25 * sum(W2[[1:Kmax]])
                            ),
              replaceNAw0 = TRUE)
Dset <- set.DAG(D, n.test = 100)
# Simulating data from the above SEM:
datnet <- sim(Dset, n = 1000, rndseed = 543)
head(datnet)
# Obtaining the network object from simulated data:
net_object <- attributes(datnet)$netind_cl
# Max number of friends:
net_object$Kmax
# Network matrix
head(attributes(datnet)$netind_cl$NetInd)

#--------------------------------------------------------------------------------------------------
# EXAMPLE 2. USING CUSTOM NETWORK GENERATING FUNCTION
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
# Example of a user-defined network sampler function
# Arguments K, bslVar[i] (W1) & nF are evaluated in the environment of the simulated data, then
# passed to the genNET() function
  # - K: maximum number of friends for any unit
  # - bslVar[i]: used for constructing weights for the probability of selecting i as
  # someone else's friend (weighted sampling); when missing, the sampling is uniform
  # - nF[i]: total number of friends that need to be sampled for observation i
#--------------------------------------------------------------------------------------------------
genNET <- function(n, K, bslVar, nF, ...) {
  prob_F <- plogis(-4.5 + 2.5*c(1:K)/2) / sum(plogis(-4.5 + 2.5*c(1:K)/2))
  NetInd_k <- matrix(NA_integer_, nrow = n, ncol = K)
  nFriendTot <- rep(0L, n)
  for (index in (1:n)) {
    FriendSampSet <- setdiff(c(1:n), index)
    nFriendSamp <- max(nF[index] - nFriendTot[index], 0L)
    if (nFriendSamp > 0) {
      if (length(FriendSampSet) == 1)  {
        friends_i <- FriendSampSet
      } else {
        friends_i <- sort(sample(FriendSampSet, size = nFriendSamp,
                          prob = prob_F[bslVar[FriendSampSet] + 1]))
      }
      NetInd_k[index, ] <- c(as.integer(friends_i),
                            rep_len(NA_integer_, K - length(friends_i)))
      nFriendTot[index] <- nFriendTot[index] + nFriendSamp
    }
  }
  return(NetInd_k)
}

D <- DAG.empty()
D <- D +
# W1 - categorical confounder (6 categories, 0-5):
  node("W1", distr = "rcat.b0",
        probs = c(0.0494, 0.1823, 0.2806, 0.2680, 0.1651, 0.0546)) +
# W2 - binary infection status at t=0, positively correlated with W1:
  node("W2", distr = "rbern", prob = plogis(-0.2 + W1/3)) +
# W3 - binary confounder:
  node("W3", distr = "rbern", prob = 0.6)

# def.nF: total number of friends for each i (0-K), each def.nF[i] is influenced by categorical W1
K <- 10
set.seed(12345)
normprob <- function(x) x / sum(x)
p_nF_W1_mat <- apply(matrix(runif((K+1)*6), ncol = 6, nrow = (K+1)), 2, normprob)
colnames(p_nF_W1_mat) <- paste0("p_nF_W1_", c(0:5))
create_probs_nF <- function(W1) t(p_nF_W1_mat[,W1+1])
vecfun.add("create_probs_nF")
D <- D + node("def.nF", distr = "rcat.b0", probs = create_probs_nF(W1))

# Adding the network generator that depends on nF and categorical W1:
D <- D + network(name="net.custom", netfun = "genNET", K = K, bslVar = W1, nF = def.nF)
# A[i] is a function of W1[i] as well as the totals of i's friends' values of W1, W2 and W3:
D <- D + node("A", distr = "rbern",
              prob = plogis(2 + -0.5 * W1 +
                            -0.1 * sum(W1[[1:Kmax]]) +
                            -0.4 * sum(W2[[1:Kmax]]) +
                            -0.7 * sum(W3[[1:Kmax]])),
              replaceNAw0 = TRUE)
# Y[i] is a function of the total N of i's friends who are infected AND untreated
# and of the friends' W3 values
D <- D + node("pYRisk", distr = "rconst",
              const = plogis(-1 + 2 * sum(W2[[1:Kmax]] * (1 - A[[1:Kmax]])) +
                              -1.5 * sum(W3[[1:Kmax]])),
              replaceNAw0 = TRUE)

D <- D + node("Y", distr = "rbern", prob = pYRisk)
Dset <- set.DAG(D, n.test = 100)

# Simulating data from the above SEM:
datnet <- sim(Dset, n = 1000, rndseed = 543)
head(datnet, 10)
# Obtaining the network object from simulated data:
net_object <- attributes(datnet)$netind_cl
# Max number of friends:
net_object$Kmax
# Network matrix
head(attributes(datnet)$netind_cl$NetInd)
plotDAG(Dset)
