The following description is an exegesis of section 4 of Krivitsky
et al. (2022).
Let \(\mathbf{g}\) be a vector of target statistics for the
network we wish to construct. That is, we are given an arbitrary network
\(\mathbf{y}^0 \in \mathcal{Y}\), and we seek a network
\(\mathbf{y} \in \mathcal{Y}\) such that
\(\mathbf{g}(\mathbf{y}) \approx \mathbf{g}\) -- ideally equality is achieved,
but in practice we may have to settle for a close approximation. The
variant of simulated annealing used for this search is as follows.
The energy function is defined
$$E_W (\mathbf{y}) = (\mathbf{g}(\mathbf{y}) - \mathbf{g})^\mathsf{T} W (\mathbf{g}(\mathbf{y}) - \mathbf{g}),$$
with \(W\) a symmetric matrix of weights that is positive definite (barring
multicollinearity among the statistics). This function attains 0 only if the
target is reached. A good choice of this matrix yields a more efficient search.
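As a concrete illustration, here is a minimal R sketch of the energy computation; the target vector, current statistics, and weight matrix below are invented for the example rather than taken from any fitted model.

```r
# Quadratic-form energy: (g(y) - g)' W (g(y) - g).
energy <- function(g_y, g_target, W) {
  d <- g_y - g_target
  drop(t(d) %*% W %*% d)
}

g_target <- c(edges = 20, triangles = 5)  # hypothetical targets g
g_y      <- c(edges = 23, triangles = 4)  # hypothetical current statistics g(y)
W        <- diag(2) / 2                   # initial weight matrix I_p / p with p = 2

energy(g_y, g_target, W)  # strictly positive here; 0 only when g(y) matches g
```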
A standard simulated annealing loop is used, as described below, with some
modifications. In particular, we allow the user to specify a vector of
offsets \(\eta\) to bias the annealing, with \(\eta_k = 0\) denoting no
offset. Offsets can be used with SAN to forbid certain statistics from ever
increasing or decreasing. As with ergm(), offset terms are specified using
the offset() decorator and their coefficients are specified with the
offset.coef argument. By default, finite offsets are ignored by SAN, but
this can be overridden by setting the control.san() argument
SAN.ignore.finite.offsets = FALSE.
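As an illustration, the following sketch forbids the triangle count from ever increasing by giving it an infinite negative offset. The florentine data set ships with ergm, but the target value and control settings here are arbitrary choices for the example, not recommendations.

```r
library(ergm)
data(florentine)  # provides the flomarriage network

# Seek a network with roughly 30 edges while forbidding any proposal that
# increases the triangle count: with an offset coefficient of -Inf, such
# proposals have acceptance probability zero.
net <- san(flomarriage ~ edges + offset(triangle),
           target.stats = 30,
           offset.coef  = -Inf,
           control      = control.san(SAN.maxit = 4))
summary(net ~ edges + triangle)
```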
The number of simulated annealing runs is specified by the SAN.maxit control
parameter, and the initial value of the temperature \(T\) is set to SAN.tau.
The value of \(T\) decreases linearly until \(T = 0\) at the last run, which
implies that all proposals that increase \(E_W(\mathbf{y})\) are rejected.
The weight matrix \(W\) is initially set to \(I_p / p\), where \(I_p\) is the
identity matrix of appropriate dimension. For a given weight matrix \(W\) and
temperature \(T\), the simulated annealing iteration proceeds as follows:
1. Test if \(E_W(\mathbf{y}) = 0\). If so, then exit.
2. Generate a perturbed network \(\mathbf{y}^*\) from a proposal that
respects the model constraints. (This is typically the same proposal as
that used for MCMC.)
3. Store the quantity
\(\mathbf{g}(\mathbf{y}^*) - \mathbf{g}(\mathbf{y})\)
for later use.
4. Calculate the acceptance probability
$$\alpha = \exp[ - (E_W (\mathbf{y}^*) - E_W (\mathbf{y})) / T + \eta^\mathsf{T} (\mathbf{g}(\mathbf{y}^*) - \mathbf{g}(\mathbf{y}))].$$
(If \(|\eta_k| = \infty\) and \(g_k (\mathbf{y}^*) - g_k (\mathbf{y}) = 0\), their product is defined to be 0.)
5. Replace \(\mathbf{y}\) with \(\mathbf{y}^*\) with probability
\(\min(1, \alpha)\). (A numeric sketch of Steps 4 and 5 follows this list.)
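The following self-contained R sketch mimics Steps 4 and 5 for a single proposal. All statistic values, targets, weights, and the temperature are made up for the example; it is meant only to make the acceptance formula concrete.

```r
set.seed(1)
quad <- function(d, W) drop(t(d) %*% W %*% d)  # (g - target)' W (g - target)

g_target <- c(20, 5)      # targets g
g_y      <- c(23, 4)      # statistics of the current network, g(y)
g_ystar  <- c(22, 4)      # statistics of the proposed network, g(y*)
W        <- diag(2) / 2   # current weight matrix
Temp     <- 1             # current temperature T
eta      <- c(0, 0)       # offsets (none in this example)

delta <- g_ystar - g_y    # stored for the weight update (Step 3)
# A full implementation would treat Inf * 0 in the offset term as 0,
# per the convention stated above.
alpha <- exp(-(quad(g_ystar - g_target, W) - quad(g_y - g_target, W)) / Temp +
               sum(eta * delta))
accept <- runif(1) < min(1, alpha)  # Step 5: accept with probability min(1, alpha)
accept
```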
After the specified number of iterations, \(T\) is updated as described
above, and \(W\) is recalculated by first computing \(S\), the sample
covariance matrix of the proposed differences stored in Step 3 (whether or
not the corresponding proposals were accepted), and then setting
\(W = S^+ / \operatorname{tr}(S^+)\), where \(S^+\) is the
Moore–Penrose pseudoinverse of \(S\) and \(\operatorname{tr}(S^+)\) is its
trace. Because these differences closely reflect the relative variances of
and correlations among the network statistics, this choice of \(W\)
down-weights the statistics (and combinations of statistics) that the
proposals change most readily.
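A minimal sketch of this weight update, assuming the per-proposal differences from Step 3 have been collected as rows of a matrix and using MASS::ginv() for the pseudoinverse; the differences here are simulated placeholders.

```r
library(MASS)  # for ginv(), the Moore-Penrose pseudoinverse

set.seed(2)
# Stand-in for the stored differences g(y*) - g(y), one proposal per row.
diffs <- matrix(rnorm(200), ncol = 2)

S  <- cov(diffs)          # sample covariance matrix of the proposed differences
Sp <- ginv(S)             # S^+, the Moore-Penrose pseudoinverse
W  <- Sp / sum(diag(Sp))  # W = S^+ / tr(S^+)
```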
In Step 2, the wide range of available MCMC proposals provides an effective
means of speeding up the SAN algorithm's search for a viable network.
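For example, one way to influence the proposal used in Step 2 is through the constraints argument to san(). The sketch below (again using the florentine data, with an arbitrary triangle target) fixes the degree sequence so that only degree-preserving proposals are considered; the particular constraint and target are illustrative only.

```r
library(ergm)
data(florentine)

# Search for a network with a different triangle count while holding every
# actor's degree fixed; the degree-preserving proposal shrinks the search space.
net2 <- san(flomarriage ~ triangle,
            target.stats = 4,
            constraints  = ~ degrees)
summary(net2 ~ triangle)
```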