ergm: Exponential-Family Random Graph Models

Description

ergm() is used to fit exponential-family random graph models (ERGMs), in which the probability of a given network, $y$, on a set of nodes is $h(y) \exp\{\eta(\theta) \cdot g(y)\}/c(\theta)$, where $h(y)$ is the reference measure (usually $h(y)=1$), $g(y)$ is a vector of network statistics for $y$, $\eta(\theta)$ is a natural parameter vector of the same length (with $\eta(\theta)=\theta$ for most terms), and $c(\theta)$ is the normalizing constant for the distribution. ergm() can return a maximum pseudo-likelihood estimate, an approximate maximum likelihood estimate based on a Monte Carlo scheme, or an approximate contrastive divergence estimate based on a similar scheme. (For an overview of the package ergm, see Hunter et al. (2008) and Krivitsky et al. (2023).)
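
To make the probability formula concrete, the following base-R sketch enumerates every undirected graph on three nodes and evaluates it for an edges-only model with $h(y)=1$ and $\eta(\theta)=\theta$. It does not call ergm(); the value of theta and all variable names are illustrative assumptions.

```r
# Brute-force ERGM on 3 nodes (3 dyads, 2^3 = 8 graphs), edges-only model.
theta <- -0.5                                  # illustrative coefficient

# Each row is one graph, coded by presence (1) or absence (0) of each dyad.
graphs <- as.matrix(expand.grid(d12 = 0:1, d13 = 0:1, d23 = 0:1))

g <- rowSums(graphs)                           # g(y): number of edges
unnorm <- exp(theta * g)                       # h(y) * exp(theta * g(y)), with h(y) = 1
c_theta <- sum(unnorm)                         # normalizing constant c(theta)
prob <- unnorm / c_theta                       # probability of each graph

# For this particular model, c(theta) has the closed form (1 + exp(theta))^3.
closed_form <- (1 + exp(theta))^3
print(cbind(graphs, edges = g, prob = round(prob, 4)))
```

The number of graphs grows as 2 to the power of the number of dyads, so this enumeration is feasible only for very small networks; that is the reason the likelihood is approximated by Monte Carlo in practice.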

Usage

ergm(
  formula,
  response = NULL,
  reference = ~Bernoulli,
  constraints = ~.,
  obs.constraints = ~. - observed,
  offset.coef = NULL,
  target.stats = NULL,
  eval.loglik = getOption("ergm.eval.loglik"),
  estimate = c("MLE", "MPLE", "CD"),
  control = control.ergm(),
  verbose = FALSE,
  ...,
  basis = ergm.getnetwork(formula),
  newnetwork = c("one", "all", "none")
)
is.ergm(object)
# S3 method for ergm
is.na(x)
# S3 method for ergm
anyNA(x, ...)
# S3 method for ergm
nobs(object, ...)
# S3 method for ergm
alias(object, ...)
# S3 method for ergm
print(x, digits = max(3, getOption("digits") - 3), ...)
# S3 method for ergm
vcov(object, sources = c("all", "model", "estimation"), ...)

Value

ergm() returns an object of class ergm that is a list consisting of the following elements:

coef

The Monte Carlo maximum likelihood estimate of $\theta$, the vector of coefficients for the model parameters.

sample

The $n\times p$ matrix of network statistics, where $n$ is the sample size and $p$ is the number of network statistics specified in the model, generated by the last iteration of the MCMC-based likelihood maximization routine. These statistics are centered with respect to the observed statistics or target.stats, unless missing data MLE is used.

sample.obs

As sample, but for the constrained sample.

iterations

The number of Newton-Raphson iterations required before convergence.

MCMCtheta

The value of $\theta$ used to produce the Markov chain Monte Carlo sample. As long as the Markov chain mixes sufficiently well, sample is roughly a random sample from the distribution of network statistics specified by the model with the parameter equal to MCMCtheta. If estimate="MPLE" then MCMCtheta equals the MPLE.

loglikelihood

The approximate change in log-likelihood in the last iteration. The value is only approximate because it is estimated based on the MCMC random sample.

gradient

The value of the gradient vector of the approximated loglikelihood function, evaluated at the maximizer. This vector should be very close to zero.

covar

Approximate covariance matrix for the MLE, based on the inverse Hessian of the approximated loglikelihood evaluated at the maximizer.

failure

Logical: Did the MCMC estimation fail?

network

Network passed on the left-hand side of formula. If target.stats are passed, it is replaced by the network returned by san().

newnetworks

If argument newnetwork is "all", a list of the final networks at the end of the MCMC simulation, one for each thread.

newnetwork

If argument newnetwork is "one" or "all", the first (possibly only) element of newnetworks.

coef.init

The initial value of $\theta$.

est.cov

The covariance matrix of the model statistics in the final MCMC sample.

coef.hist, steplen.hist, stats.hist, stats.obs.hist

For the MCMLE method, the history of coefficients, Hummel step lengths, and average model statistics for each iteration.

control

The control list passed to the call.

etamap

The set of functions mapping the true parameter theta to the canonical parameter eta (irrelevant except in a curved exponential family model).

formula

The original formula passed to ergm().

target.stats

The target.stats used during estimation (passed through from the Arguments).

target.esteq

Used for curved models to preserve the target mean values of the curved terms. It is identical to target.stats for non-curved models.

constraints

Constraints used during estimation (passed through from the Arguments).

reference

The reference measure used during estimation (passed through from the Arguments).

estimate

The estimation method used (passed through from the Arguments).

offset

A logical vector indicating which model parameters are to be set at a fixed value (i.e., not estimated).

drop

If control$drop=TRUE, a numeric vector indicating which terms were dropped due to extreme values of the corresponding statistics on the observed network, and how:

0

The term was not dropped.

-1

The term was at its minimum and the coefficient was fixed at -Inf.

+1

The term was at its maximum and the coefficient was fixed at +Inf.
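
A quick base-R illustration of why an extreme statistic forces an infinite coefficient: in an unconstrained edges-only model the MLE is the log-odds of the network density, which diverges when the observed network is empty or complete. The dyad count below is an illustrative assumption.

```r
n_dyads <- 120                               # e.g., 16 nodes, undirected

# Statistic at its minimum (no edges): the coefficient diverges to -Inf.
theta_empty <- qlogis(0 / n_dyads)

# Statistic at its maximum (all edges present): the coefficient diverges to +Inf.
theta_full <- qlogis(n_dyads / n_dyads)
```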

estimable

A logical vector indicating which terms could not be estimated because a constraint given in constraints fixes that term at a constant value.

info

A list with miscellaneous information that would typically be accessed by the user via methods; in general, it should not be accessed directly. Current elements include:

terms_dind

Logical indicator of whether the model terms are all dyad-independent.

space_dind

Logical indicator of whether the sample space constraints are all dyad-independent.

n_info_dyads

Number of “informative” dyads: those that are observed (not missing) and not constrained by sample space constraints; one of the measures of sample size.

obs

Logical indicator of whether an observational (missing data) process was involved in estimation.

valued

Logical indicator of whether the model is valued.

null.lik

Log-likelihood of the null model. Valid only for unconstrained models.

mle.lik

The approximate log-likelihood for the MLE. The value is only approximate because it is estimated based on the MCMC random sample.

Arguments

formula

An R formula, of the form y ~ <model terms>, where y is a network object or a matrix that can be coerced to a network object. For the details on the possible <model terms>, see ergmTerm and Morris et al. (2008) for binary ERGM terms and Krivitsky (2012) for valued ERGM terms (terms for weighted edges). To create a network object in R, use the network() function, then add nodal attributes to it using the %v% operator if necessary. Enclosing a model term in offset() fixes its coefficient at the value specified in offset.coef. (A second argument, a logical or numeric index vector, can be used to select which of the parameters within the term are offsets.)

response

Either a character string, a formula, or NULL (the default), to specify the response attributes and whether the ERGM is binary or valued. Interpreted as follows:

NULL

Model simple presence or absence, via a binary ERGM.

character string

The name of the edge attribute whose value is to be modeled. Type of ERGM will be determined by whether the attribute is logical (TRUE/FALSE) for binary or numeric for valued.

a formula

Must be of the form NAME~EXPR|TYPE (with | being literal). EXPR is evaluated in the formula's environment with the network's edge attributes accessible as variables. The optional NAME specifies the name of the edge attribute into which the results should be stored, with the default being a concise version of EXPR. Normally, the type of ERGM is determined by whether the result of evaluating EXPR is logical or numeric, but the optional TYPE can be used to override this by specifying a scalar of the type involved (e.g., TRUE for binary and 1 for valued).

reference

A one-sided formula specifying the reference measure ($h(y)$) to be used. See help for ERGM reference measures implemented in the ergm package.

constraints

A formula specifying one or more constraints on the support of the distribution of the networks being modeled. Multiple constraints may be given, separated by “+” and “-” operators. See ergmConstraint for the detailed explanation of their semantics and also for an indexed list of the constraints visible to the ergm package.

The default is to have no constraints except those provided through the ergmlhs API.

Together with the model terms in the formula and the reference measure, the constraints define the distribution of networks being modeled.

It is also possible to specify a proposal function directly, either by passing a string with the function's name (in which case arguments to the proposal should be specified through the MCMC.prop.args argument to the relevant control function), or by giving it on the LHS of the hints formula passed to the MCMC.prop argument of the control function. This will override the one chosen automatically.

Note that not all possible combinations of constraints and reference measures are supported. However, for relatively simple constraints (i.e., those that simply permit or forbid specific dyads or sets of dyads from changing), arbitrary combinations should be possible.

obs.constraints

A one-sided formula specifying one or more constraints or other modification in addition to those specified by constraints, following the same syntax as the constraints argument.

This allows the domain of the integral in the numerator of the partially observed network face-value likelihoods of Handcock and Gile (2010) and Karwa et al. (2017) to be specified explicitly.

The default is to constrain the integral to only integrate over the missing dyads (if present), after incorporating constraints provided through the ergmlhs API.

It is also possible to specify a proposal function directly by passing a string with the function's name as the obs.MCMC.prop argument to the relevant control function. In that case, arguments to the proposal should be specified through the obs.prop.args argument to the relevant control function.

offset.coef

A vector of coefficients for the offset terms. Note that NaN elements are treated specially. See Skipping below. If the vector is named, its names will be matched to the corresponding coefficient names, and if the named vector has a single coefficient without a name, it will be used for the unmatched coefficients. In particular, setNames(x,"") will be treated as a vector of xs.
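
The name-matching rule can be sketched in base R as follows. The helper expand_named() is hypothetical and written only to mirror the behavior described above; it is not a function from the ergm package, and the coefficient names are illustrative.

```r
# Hypothetical helper: expand a partially named vector to one value per target name.
expand_named <- function(x, target_names) {
  nm <- names(x)
  if (is.null(nm)) {
    # A fully unnamed vector must already supply one value per target name.
    stopifnot(length(x) == length(target_names))
    return(setNames(x, target_names))
  }
  default <- unname(x[nm == ""])        # at most one unnamed "fill" value
  named <- x[nm != ""]
  stopifnot(length(default) <= 1, all(names(named) %in% target_names))
  fill <- if (length(default) == 1) default else NA_real_
  out <- setNames(rep(fill, length(target_names)), target_names)
  out[names(named)] <- named            # named entries override the fill value
  out
}

coef_names <- c("edges", "triangle", "kstar2")        # illustrative names
expanded <- expand_named(c(triangle = -Inf, 0), coef_names)
all_same <- expand_named(setNames(0.5, ""), coef_names)
```

Here expanded assigns -Inf to triangle and 0 to the two unmatched names, while all_same repeats 0.5 for every name, matching the setNames(x,"") case.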

target.stats

A vector of "observed network statistics", to be used if these statistics are for some reason different from the actual statistics of the network on the left-hand side of formula. Equivalently, this vector gives the mean-value parameter values for the model. If this is given, the algorithm finds the natural parameter values corresponding to these mean-value parameters. If NULL, the mean-value parameters used are the observed statistics of the network in the formula. If the vector is named, its names will be matched to the corresponding statistic names, and if the named vector has a single element without a name, it will be used for the unmatched statistics. In particular, setNames(x,"") will be treated as a vector of xs.
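
For intuition about the mean-value and natural parameterizations, consider an unconstrained edges-only binary model, where the correspondence is available in closed form: the natural parameter is the log-odds of the target density. The sketch below uses base R only; the network size and target count are illustrative assumptions.

```r
n_nodes <- 16
n_dyads <- choose(n_nodes, 2)            # 120 undirected dyads
target_edges <- 20                       # illustrative target statistic

# Natural parameter whose expected edge count equals the target.
theta <- qlogis(target_edges / n_dyads)

# Check: the expected number of edges under theta recovers the target.
expected_edges <- n_dyads * plogis(theta)
```

For dyad-dependent models no such closed form exists, which is why the mapping has to be found numerically.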

eval.loglik

Logical: For dyad-dependent models, if TRUE, use bridge sampling to evaluate the log-likelihood associated with the fit. Has no effect for dyad-independent models. Since bridge sampling takes additional time, setting this to FALSE may speed up fitting if likelihood values (and likelihood-based values like AIC and BIC) are not needed. Can be set globally via options(ergm.eval.loglik=...), which is set to TRUE when the package is loaded. (See options?ergm.)

estimate

If "MPLE", then the maximum pseudolikelihood estimator is returned. If "MLE" (the default), then an approximate maximum likelihood estimator is returned. For certain models, the MPLE and MLE are equivalent, in which case this argument is ignored. (To force MCMC-based approximate likelihood calculation even when the MLE and MPLE are the same, see the force.main argument of control.ergm().) If "CD" (EXPERIMENTAL), the Monte Carlo contrastive divergence estimate is returned.
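
For a dyad-independent model the pseudo-likelihood coincides with the likelihood and reduces to a logistic regression of each dyad's tie indicator on its change statistics. A minimal base-R sketch of that idea follows, using a small made-up undirected network and a made-up nodal covariate (both are assumptions for illustration; this is not a description of how ergm() is implemented internally).

```r
n <- 8
x <- c(3, 10, 4, 25, 7, 12, 5, 18)              # made-up nodal covariate

# Made-up undirected ties, stored in a symmetric adjacency matrix.
ties <- rbind(c(1, 2), c(1, 3), c(2, 4), c(3, 5), c(4, 6),
              c(5, 7), c(6, 8), c(2, 6), c(3, 7))
adj <- matrix(0, n, n)
adj[ties] <- 1
adj[ties[, 2:1]] <- 1

dyads <- t(combn(n, 2))                         # all pairs with i < j
y <- adj[dyads]                                 # tie indicator for each dyad
absdiff <- abs(x[dyads[, 1]] - x[dyads[, 2]])   # change statistic of an absdiff-type term

# The intercept plays the role of the edges coefficient (its change statistic is 1).
fit <- glm(y ~ absdiff, family = binomial)
coef(fit)

# With edges alone, the estimate is the log-odds of the density.
fit_edges <- glm(y ~ 1, family = binomial)
```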

control

A list of control parameters for algorithm tuning, typically constructed with control.ergm(). Its documentation gives the list of recognized control parameters and their meaning. The more generic utility snctrl() (StatNet ConTRoL) also provides argument completion for the available control functions and limited argument name checking.

verbose

A logical or an integer to control the amount of progress and diagnostic information to be printed. FALSE/0 produces minimal output, with higher values producing more detail. Note that very high values (5+) may significantly slow down processing.

...

Additional arguments, to be passed to lower-level functions.

basis

A value (usually a network) to override the LHS of the formula.

newnetwork

One of "one" (the default), "all", or "none" (or, equivalently, FALSE), specifying whether the network(s) from the last iteration of the MCMC sampling should be returned as a part of the fit as the elements newnetwork and newnetworks. (See their entries in the Value section for details.) Partial matching is supported.

object

An ergm object.

x, digits

See print().

sources

For the vcov method, specify whether to return the covariance matrix from the ERGM model, the estimation process, or both combined.

Methods (by generic)

is.na(ergm): Return TRUE if the ERGM was fit to a partially observed network and/or an observational process, such as missing (NA) dyads.
anyNA(ergm): Alias to the is.na() method.
nobs(ergm): Return the number of informative dyads of a model fit.
alias(ergm): Extract a matrix of detected linear dependence among the model's sufficient statistics or estimating functions (if curved). Each row, if any, contains coefficients for a linear combination of the statistics that results in a constant. These are pretty-printed as a series of equations.
print(ergm): Print the call, the estimate, and the method used to obtain it.
vcov(ergm): Extract the variance-covariance matrix of parameter estimates.

Skipping MCMC iterations (advanced)

In some scenarios, it is helpful to forbid certain network configurations from being sampled. This can be specified using constraints, or by creating an offset() term which has value 0 if the network is allowed and positive (negative) if the network is not, with the offset coefficient set to -Inf (+Inf). Sometimes, however, a permitted configuration can be reached by "passing through" a forbidden one: for example, if isolates are possible but nodes with degree exactly 1 are not. In that case, an offset term with coefficient NaN (not NA!) will cause the MCMC to not terminate as long as the value of that offset term is different from 0.

Note that this means that MCMC is not guaranteed to terminate, and there are very few safeguards at this time.

Notes on model specification

Although each of the statistics in a given model is a summary statistic for the entire network, it is rarely necessary to calculate statistics for an entire network in a proposed Metropolis-Hastings step. Thus, for example, if the triangle term is included in the model, a census of all triangles in the observed network is never taken; instead, only the change in the number of triangles is recorded for each edge toggle.
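
The change-statistic idea can be illustrated in base R: toggling the dyad (i, j) changes the triangle count by the number of neighbors that i and j share, so no full census is needed. The adjacency matrix and helper names below are illustrative assumptions, not ergm internals.

```r
# Full census: number of triangles in an undirected simple graph.
count_triangles <- function(a) sum(diag(a %*% a %*% a)) / 6

# Change statistic: triangles gained (or lost) by toggling the dyad (i, j),
# equal to the number of neighbors shared by i and j.
triangle_change <- function(a, i, j) sum(a[i, ] * a[j, ])

# Small made-up undirected graph on 5 nodes.
tie_list <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(2, 4), c(4, 5))
a <- matrix(0, 5, 5)
a[tie_list] <- 1
a[tie_list[, 2:1]] <- 1

before <- count_triangles(a)
delta <- triangle_change(a, 1, 4)     # computed without recounting triangles

# Verify against a full recount after adding the tie 1-4.
b <- a
b[1, 4] <- b[4, 1] <- 1
after <- count_triangles(b)
```

Here the difference after - before equals delta, while computing delta touches only two rows of the adjacency matrix.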

In the implementation of ergm(), the model is initialized in R, then all the model information is passed to a C program that generates the sample of network statistics using MCMC. This sample is then returned to R, which uses one of several algorithms, selected by the main.method parameter of control.ergm(), to update the estimate.

The mechanism for proposing new networks for the MCMC sampling scheme, which is a Metropolis-Hastings algorithm, depends on two things: The constraints, which define the set of possible networks that could be proposed in a particular Markov chain step, and the weights placed on these possible steps by the proposal distribution. The former may be controlled using the constraints argument described above. The latter may be controlled using the prop.weights argument to the control.ergm() function.
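
As a rough illustration of the toggle-based Metropolis-Hastings scheme described above, the following base-R sketch samples from an edges-only model by proposing a uniformly chosen dyad to toggle at each step. It is a toy under stated assumptions (uniform dyad proposal, no constraints, illustrative theta, chain length, and burn-in) and is far simpler than the C sampler that ergm() uses.

```r
set.seed(42)
n <- 5
n_dyads <- choose(n, 2)                 # 10 undirected dyads
theta <- -1                             # illustrative edges coefficient
y <- integer(n_dyads)                   # state: tie indicator per dyad, starting empty
n_steps <- 20000
edge_count <- numeric(n_steps)

for (s in seq_len(n_steps)) {
  d <- sample.int(n_dyads, 1)           # propose toggling one dyad, chosen uniformly
  delta <- if (y[d] == 1) -1 else 1     # change in the edges statistic
  # The proposal is symmetric, so accept with probability min(1, exp(theta * delta)).
  if (log(runif(1)) < theta * delta) y[d] <- 1L - y[d]
  edge_count[s] <- sum(y)
}

burn <- 2000                            # discard early steps
mean_density <- mean(edge_count[-seq_len(burn)]) / n_dyads
```

For this model the stationary density is plogis(theta), so mean_density should be close to that value; non-uniform proposal weights would require the proposal ratio to be included in the acceptance probability.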

The package is designed so that the user could conceivably add additional proposal types.

Examples

library(ergm)
#
# load the Florentine marriage data matrix
#
data(flo)
#
# print the sociomatrix for the Florentine marriage data.
# This is not yet a network object.
#
flo
#
# Create a network object out of the adjacency matrix
#
flomarriage <- network(flo,directed=FALSE)
flomarriage
#
# print out the sociomatrix for the Florentine marriage data
#
flomarriage[,]
#
# create a vector indicating the wealth of each family (in thousands of lira) 
# and add it as a covariate to the network object
#
flomarriage %v% "wealth" <- c(10,36,27,146,55,44,20,8,42,103,48,49,10,48,32,3)
flomarriage
#
# create a plot of the social network
#
plot(flomarriage)
#
# now make the vertex size proportional to their wealth
#
plot(flomarriage, vertex.cex=flomarriage %v% "wealth" / 20, main="Marriage Ties")
#
# Use 'data(package = "ergm")' to list the data sets in a package
#
data(package="ergm")
#
# Load a network object of the Florentine data
#
data(florentine)
#
# Fit a model where the propensity to form ties between
# families depends on the absolute difference in wealth
#
gest <- ergm(flomarriage ~ edges + absdiff("wealth"))
summary(gest)
#
# add terms for the propensity to form 2-stars and triangles
# of families 
#
gest <- ergm(flomarriage ~ kstar(1:2) + absdiff("wealth") + triangle)
summary(gest)

# import synthetic network that looks like a molecule
data(molecule)
# Add an attribute to it to mimic the atomic type
molecule %v% "atomic type" <- c(1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
#
# create a plot of the network
# colored by atomic type
#
plot(molecule, vertex.col="atomic type",vertex.cex=3)

# measure tendency to match within each atomic type
gest <- ergm(molecule ~ edges + kstar(2) + triangle + nodematch("atomic type"))
summary(gest)

# compare it to differential homophily by atomic type
gest <- ergm(molecule ~ edges + kstar(2) + triangle
                        + nodematch("atomic type",diff=TRUE))
summary(gest)
# Extract parameter estimates as a numeric vector:
coef(gest)
# Sources of variation in parameter estimates:
vcov(gest, sources="model")
vcov(gest, sources="estimation")
vcov(gest, sources="all") # the default