backShift: Estimate connectivity matrix of a directed graph with linear effects and hidden variables.

Description

This function estimates the connectivity matrix of a directed (possibly cyclic) graph with hidden variables. The underlying system is required to be linear and we assume that observations under different shift interventions are available. More precisely, the function takes as an input an (nxp) data matrix, where $n$ is the sample size and $p$ the number of variables. In each environment $j$ ($j$ in {$1, \ldots, J$}) we have observed $n_j$ samples generated from $$X_j= X_j * A + c_j + e_j$$ (in case of cycles this should be understood as an equilibrium distribution). The $c_j$ is a p-dimensional random vector that is assumed to have a diagonal covariance matrix. The noise vector $e_j$ is assumed to have the same distribution in all environments $j$ but is allowed to have an arbitrary covariance matrix. The different intervention settings are provided to the method with the help of the vector ExpInd of length $n = (n_1 + ... + n_j + ... + n_J)$. The goal is to estimate the connectivity matrix $A$.

Usage

backShift(X, ExpInd, covariance=TRUE, ev=0, threshold =0.75, nsim=100,
          sampleSettings=1/sqrt(2), sampleObservations=1/sqrt(2),
          nodewise=TRUE, tolerance = 10^(-4), baseSettingEnv = 1,
          verbose = FALSE)

Arguments

A (nxp)-dimensional matrix (or data frame) with n observations of p variables.

ExpInd

Indicator of the experiment or the intervention type an observation belongs to. A numeric vector of length n. Has to contain at least three different unique values.

covariance

A boolean variable. If TRUE, use only shift in covariance matrix; otherwise use shift in Gram matrix. Set only to FALSE if at most one variable has a non-zero shift in mean in the same setting (default is TRUE).

The expected number of false selections for stability selection. No stability selection computed if ev=0. Defaults to ev=0.

threshold

The selection threshold for stability selection (has to be between 0.5 and 1). Edges which are selected with empirical proportion higher than threshold will be retained.

nsim

Number of resamples taken (if using stability selection).

sampleSettings

The proportion of unique settings to resample for each resample; has to be in [0,1].

sampleObservations

The fraction of all samples to retain when subsampling (no replacement); has to be in [0,1].

nodewise

If FALSE, stability selection retains for each subsample the largest overall entries in the connectivity matrix. If TRUE, values are ordered row- and node-wise first and then the largest entries in each row and column are retained. Error control is valid (under exchangeability assumption) in both cases. The latter setting TRUE is perhaps more robust and is the default.

tolerance

Precision parameter for ffdiag: the algorithm stops when the criterium difference between two iterations is less than tolerance. Default is 10^(-4).

baseSettingEnv

Index for baseline environment against which the intervention variances are measured. Defaults to 1.

verbose

If FALSE, most messages are supressed.

Value

A list with elements

Ahat

The connectivity matrix where entry (i,j) is the effect pointing from variable i to variable j.

AhatAdjacency

If ev>0, the connectivity matrix retained by stability selection. Entries give the rounded percentage of times the edge has been retained (and 0 if below the critical threshold).

varianceEnv

The estimated interventions variances up to an offset. varianceEnv is a (Gxp)-dimensional matrix where G is the number of unique environments. The ij-th entry contains the difference between the estimated intervention variance of variable j in environment i and the estimated intervention variance of variable j in the base setting (given by input parameter baseSettingEnv).

References

Dominik Rothenhaeusler, Christina Heinze, Jonas Peters, Nicolai Meinshausen: backShift: Learning causal cyclic graphs from unknown shift interventions. Advances in Neural Information Processing Systems (NIPS) 28, 2015. arXiv: http://arxiv.org/abs/1506.02494

Examples

Run this code

# NOT RUN {
## Simulate data with connectivity matrix A

seed <- 1
# sample size n
n <- 10000
# 3 predictor variables
p  <- 3
A <- diag(p)*0
A[1,2] <- 0.8
A[2,3] <- -0.8
A[3,1] <- 0.8

# divide data into 10 different environments
G <- 10

# simulate
simulation.res <- simulateInterventions(
                    n, p, A, G, intervMultiplier = 2,
                    noiseMult = 1, nonGauss = FALSE,
                    fracVarInt = 0.5, hidden = TRUE,
                    knownInterventions = FALSE,
                    simulateObs = TRUE, seed)

environment <- simulation.res$environment
X <- simulation.res$X

## Compute feedback estimator with stability selection

network <- backShift(X, environment, ev = 1)

## Print point estimates and stable edges

# true connectivity matrix
print(A)
# point estimate
print(network$Ahat)
# shows empirical selection probability for stable edges
print(network$AhatAdjacency)
# }

Run the code above in your browser using DataLab