getRanking: Estimate a ranking of edges for causal relations in the underlying graph structure using stability ranking.

Description

Estimates a ranking of edges for a given query, e.g. for parental relations in the underlying causal graph structure, using various possible methods.

Supported methods at the moment are ARGES, backShift, bivariateANM, bivariateCAM, CAM, FCI, FCI+, GES, GIES, hiddenICP, ICP, LINGAM, MMHC, rankARGES, rankFci, rankGES, rankGIES, rankPC, regression, RFCI and PC.

Usage

getRanking(
  X,
  environment,
  interventions = NULL,
  queries = c("isParent", "isMaybeParent", "isNoParent", "isAncestor",
    "isMaybeAncestor", "isNoAncestor"),
  method = c("ICP", "hiddenICP", "backShift", "pc", "LINGAM", "ges", "gies", "CAM",
    "fci", "rfci", "regression", "bivariateANM", "bivariateCAM")[1],
  alpha = 0.1,
  variableSelMat = NULL,
  excludeTargetInterventions = TRUE,
  onlyObservationalData = FALSE,
  indexObservationalData = NULL,
  setOptions = list(),
  assumeNoSelectionVars = TRUE,
  nsim = 100,
  sampleSettings = 1/sqrt(2),
  sampleObservations = 1/sqrt(2),
  verbose = FALSE,
  ...
)

Arguments

X

A $(n x p)$-data matrix with $n$ observations of $p$ variables.

environment

A vector of length $n$, where the entry for observation $i$ is an index for the environment in which observation $i$ took place (in the simplest case, entries 1 for observational data and entries 2 for interventional data of unspecified type). Required for the methods ICP, hiddenICP and backShift.

interventions

An optional list of length n. The entry for observation i is a numeric vector that specifies the variables on which interventions happened for observation i (a scalar if an intervention happened on just one variable and numeric(0) if no intervention occurred for this observation). Used for the method gies. It is also used to generate the vector environment if environment is set to NULL (although this might generate too many different environments for some data, so a hand-picked vector environment is preferable). Also used for ICP and hiddenICP to exclude interventions on the target variable of interest.

queries

One or more of "isParent", "isMaybeParent", "isNoParent", "isAncestor", "isMaybeAncestor", "isNoAncestor".

method

A string that specifies the method to use. The methods pc (PC-algorithm), LINGAM (LINGAM), arges (Adaptively restricted greedy equivalence search), ges (Greedy equivalence search), gies (Greedy interventional equivalence search), fci (Fast causal inference) and rfci (Really fast causal inference) are imported from the package "pcalg" and are documented there in more detail, including the additional options that can be supplied via setOptions. The method CAM (Causal additive models) is documented in the package "CAM", and the methods ICP (Invariant causal prediction) and hiddenICP (Invariant causal prediction with hidden variables) are from the package "InvariantCausalPrediction". The method backShift comes from the package "backShift". The method mmhc comes from the package "bnlearn". Finally, the methods bivariateANM and bivariateCAM are for now implemented internally but will hopefully be part of another package at some point in the near future.

alpha

The level at which tests are done. This leads to confidence intervals for ICP and hiddenICP and is used internally for pc and rfci.

variableSelMat

An optional logical matrix of dimension $(p x p)$. A value of TRUE in entry (i,j) means that variable i should be considered as a potential parent for variable j; a value of FALSE means that it should not. If the default value of NULL is used, all variables will be considered, but this can be very slow, especially for the methods pc, ges, gies, rfci and CAM.

excludeTargetInterventions

When looking for parents of variable k in 1,...,p, set to TRUE if observations where an intervention on variable k occurred should be excluded. Default is TRUE.

onlyObservationalData

If set to TRUE, only observational data is used. It will take the index in environment specified by indexObservationalData. If environment is NULL, all observations are used. Default is FALSE.

indexObservationalData

Index in environment that encodes observational data. Default is 1.

setOptions

A list that can take method-specific options; see the individual documentations of the methods for more options and their possible values.

assumeNoSelectionVars

Set to TRUE if you want to assume the absence of selection variables.

nsim

The number of resamples for stability selection.

sampleSettings

The fraction of different environments to resample in each resampling (at least two different environments will be selected so the argument is without effect if there are just two different environments in total).

sampleObservations

The fraction of samples to resample in each environment.

verbose

If TRUE, detailed output is provided.

...

Parameters to be passed to underlying method's function.
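
The relationship between the interventions, environment and variableSelMat arguments described above can be illustrated with a small base-R sketch. The toy data and the rule "one environment per distinct intervention pattern" are assumptions made for illustration; the package's internal construction of environment may differ.

```r
# Hypothetical toy setup: n = 6 observations of p = 3 variables.
n <- 6
p <- 3

# One list entry per observation; numeric(0) marks an observation
# without any intervention.
interventions <- list(numeric(0), numeric(0), 2, 2, c(1, 3), c(1, 3))

# Assumption: one environment per distinct intervention pattern.
# Each pattern is turned into a string key, and the keys are indexed
# in order of first appearance.
pattern_key <- vapply(interventions,
                      function(v) paste(sort(v), collapse = ","),
                      character(1))
environment <- match(pattern_key, unique(pattern_key))

# Restrict the search: only variables 1 and 2 are candidate parents of
# variable 3, and no variable is a candidate parent of variables 1 and 2.
variableSelMat <- matrix(FALSE, nrow = p, ncol = p)
variableSelMat[c(1, 2), 3] <- TRUE
```

With many distinct intervention patterns this rule produces many small environments, which is the situation in which a hand-picked environment vector is preferable.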

Value

A list with the following entries:

ranking: A list of length length(queries). For each query, the corresponding list entry contains a matrix of dimension $(p x p) x 2$ with the ranking of edges. E.g. the first row indicates that the edge from ranking$isParent[1,1] to ranking$isParent[1,2] is the most likely edge according to the method under consideration.
resList: A list of length length(queries). For each query, the corresponding list entry contains a matrix of dimension $(p x p)$ with the counts for $A_{i,j} = 1$ across the nsim subsamples.
simEstimates: A list of length nsim with the method's output for each of the nsim subsamples.

Details

For both parental and ancestral relations, three queries are supported. The existence of a relation is assessed by the queries isParent and isAncestor; the absence of a relation is assessed by the queries isNoParent and isNoAncestor; the potential existence of a relation is addressed by the queries isMaybeParent and isMaybeAncestor.

All queries return a connectivity matrix which we denote by $A$. The interpretation of the entries of $A$ differs according to the considered query:

Parental relations: Queries concerning parental relations can only be answered by those methods under consideration that return a DAG, a CPDAG or a directed cyclic graph. When we say that a particular method cannot answer a given query, then the method's output with respect to this query will be the zero matrix. However, the eventual ranking for such a query will not necessarily be random due to the tie breaking scheme that is applied when ranking pairs of variables (see below).

isParent: In the connectivity matrix $A$ returned by this query, the entry $A_{i,j} = 1$ means that there is a directed edge from node $i$ to node $j$ in the graph structure estimated by the method under consideration. Otherwise, $A_{i,j} = 0$.
isMaybeParent: $A_{i,j} = 1$ means that there is a directed or an undirected edge from node $i$ to node $j$ in the estimated graph structure. Otherwise, $A_{i,j} = 0$.
isNoParent: $A_{i,j} = 1$ means that there is neither a directed nor an undirected edge from node $i$ to node $j$ in the estimated graph structure. Otherwise, $A_{i,j} = 0$.
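
As a rough illustration of the three parental queries, the following base-R sketch derives the connectivity matrices from a toy CPDAG. The adjacency encoding (a single 1 for a directed edge, a symmetric pair of 1s for an undirected edge) and the treatment of the diagonal are assumptions for this example, not necessarily what the package uses internally.

```r
# Toy CPDAG on 3 nodes, encoded as an adjacency matrix G (assumption:
# G[i, j] = 1 and G[j, i] = 0 is a directed edge i -> j, while
# G[i, j] = G[j, i] = 1 is an undirected edge i - j).
G <- matrix(0, 3, 3)
G[1, 2] <- 1                 # directed edge 1 -> 2
G[2, 3] <- 1; G[3, 2] <- 1   # undirected edge 2 - 3

# isParent: directed edge from i to j only.
isParent <- (G == 1 & t(G) == 0) * 1

# isMaybeParent: directed or undirected edge from i to j.
isMaybeParent <- (G == 1) * 1

# isNoParent: neither a directed nor an undirected edge from i to j.
isNoParent <- (G == 0) * 1
diag(isNoParent) <- 0  # assumption: self-relations are ignored
```

In this toy graph the undirected edge between nodes 2 and 3 contributes to isMaybeParent in both directions but not to isParent.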

Ancestral relations: Queries concerning ancestral relations can be answered by all methods under consideration.

isAncestor: $A_{i,j} = 1$ means that there is a directed path from node $i$ to node $j$ in the estimated graph structure. Otherwise, $A_{i,j} = 0$. In case of PAGs, directed paths can contain the edge types $i --> j$ and $i --o j$. Including the latter edge type in this category implies that we exclude the existence of selection variables.
isMaybeAncestor: $A_{i,j} = 1$ means that there is a path from node $i$ to node $j$ that contains directed and/or undirected edges. Otherwise, $A_{i,j} = 0$. For PAGs, such paths can contain the edge types $i --> j$, $i --o j$, $i o-o j$ and/or $i o-> j$.
isNoAncestor: $A_{i,j} = 1$ means that there is neither a directed path nor a partially directed path from node $i$ to node $j$ in the estimated graph structure. Otherwise, $A_{i,j} = 0$.
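
For the simplest case of an estimated DAG, the ancestral queries amount to reachability along directed edges. The base-R sketch below computes this for a toy graph. It is an illustration under the assumption that G[i, j] = 1 encodes a directed edge from i to j, and it does not cover the PAG edge types discussed above.

```r
# Toy DAG 1 -> 2 -> 3, with G[i, j] = 1 meaning a directed edge i -> j.
G <- matrix(0, 3, 3)
G[1, 2] <- 1
G[2, 3] <- 1

# Reachability by repeated boolean matrix multiplication: after k rounds,
# reach[i, j] = 1 if a directed path of length <= k + 1 leads from i to j.
reach <- G
for (k in seq_len(nrow(G) - 1)) {
  reach <- ((reach + reach %*% G) > 0) * 1
}

isAncestor <- reach

# In a DAG there are no undirected edges, so "no ancestor" is simply the
# complement of reachability.
isNoAncestor <- 1 - reach
diag(isNoAncestor) <- 0  # assumption: self-relations are ignored
```

Node 1 is an ancestor of node 3 here even though there is no direct edge between them, which is the difference between isAncestor and isParent.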

Stability ranking: To obtain a ranking of edges for a given set of queries, we run the method under consideration on nsim random subsamples of the data. In each round, we draw samples from a fraction of settings, where the size of the fraction is specified by sampleSettings. In each chosen setting, we sample a fraction of observations uniformly at random without replacement, where the size of the fraction is specified by sampleObservations.

For each subsample we randomly permute the order of the variables in the input. Methods that are order-dependent can therefore not exploit any potential advantage stemming from a data matrix with columns ordered according to the causal ordering or a similar one. We then run the method on each subsample.
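
One round of the subsampling scheme described above might look roughly like the following base-R sketch. The helper draw_subsample is hypothetical, and the rounding of the two fractions is an assumption; the package may round differently.

```r
set.seed(1)

# Hypothetical data: 90 observations of 4 variables in 3 environments.
n <- 90; p <- 4
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
environment <- rep(1:3, each = 30)

sampleSettings     <- 1 / sqrt(2)
sampleObservations <- 1 / sqrt(2)

# Hypothetical helper, not part of the package.
draw_subsample <- function(X, environment, sampleSettings, sampleObservations) {
  settings <- unique(environment)
  # Keep a fraction of the environments, but never fewer than two.
  n_keep <- min(length(settings),
                max(2, round(sampleSettings * length(settings))))
  chosen <- settings[sample.int(length(settings), n_keep)]

  # Within each chosen environment, sample rows without replacement.
  rows <- unlist(lapply(chosen, function(e) {
    idx <- which(environment == e)
    idx[sample.int(length(idx), round(sampleObservations * length(idx)))]
  }))

  # Randomly permute the columns and keep the permutation so that
  # estimated edges can be mapped back to the original variable order.
  perm <- sample.int(ncol(X))
  list(X = X[rows, perm, drop = FALSE],
       environment = environment[rows],
       perm = perm)
}

sub <- draw_subsample(X, environment, sampleSettings, sampleObservations)
```

Repeating this nsim times and running the chosen method on each sub$X gives the per-subsample connectivity matrices that are counted in the next step.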

For each subsample and a particular query, we obtain the corresponding connectivity matrix $A$. We can then rank all pairs of nodes $i,j$ according to the frequency of the occurrence of $A_{i,j} = 1$ across subsamples. Ties between pairs of variables can be broken with the results of the other queries if they are also computed as specified by queries; otherwise ties are broken at random:

If the query is isParent, ties are broken with counts for isMaybeParent.
For the query isMaybeParent ties are broken with counts for isParent, i.e. in case of equal counts we give a preference to the edge that was considered more often to be a 'certain' parent. For methods returning DAGs this scheme makes the ranking for isMaybeParent equal to the result for isParent, up to the random tie breaking that is applied for isParent.
If the query is isNoParent, ties are broken according to which edge was selected less often in the query isMaybeParent.
If the query is isAncestor, ties are broken with counts for isMaybeAncestor.
For the query isMaybeAncestor ties are broken with counts for isAncestor, i.e. in case of equal counts we give a preference to the edge that was considered more often to be a 'certain' ancestor. For methods returning DAGs this scheme makes the ranking for isMaybeAncestor equal to the result for isAncestor, up to the random tie breaking that is applied for isAncestor.
If the query is isNoAncestor, ties are broken according to which one was selected less often in the query isMaybeAncestor.

If the tie breaking matrix defined according to these rules is 0, a matrix with standard normal random entries is used to break ties. Similarly, if there are remaining ties after applying the tie breaking rules described above, ties are broken randomly.
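
The ranking and tie-breaking rules can be sketched in base R as follows. The counts are made up, the helper rank_edges is hypothetical, and dropping the pairs with i = j is an assumption for this illustration.

```r
set.seed(1)

# Hypothetical selection counts over nsim = 10 subsamples for p = 3 variables.
counts_isParent <- matrix(c(0, 7, 7,
                            2, 0, 0,
                            0, 0, 0), nrow = 3, byrow = TRUE)
counts_isMaybeParent <- matrix(c(0, 9, 8,
                                 3, 0, 1,
                                 0, 0, 0), nrow = 3, byrow = TRUE)

# Hypothetical helper, not part of the package.
rank_edges <- function(counts, tie_counts) {
  # Fall back to standard normal noise if the tie-breaking matrix is all zero.
  if (all(tie_counts == 0)) {
    tie_counts <- matrix(rnorm(length(counts)), nrow = nrow(counts))
  }
  # All ordered pairs (i, j) with i != j (assumption: no self-edges).
  pairs <- which(row(counts) != col(counts), arr.ind = TRUE)
  # Sort by count, then by tie-breaking count, then by random noise.
  ord <- order(counts[pairs], tie_counts[pairs], runif(nrow(pairs)),
               decreasing = TRUE)
  unname(pairs[ord, , drop = FALSE])
}

# Ranking for isParent, with ties broken by the isMaybeParent counts.
ranking <- rank_edges(counts_isParent, counts_isMaybeParent)
```

With these toy counts, the edges 1 -> 2 and 1 -> 3 tie at 7 for isParent and are separated by their isMaybeParent counts (9 versus 8), so 1 -> 2 is ranked first. For isNoParent, the same helper could be called with the negated isMaybeParent counts as the tie-breaking matrix, so that edges selected less often are preferred.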

Examples

# NOT RUN {
data("simDataInv")
X <- simDataInv$X
set.seed(1)
if(require(pcalg)){
  rank <- getRanking(X,
                environment = simDataInv$environment,
                queries = c("isParent","isMaybeParent"),
                method = c("LINGAM"),
                verbose = FALSE)
  # estimated ranking
  print(rank$ranking$isParent)
 
  # true adjacency matrix
  print(simDataInv$configs$trueA)
}else{
  cat("\nThe package 'pcalg' is needed for the example to work. Please install it.\n")
}

# }