Estimate the equivalence class of a directed acyclic graph (DAG) from observational data, using the PC-algorithm.
pc(suffStat, indepTest, alpha, labels, p,
   fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE, m.max = Inf,
   u2pd = c("relaxed", "rand", "retry"),
   skel.method = c("stable", "original", "stable.fast"),
   conservative = FALSE, maj.rule = FALSE, solve.confl = FALSE,
   numCores = 1, verbose = FALSE)
An object of class "pcAlgo" (see pcAlgo) containing an estimate of the equivalence class of the underlying DAG.
suffStat: A list of sufficient statistics, containing all necessary elements for the conditional independence decisions in the function indepTest.
indepTest: A function for testing conditional independence. It is internally called as indepTest(x,y,S,suffStat), and tests conditional independence of x and y given S. Here, x and y are variables, and S is a (possibly empty) vector of variables (all variables are denoted by their (integer) column positions in the adjacency matrix). suffStat is a list, see the argument above. The return value of indepTest is the p-value of the test for conditional independence.
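As a rough illustration of the calling convention, the sketch below defines a user-written test with the (x, y, S, suffStat) signature. The function name myCItest and the toy correlation matrix are hypothetical; the test itself is a Fisher z-test on partial correlations and assumes that suffStat carries a correlation matrix C and a sample size n.

```r
## Hypothetical conditional independence test with the signature that
## pc() expects: it receives column positions x, y, S and returns a p-value.
myCItest <- function(x, y, S, suffStat) {
  C <- suffStat$C                      # correlation matrix (assumed element)
  n <- suffStat$n                      # sample size (assumed element)
  idx <- c(x, y, S)
  ## partial correlation of x and y given S via the precision matrix
  P <- solve(C[idx, idx, drop = FALSE])
  r <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
  r <- max(min(r, 1 - 1e-12), -(1 - 1e-12))  # keep atanh finite
  ## Fisher z-transform; approximately N(0, 1) under independence
  z <- sqrt(n - length(S) - 3) * atanh(r)
  2 * pnorm(-abs(z))
}

## Toy correlation matrix for a chain 1 -> 2 -> 3 (made-up numbers)
C <- matrix(c(1,    0.5, 0.25,
              0.5,  1,   0.5,
              0.25, 0.5, 1), 3, 3)
suffStat <- list(C = C, n = 1000)
p_marg <- myCItest(1, 3, integer(0), suffStat)  # 1 and 3 marginally
p_cond <- myCItest(1, 3, 2, suffStat)           # 1 and 3 given 2
```

In this toy setup the marginal test rejects independence of variables 1 and 3, while the test given variable 2 does not. A function of this shape could then be passed as indepTest = myCItest.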
alpha: significance level (number in (0,1)) for the individual conditional independence tests.
labels: (optional) character vector of variable (or “node”) names. Typically preferred to specifying p.
p: (optional) number of variables (or nodes). May be specified if labels are not, in which case labels is set to 1:p.
numCores: Specifies the number of cores to be used for parallel estimation of skeleton.
verbose: If TRUE, detailed output is provided.
fixedGaps: A logical matrix of dimension p*p. If entry [i,j] or [j,i] (or both) are TRUE, the edge i-j is removed before starting the algorithm. Therefore, this edge is guaranteed to be absent in the resulting graph.
fixedEdges: A logical matrix of dimension p*p. If entry [i,j] or [j,i] (or both) are TRUE, the edge i-j is never considered for removal. Therefore, this edge is guaranteed to be present in the resulting graph.
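A minimal sketch of how such constraint matrices might be built in base R. The number of nodes and the chosen node pairs are made up for illustration.

```r
p <- 5  # hypothetical number of variables

## Edge 1 - 4 is forbidden: removed before the algorithm starts.
fixedGaps <- matrix(FALSE, p, p)
fixedGaps[1, 4] <- fixedGaps[4, 1] <- TRUE

## Edge 2 - 3 is required: never considered for removal.
fixedEdges <- matrix(FALSE, p, p)
fixedEdges[2, 3] <- fixedEdges[3, 2] <- TRUE

## The matrices would then be supplied alongside the other arguments, e.g.
## pc(suffStat, indepTest, alpha = 0.01, p = p,
##    fixedGaps = fixedGaps, fixedEdges = fixedEdges)
```

Setting both [i,j] and [j,i] is not required by the description above, but it keeps the matrices symmetric and easier to inspect.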
NAdelete: If indepTest returns NA and this option is TRUE, the corresponding edge is deleted. If this option is FALSE, the edge is not deleted.
m.max: Maximal size of the conditioning sets that are considered in the conditional independence tests.
u2pd: String specifying the method for dealing with conflicting information when trying to orient edges (see details below).
skel.method: Character string specifying method; the default, "stable", provides an order-independent skeleton, see skeleton.
conservative: Logical indicating if the conservative PC is used. In this case, only option u2pd = "relaxed" is supported. Note that therefore the resulting object might not be extendable to a DAG. See details for more information.
maj.rule: Logical indicating that the triples shall be checked for ambiguity using a majority rule idea, which is less strict than the conservative PC algorithm. For more information, see details.
solve.confl: If TRUE, the orientation of the v-structures and the orientation rules work with lists for candidate sets and allow bi-directed edges to resolve conflicting edge orientations. In this case, only option u2pd = "relaxed" is supported. Note that therefore the resulting object might not be a CPDAG because bi-directed edges might be present. See details for more information.
Markus Kalisch (kalisch@stat.math.ethz.ch), Martin Maechler, and Diego Colombo.
Under the assumption that the distribution of the observed variables is faithful to a DAG, this function estimates the Markov equivalence class of the DAG. We do not estimate the DAG itself, because this is typically impossible (even with an infinite amount of data), since different DAGs can describe the same conditional independence relationships. Since all DAGs in an equivalence class describe the same conditional independence relationships, they are equally valid ways to describe the conditional dependence structure that was given as input.
All DAGs in a Markov equivalence class have the same skeleton (i.e., the same adjacency information) and the same v-structures (see definition below). However, the direction of some edges may be undetermined, in the sense that they point one way in one DAG in the equivalence class, while they point the other way in another DAG in the equivalence class.
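The criterion above (same skeleton, same v-structures) can be checked mechanically. The following base R sketch is only an illustration: the helper names are hypothetical, and the adjacency convention (A[i, j] == 1 means i -> j) is chosen for this example and is not the coding used by the package.

```r
## Convention for this sketch only: A[i, j] == 1 means i -> j.
skeleton_of <- function(A) (A + t(A)) > 0

## List all v-structures a -> b <- c with a and c non-adjacent.
vstructs_of <- function(A) {
  sk <- skeleton_of(A)
  out <- character(0)
  for (b in seq_len(nrow(A))) {
    pa <- which(A[, b] == 1)           # parents of b
    if (length(pa) < 2) next
    prs <- combn(pa, 2)                # all pairs of parents
    for (k in seq_len(ncol(prs))) {
      a <- prs[1, k]; cc <- prs[2, k]
      if (!sk[a, cc]) out <- c(out, paste(a, b, cc, sep = ">"))
    }
  }
  sort(out)
}

markov_equivalent <- function(A, B)
  identical(skeleton_of(A), skeleton_of(B)) &&
  identical(vstructs_of(A), vstructs_of(B))

chain    <- matrix(0, 3, 3); chain[1, 2]    <- 1; chain[2, 3]    <- 1  # 1 -> 2 -> 3
revchain <- matrix(0, 3, 3); revchain[2, 1] <- 1; revchain[3, 2] <- 1  # 1 <- 2 <- 3
collider <- matrix(0, 3, 3); collider[1, 2] <- 1; collider[3, 2] <- 1  # 1 -> 2 <- 3

eq_chain    <- markov_equivalent(chain, revchain)
eq_collider <- markov_equivalent(chain, collider)
```

The two chains share a skeleton and have no v-structures, so they fall into the same equivalence class; the collider has the same skeleton but one v-structure, so it does not.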
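A Markov equivalence class can be uniquely represented by a completed partially directed acyclic graph (CPDAG). A CPDAG contains undirected and directed edges. The edges have the following interpretation: (i) there is a (directed or undirected) edge between i and j if and only if variables i and j are conditionally dependent given S for all possible subsets S of the remaining nodes; (ii) a directed edge i -> j means that the edge i -> j is present in all DAGs in the Markov equivalence class; (iii) an undirected edge i - j means that there is at least one DAG in the Markov equivalence class with edge i -> j and at least one DAG with edge i <- j.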
The CPDAG is estimated using the PC algorithm (named after its inventors
Peter Spirtes and Clark Glymour). The skeleton is
estimated by the function skeleton
which uses a modified
version of the original PC algorithm (see Colombo and Maathuis (2014) for
details). The original PC algorithm is known to be
order-dependent, in the sense that the output depends on the order in
which the variables are given. Therefore, Colombo and Maathuis (2014)
proposed a simple modification, called PC-stable, that yields
order-independent adjacencies in the skeleton (see the help file
of this function for details). Subsequently, as many edges as possible
are oriented. This is done in two steps. It is important to note that
if no further actions are taken (see below) these two steps still
remain order-dependent.
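The edges are oriented as follows. First, the algorithm considers all triples (a,b,c), where a and b are adjacent, b and c are adjacent, but a and c are not adjacent. For all such triples, both edges are directed towards b (a -> b <- c) if and only if b was not part of the conditioning set that made the edge between a and c drop out. These conditioning sets were saved in sepset. The structure a -> b <- c is called a v-structure.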
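The v-structure step can be sketched in a few lines of base R: an unshielded triple a - b - c is oriented as a -> b <- c when b is not in the separating set recorded for a and c. The function name, the nested-list layout of sepset, and the matrix convention below are assumptions made for this sketch, not the package's internals.

```r
## Sketch convention: G[i, j] == 1 and G[j, i] == 0 means i -> j;
## G[i, j] == G[j, i] == 1 means an undirected edge i - j.
orient_vstructures <- function(skel, sepset) {
  G <- skel * 1
  for (b in seq_len(nrow(skel))) {
    nb <- which(skel[, b])             # neighbours of b in the skeleton
    if (length(nb) < 2) next
    prs <- combn(nb, 2)
    for (k in seq_len(ncol(prs))) {
      a <- prs[1, k]; cc <- prs[2, k]
      ## unshielded triple with b outside the separating set of (a, cc)
      if (!skel[a, cc] && !(b %in% sepset[[a]][[cc]])) {
        G[b, a] <- 0                   # a -> b
        G[b, cc] <- 0                  # cc -> b
      }
    }
  }
  G
}

## Skeleton 1 - 2 - 3 with 1 and 3 non-adjacent
skel <- matrix(FALSE, 3, 3)
skel[1, 2] <- skel[2, 1] <- skel[2, 3] <- skel[3, 2] <- TRUE

sep_empty <- lapply(1:3, function(i) vector("list", 3))
sep_empty[[1]][[3]] <- integer(0)      # 1 and 3 separated by the empty set
sep_with2 <- sep_empty
sep_with2[[1]][[3]] <- 2L              # 1 and 3 separated by {2}

G_collider <- orient_vstructures(skel, sep_empty)   # 1 -> 2 <- 3
G_unchanged <- orient_vstructures(skel, sep_with2)  # stays 1 - 2 - 3
```

Note that this sketch simply overwrites earlier orientations when triples conflict, so its result can depend on the order in which triples are visited.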
After determining all v-structures, there may still be undirected edges. It may be possible to direct some of these edges, since one can deduce that one of the two possible directions of the edge is invalid because it introduces a new v-structure or a directed cycle. Such edges are found by repeatedly applying rules R1-R3 of the PC algorithm as given in Algorithm 2 of Kalisch and Bühlmann (2007). The algorithm stops if none of the rules is applicable to the graph.
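The conservative PC algorithm (conservative = TRUE) is a slight variation of the PC algorithm (see Ramsey et al. 2006). After the skeleton is computed, all potential v-structures a -> b <- c are checked in the following way. We test whether a and c are independent conditioning on all subsets of the neighbors of a and all subsets of the neighbors of c. When a subset makes a and c conditionally independent, we call it a separating set. If b is in no such separating set or in all such separating sets, no further action is taken and the usual PC is continued. If, however, b is in only some separating sets, the triple a - b - c is marked as "ambiguous". Moreover, if no separating set is found among the neighbors, the triple is also marked as "ambiguous". An ambiguous triple is not oriented as a v-structure. Furthermore, no further orientation rule that needs to know whether a - b - c is a v-structure or not is applied. Instead of using the conservative version, which is quite strict towards the v-structures, Colombo and Maathuis (2014) introduced a less strict version for the v-structures called majority rule. This adaptation can be called using maj.rule = TRUE. In this case, the triple a - b - c is marked as "ambiguous" if and only if b is in exactly 50 percent of such separating sets or no separating set was found. If b is in less than 50 percent of the separating sets, the triple is set as a v-structure, and if b is in more than 50 percent, it is set as a non-v-structure (for more details see Colombo and Maathuis, 2014). Using either the conservative or the majority rule version resolves the order-dependence issues in the determination of the v-structures.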
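To make the difference between the two checks concrete, here is a small base R sketch of the decision for a single triple a - b - c, following the rules described by Ramsey et al. (2006) and Colombo and Maathuis (2014): the conservative rule accepts a v-structure only if b is in none of the separating sets of a and c, while the majority rule looks at whether b is in fewer or more than half of them. The function name and the input format (a list of separating sets) are hypothetical.

```r
## Classify one unshielded triple a - b - c, given all separating sets
## found for (a, c) among subsets of the neighbours of a and of c.
classify_triple <- function(b, sepsets, rule = c("conservative", "majority")) {
  rule <- match.arg(rule)
  if (length(sepsets) == 0) return("ambiguous")  # no separating set found
  ## fraction of separating sets that contain the middle node b
  frac <- mean(vapply(sepsets, function(S) b %in% S, logical(1)))
  if (rule == "conservative") {
    if (frac == 0) "v-structure"
    else if (frac == 1) "non-v-structure"
    else "ambiguous"
  } else {
    if (frac < 0.5) "v-structure"
    else if (frac > 0.5) "non-v-structure"
    else "ambiguous"
  }
}

## Made-up separating sets for a pair (a, c); the middle node is b = 2
sets <- list(4, c(2, 4), 5)
res_cons <- classify_triple(2, sets, "conservative")
res_maj  <- classify_triple(2, sets, "majority")
```

Here b appears in one of three separating sets, so the conservative rule marks the triple as ambiguous, whereas the majority rule treats it as a v-structure.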
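Sampling errors (or hidden variables) can lead to conflicting information about edge directions. For example, one may find that a - b - c and b - c - d should both be directed as v-structures. This gives conflicting information about the edge b - c, since it should be directed as b <- c in the v-structure a -> b <- c, while it should be directed as b -> c in the v-structure b -> c <- d. With the option solve.confl = FALSE, in such cases, we simply overwrite the directions of the conflicting edge. In the example above this means that we obtain a -> b -> c <- d if a - b - c was visited first, and a -> b <- c <- d if b - c - d was visited first, meaning that the final orientation of the edge depends on the order in which the v-structures were considered. With the option solve.confl = TRUE (which is only supported with option u2pd = "relaxed"), we first generate a list of all (unambiguous) v-structures (in the example above a - b - c and b - c - d), and then we orient them allowing both directions on the edge b - c, namely we allow the bi-directed edge b <-> c, resolving the order-dependence issues on the edge orientations. We denote bi-directed edges in the adjacency matrix M of the graph as M[b,c] = 2 and M[c,b] = 2. In a similar way, using lists for the candidate edges for each orientation rule and allowing bi-directed edges, the order-dependence issues in the orientation rules can be resolved. Note that bi-directed edges merely represent a conflicting orientation and they should not be interpreted causally. The use of these lists for the candidate edges, together with allowing bi-directed edges, resolves the order-dependence issues in the orientation of the v-structures and in the orientation rules; see Colombo and Maathuis (2014) for more details.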
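Since conflicting orientations are marked with the value 2 in both M[b,c] and M[c,b], they can be located by scanning the adjacency matrix. The helper below is a hypothetical base R sketch; the toy matrix contains only the conflict marker, and all other entries are left at zero for simplicity.

```r
## Return all node pairs whose two adjacency entries both equal 2,
## i.e. the bi-directed (conflicting) edges.
find_bidirected <- function(M) {
  hits <- which(M == 2 & t(M) == 2 & upper.tri(M), arr.ind = TRUE)
  data.frame(from = rownames(M)[hits[, "row"]],
             to   = colnames(M)[hits[, "col"]])
}

## Made-up 4-node matrix with one conflicting edge b <-> c
M <- matrix(0, 4, 4, dimnames = list(letters[1:4], letters[1:4]))
M["b", "c"] <- 2
M["c", "b"] <- 2

conflicts <- find_bidirected(M)
```

In practice the matrix might be obtained from a fitted object with as(fit, "amat"), as done in the examples at the end of this page.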
Note that setting conservative = TRUE or maj.rule = TRUE, together with solve.confl = TRUE, produces a fully order-independent output; see Colombo and Maathuis (2014).
Sampling errors, non-faithfulness, or hidden variables can also lead to non-extendable CPDAGs, meaning that there does not exist a DAG that has the same skeleton and v-structures as the graph found by the algorithm. An example of this is an undirected cycle consisting of the edges a - b - c - d and d - a. In this case it is impossible to direct the edges without creating a cycle or a new v-structure. The option u2pd specifies what should be done in such a situation. If the option is set to "relaxed", the algorithm simply outputs the invalid CPDAG. If the option is set to "rand", all direction information is discarded and a random DAG is generated on the skeleton, which is then converted into its CPDAG. If the option is set to "retry", up to 100 combinations of possible directions of the ambiguous edges are tried, and the first combination that results in an extendable CPDAG is chosen. If no valid combination is found, an arbitrary DAG is generated on the skeleton as in the option "rand", and then converted into its CPDAG. Note that the output can also be an invalid CPDAG, in the sense that it cannot arise from the oracle PC algorithm, but be extendable to a DAG, for example a -> b <- c <- d. In this case, u2pd is not used.
Using the function isValidGraph
one can check if the final output is indeed a valid CPDAG.
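The non-extendability of an undirected four-cycle a - b - c - d - a can be verified by brute force: every one of the 16 possible orientations contains either a directed cycle or a new v-structure. The base R sketch below uses helper names and an adjacency convention (A[i, j] == 1 means i -> j) that are assumptions of this example only.

```r
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 1))  # the undirected 4-cycle

## Acyclicity check: repeatedly peel off nodes without incoming edges.
is_acyclic <- function(A) {
  while (nrow(A) > 0) {
    src <- which(colSums(A) == 0)
    if (length(src) == 0) return(FALSE)  # every node has a parent: cycle
    A <- A[-src, -src, drop = FALSE]
  }
  TRUE
}

## TRUE if some node has two non-adjacent parents.
has_vstructure <- function(A) {
  sk <- (A + t(A)) > 0
  for (b in seq_len(nrow(A))) {
    pa <- which(A[, b] == 1)
    if (length(pa) >= 2) {
      prs <- combn(pa, 2)
      if (any(!sk[cbind(prs[1, ], prs[2, ])])) return(TRUE)
    }
  }
  FALSE
}

n_ok <- 0
for (code in 0:15) {
  flip <- (code %/% c(1, 2, 4, 8)) %% 2 == 1  # which edges to reverse
  A <- matrix(0, 4, 4)
  for (e in 1:4) {
    from <- if (flip[e]) edges[e, 2] else edges[e, 1]
    to   <- if (flip[e]) edges[e, 1] else edges[e, 2]
    A[from, to] <- 1
  }
  if (is_acyclic(A) && !has_vstructure(A)) n_ok <- n_ok + 1
}
n_ok  # number of orientations that are acyclic without a new v-structure
```

The count is zero because any acyclic orientation of a cycle has a sink, and in a four-cycle the two parents of that sink are not adjacent, which creates a v-structure.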
Notes: (1) Throughout, the algorithm works with the column positions of the variables in the adjacency matrix, and not with the names of the variables. (2) When plotting the object, undirected and bidirected edges are equivalent.
D. Colombo and M.H. Maathuis (2014). Order-independent constraint-based causal structure learning. Journal of Machine Learning Research 15, 3741-3782.
M. Kalisch, M. Maechler, D. Colombo, M.H. Maathuis and P. Buehlmann (2012). Causal Inference Using Graphical Models with the R Package pcalg. Journal of Statistical Software 47(11), 1-26, doi:10.18637/jss.v047.i11.
M. Kalisch and P. Buehlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8, 613-636.
J. Ramsey, J. Zhang and P. Spirtes (2006). Adjacency-faithfulness and conservative causal inference. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, VA.
P. Spirtes, C. Glymour and R. Scheines (2000). Causation, Prediction, and Search, 2nd edition. The MIT Press.
skeleton for estimating a skeleton of a DAG; udag2pdag for converting the skeleton to a CPDAG; gaussCItest, disCItest, binCItest and dsepTest as examples for indepTest. isValidGraph for testing whether the output is a valid CPDAG.
##################################################
## Using Gaussian Data
##################################################
## Load predefined data (provides gmG and gmG8)
data(gmG)
n <- nrow(gmG8$x)
V <- colnames(gmG8$x) # labels aka node names
## estimate CPDAG
pc.fit <- pc(suffStat = list(C = cor(gmG8$x), n = n),
             indepTest = gaussCItest, ## indep.test: partial correlations
             alpha = 0.01, labels = V, verbose = TRUE)
if (require(Rgraphviz)) {
  ## show estimated CPDAG
  par(mfrow = c(1, 2))
  plot(pc.fit, main = "Estimated CPDAG")
  plot(gmG8$g, main = "True DAG")
}
##################################################
## Using d-separation oracle
##################################################
## define sufficient statistics (d-separation oracle)
suffStat <- list(g = gmG8$g, jp = RBGL::johnson.all.pairs.sp(gmG8$g))
## estimate CPDAG
fit <- pc(suffStat, indepTest = dsepTest, labels = V,
          alpha = 0.01) ## value is irrelevant as dsepTest returns either 0 or 1
if (require(Rgraphviz)) {
  ## show estimated CPDAG
  plot(fit, main = "Estimated CPDAG")
  plot(gmG8$g, main = "True DAG")
}
##################################################
## Using discrete data
##################################################
## Load data
data(gmD)
V <- colnames(gmD$x)
## define sufficient statistics
suffStat <- list(dm = gmD$x, nlev = c(3,2,3,4,2), adaptDF = FALSE)
## estimate CPDAG
pc.D <- pc(suffStat,
           ## independence test: G^2 statistic
           indepTest = disCItest, alpha = 0.01, labels = V, verbose = TRUE)
if (require(Rgraphviz)) {
  ## show estimated CPDAG
  par(mfrow = c(1, 2))
  plot(pc.D, main = "Estimated CPDAG")
  plot(gmD$g, main = "True DAG")
}
##################################################
## Using binary data
##################################################
## Load binary data
data(gmB)
V <- colnames(gmB$x)
## estimate CPDAG
pc.B <- pc(suffStat = list(dm = gmB$x, adaptDF = FALSE),
           indepTest = binCItest, alpha = 0.01, labels = V, verbose = TRUE)
pc.B
if (require(Rgraphviz)) {
  ## show estimated CPDAG
  plot(pc.B, main = "Estimated CPDAG")
  plot(gmB$g, main = "True DAG")
}
##################################################
## Detecting ambiguities due to sampling error
##################################################
## Load predefined data (provides gmG and gmG8)
data(gmG)
n <- nrow(gmG8$x)
V <- colnames(gmG8$x) # labels aka node names
## estimate CPDAG
pc.fit <- pc(suffStat = list(C = cor(gmG8$x), n = n),
             indepTest = gaussCItest, ## indep.test: partial correlations
             alpha = 0.01, labels = V, verbose = TRUE)
## due to sampling error, some edges were overwritten:
isValidGraph(as(pc.fit, "amat"), type = "cpdag")
## re-fit with solve.confl = TRUE
pc.fit2 <- pc(suffStat = list(C = cor(gmG8$x), n = n),
              indepTest = gaussCItest, ## indep.test: partial correlations
              alpha = 0.01, labels = V, verbose = TRUE,
              solve.confl = TRUE)
## conflicting edge is V5 - V6
as(pc.fit2, "amat")