cpquery: Perform conditional probability queries

Description

Perform conditional probability queries (CPQs).

Usage

cpquery(fitted, event, evidence, cluster, method = "ls", ...,
  debug = FALSE)
cpdist(fitted, nodes, evidence, cluster, method = "ls", ...,
  debug = FALSE)
mutilated(x, evidence)

Value

cpquery() returns a numeric value, the conditional probability of event conditional on evidence.

cpdist() returns a data frame containing the samples generated from the conditional distribution of the nodes conditional on evidence. The data frame has class c("bn.cpdist", "data.frame"), and a method attribute storing the value of the method argument. In the case of likelihood weighting, the weights are also attached as an attribute called weights.

mutilated returns a bn or bn.fit object, depending on the class of x.

Arguments

fitted: an object of class bn.fit.
x: an object of class bn or bn.fit.
event, evidence: see below.
nodes: a vector of character strings, the labels of the nodes whose conditional distribution we are interested in.
cluster: an optional cluster object from package parallel.
method: a character string, the method used to perform the conditional probability query. Currently only logic sampling (ls, the default) and likelihood weighting (lw) are implemented.
...: additional tuning parameters.
debug: a boolean value. If TRUE a lot of debugging output is printed; otherwise the function is completely silent.

Logic Sampling

Logic sampling is an approximate inference algorithm.

The event and evidence arguments must be two expressions describing the event of interest and the conditioning evidence in a format such that, if we denote with data the data set the network was learned from, data[evidence, ] and data[event, ] return the correct observations. If either event or evidence is set to TRUE an unconditional probability query is performed with respect to that argument.

Three tuning parameters are available:

n: a positive integer number, the number of random samples to generate from fitted. The default value is 5000 * log10(nparams(fitted)) for discrete and conditional Gaussian networks and 500 * nparams(fitted) for Gaussian networks.
batch: a positive integer number, the number of random samples that are generated at one time. Defaults to 10^4. If n is very large (e.g. 10^12), R would run out of memory if it tried to generate them all at once. Instead random samples are generated in batches of size batch, discarding each batch before generating the next.
query.nodes: a vector of character strings, the labels of the nodes involved in event and evidence. Simple queries do not require generating samples from all the nodes in the network, so cpquery and cpdist try to identify which nodes are used in event and evidence and reduce the network to their upper closure. query.nodes may be used to manually specify these nodes when automatic identification fails; there is no reason to use it otherwise.
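
As a rough illustration of the tuning parameters, the sketch below computes the documented default for n on a discrete network and then overrides n and batch in a call to cpquery(). It assumes the learning.test data set shipped with bnlearn; the value used internally for n may be rounded, so n.default is only an approximation.

```r
library(bnlearn)

data(learning.test)
fitted = bn.fit(hc(learning.test), learning.test)

# the documented default for a discrete network; the value used internally
# may be rounded, so treat this as an approximation.
n.default = 5000 * log10(nparams(fitted))
n.default

# request more samples than the default, generated 10^4 at a time.
set.seed(42)
p = cpquery(fitted, event = (B == "b"), evidence = (A == "a"),
  n = 10^5, batch = 10^4)
p
```

Increasing n reduces the simulation noise in the estimate at the cost of a longer running time; batch only affects how much memory is used at once.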

Note that the number of samples returned by cpdist() is never larger than n, and is usually smaller, because logic sampling is a form of rejection sampling: only the observations matching evidence (out of the n that are generated) are returned, and their number depends on the probability of evidence. Furthermore, the probabilities returned by cpquery() are approximate estimates: when they are computed in separate calls to cpquery(), they will not necessarily sum to 1 even if the underlying true values do.
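
The rejection step can be sketched in base R on a toy two-node network A -> B. The network, its probabilities and the variable names are made up for illustration, and the code is not how bnlearn implements logic sampling internally; it only shows why the number of retained samples depends on the probability of the evidence.

```r
# toy network A -> B with made-up probabilities (illustrative only).
set.seed(1)
n = 10000

# sample A from its marginal distribution.
A = sample(c("a", "b"), n, replace = TRUE, prob = c(0.3, 0.7))
# sample B from its distribution conditional on A: P(B == "y" | A).
p.y = ifelse(A == "a", 0.9, 0.2)
B = ifelse(runif(n) < p.y, "y", "n")
sims = data.frame(A = A, B = B)

# rejection step: keep only the samples that match the evidence B == "y".
kept = sims[sims$B == "y", ]
nrow(kept)

# estimate P(A == "a" | B == "y") from the retained samples only.
est = mean(kept$A == "a")
est
```

Here the exact value is 0.27 / 0.41 (about 0.659), and roughly n * 0.41 samples survive the rejection step. Rarer evidence leaves fewer samples and therefore a noisier estimate.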

Likelihood Weighting

Likelihood weighting is an approximate inference algorithm based on Monte Carlo sampling.

The event argument must be an expression describing the event of interest, as in logic sampling. The evidence argument must be a named list:

Each element corresponds to one node in the network and must contain the value that node will be set to when sampling.
In the case of a continuous node, two values can also be provided. In that case, the value for that node will be sampled from a uniform distribution on the interval delimited by the specified values.
In the case of a discrete or ordinal node, two or more values can also be provided. In that case, the value for that node will be sampled with uniform probability from the set of specified values.

If either event or evidence is set to TRUE an unconditional probability query is performed with respect to that argument.
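
The three forms of evidence described above might be written as follows. This is a sketch using the learning.test and gaussian.test data sets shipped with bnlearn; the particular nodes, values and interval are arbitrary choices made for illustration.

```r
library(bnlearn)

data(learning.test)
fitted = bn.fit(hc(learning.test), learning.test)

set.seed(42)
# a single value per node: A is set to "a" in every sample.
p.single = cpquery(fitted, event = (B == "b"),
  evidence = list(A = "a"), method = "lw")
# several values for a discrete node: A is drawn uniformly from {"a", "b"}.
p.multi = cpquery(fitted, event = (B == "b"),
  evidence = list(A = c("a", "b")), method = "lw")

data(gaussian.test)
gfitted = bn.fit(hc(gaussian.test), gaussian.test)
# two values for a continuous node: F is drawn uniformly from [20, 25].
p.interval = cpquery(gfitted, event = (A > 1),
  evidence = list(F = c(20, 25)), method = "lw")

c(p.single, p.multi, p.interval)
```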

Tuning parameters are the same as for logic sampling: n, batch and query.nodes.

Note that the samples returned by cpdist() are generated from the mutilated network, and need to be weighted appropriately when computing summary statistics (for more details, see the references below). cpquery does that automatically when computing the final conditional probability. Also note that the batch argument is ignored in cpdist() for speed and memory efficiency. Furthermore, the probabilities returned by cpquery() are approximate estimates: when they are computed in separate calls to cpquery(), they will not necessarily sum to 1 even if the underlying true values do.
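
One way to apply the weights by hand is sketched below: the weights attribute of the cpdist() output is summed within each level of A and normalised. The choice of nodes and evidence is arbitrary, and the variable names are illustrative.

```r
library(bnlearn)

data(learning.test)
fitted = bn.fit(hc(learning.test), learning.test)

set.seed(42)
sims = cpdist(fitted, nodes = "A", evidence = list(B = "b"), method = "lw")
w = attr(sims, "weights")

# weighted estimate of the distribution of A given B == "b".
p.weighted = tapply(w, sims$A, sum) / sum(w)
p.weighted

# the unweighted frequencies describe the mutilated network instead, and
# should not be read as the conditional distribution.
prop.table(table(sims$A))
```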

Author

Marco Scutari

Details

cpquery estimates the conditional probability of event given evidence using the method specified in the method argument.

cpdist generates random samples conditional on the evidence using the method specified in the method argument.

mutilated constructs the mutilated network arising from an ideal intervention setting the nodes involved to the values specified by evidence. In this case evidence must be provided as a list in the same format as for likelihood weighting (see the Likelihood Weighting section).
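
The effect of an ideal intervention on a discrete node can be inspected as in the sketch below, which compares the parents of B before and after mutilation. The node and value are arbitrary, and the parents of B in the original network depend on the structure learned by hc().

```r
library(bnlearn)

data(learning.test)
fitted = bn.fit(hc(learning.test), learning.test)

# parents of B in the fitted network.
parents(fitted, "B")

# intervene on B, setting it to "b".
mut = mutilated(fitted, evidence = list(B = "b"))

# the arcs pointing into B are removed by the intervention...
parents(mut, "B")
# ... and the local distribution of B reflects the value it was set to.
mut$B
```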

Note that both cpquery and cpdist are based on Monte Carlo particle filters, and therefore they may return slightly different values on different runs due to simulation noise.
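
The simulation noise can be seen, and controlled, with a short experiment such as the one below. It assumes that no cluster is used, so that the samples are drawn from the random number generator of the current R session and set.seed() applies.

```r
library(bnlearn)

data(learning.test)
fitted = bn.fit(hc(learning.test), learning.test)

# two runs without a fixed seed generally give slightly different estimates.
p1 = cpquery(fitted, event = (B == "b"), evidence = (A == "a"))
p2 = cpquery(fitted, event = (B == "b"), evidence = (A == "a"))
c(p1, p2)

# fixing the seed before each call makes the estimate reproducible.
set.seed(123)
q1 = cpquery(fitted, event = (B == "b"), evidence = (A == "a"))
set.seed(123)
q2 = cpquery(fitted, event = (B == "b"), evidence = (A == "a"))
identical(q1, q2)
```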

References

Koller D, Friedman N (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Korb K, Nicholson AE (2010). Bayesian Artificial Intelligence. Chapman & Hall/CRC, 2nd edition.

Examples

## discrete Bayesian network (it is the same with ordinal nodes).
data(learning.test)
fitted = bn.fit(hc(learning.test), learning.test)
# the result should be around 0.025.
cpquery(fitted, (B == "b"), (A == "a"))
# programmatically build a conditional probability query...
var = names(learning.test)
obs = 2
str = paste("(", names(learning.test)[-3], " == '",
        sapply(learning.test[obs, -3], as.character), "')",
        sep = "", collapse = " & ")
str
str2 = paste("(", names(learning.test)[3], " == '",
         as.character(learning.test[obs, 3]), "')", sep = "")
str2

cmd = paste("cpquery(fitted, ", str2, ", ", str, ")", sep = "")
eval(parse(text = cmd))
# ... but note that predict works better in this particular case.
attr(predict(fitted, "C", learning.test[obs, -3], prob = TRUE), "prob")
# do the same with likelihood weighting.
cpquery(fitted, event = eval(parse(text = str2)),
  evidence = as.list(learning.test[obs, -3]), method = "lw")
attr(predict(fitted, "C", learning.test[obs, -3],
               method = "bayes-lw", prob = TRUE), "prob")
# conditional distribution of A given C == "c".
table(cpdist(fitted, "A", (C == "c")))

## Gaussian Bayesian network.
data(gaussian.test)
fitted = bn.fit(hc(gaussian.test), gaussian.test)
# the result should be around 0.04.
cpquery(fitted,
  event = ((A >= 0) & (A <= 1)) & ((B >= 0) & (B <= 3)),
  evidence = (C + D < 10))

## ideal interventions and mutilated networks.
mutilated(fitted, evidence = list(F = 42))
