fitabn.mle: Fit an additive Bayesian network model based on maximum likelihood estimation.

Description

Fits an additive Bayesian network to observed data and is equivalent to classical generalized Linear multi-dimensional regression modelling. The estimation is done through an Iterative Reweighed Least Square scheme.

Usage

fitabn.mle(dag.m = NULL,
       data.df = NULL, 
       data.dists = NULL,
       adj.vars = NULL,
       cor.vars = NULL,
       centre = TRUE,
       maxit = 100, 
       tol = 10^-11)

Arguments

dag.m

a matrix or a formula statement defining the network structure, a directed acyclic graph (DAG), see details for format. Note that colnames and rownames must be set.

data.df

a data frame containing the data used for learning the network, binary variables must be declared as factors and no missing values all allowed in any variable.

data.dists

a named list giving the distribution for each node in the network, see details.

adj.vars

a character vector giving the column names in data.df for which the network score has to be adjusted for, see details.

cor.vars

a character vector giving the column names in data.df for which adjustment should be used.

centre

logical variable, should the observations in each Gaussian node first be standardised to mean zero and standard deviation one, defaults is TRUE.

maxit

integer given the maximum number of run for estimating network scores using an Iterative Reweighed Least Square algorithm.

tol

real number giving the minimal tolerance expected to terminate the Iterative Reweighed Least Square algorithm to estimate network score.

Value

A named list. One entry for each of the variables in data.df (excluding the adjusting variable, if present) which contains an estimate of the log marginal likelihood for that individual node. An entry "mlik" which is the total log marginal likelihood for the full ABN model. Same entries for aic, bic and mdl. The regression coefficients, standard error and P-value.

Details

The procedure fitabn.mle fits an additive Bayesian network model to data where each node (variable - a column in data.df) can be either: presence/absence (Bernoulli); continuous (Gaussian); an unbounded count (Poisson); or a discrete variable (Multinomial). The model comprises of a set of conditionally independent generalized linear regressions with or without adjustment.

Binary and discrete variables must be declared as factors, and the argument data.dists must be a list with named arguments, one for each of the variables in data.df, where each entry is either "poisson","binomial", "multinomial" or "gaussian", see examples below. The "poisson" and "binomial" distributions use log and logit link functions respectively. Note that "binomial" here actually means only binary, one Bernoulli trial per row in data.df.

In the context of fitabn.mle adjustment means that irrespective to the adjacency matrix the adjustment variable set (adj.vars) will be add as covariate to every node defined by cor.vars. If cor.vars is NULL then adjustment is over all variables in the data.df.

In the network structure definition, dag.m, each row represents a node in the network, and the columns in each row define the parents for that particular node, see the example below for the specific format. The dag.m can be provided using a formula statement (similar to glm). A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3. The formula statement have to start with ~. In this example, node1 has two parents (parent1 and parent2). node2 and node3 have the same parent3. The parents names have to exactly match those given in data.df. : is the separator between either children or parents, | separates children (left side) and parents (right side), + separates terms, . replaces all the variables in data.df.

The Information-theoretic based network scores used in fitabn.mle are the maximum likelihood (mlik, called marginal likelihood in this context as it is computed node wise), the Akaike Information Criteria (aic), the Bayesian Information Criteria (bic) and the Minimum distance Length (mdl). The classical definitions of those metrics are given in Kratzer and Furrer (2018).

The numerical routine are based on an iterative scheme to estimate the regression coefficients. The Iterative Reweighed Least Square (IRLS) programmed using Rcpp/RcppArmadrillo. One hard coded feature of fitabn.mle is a conditional use of a bias reduced binomial regression when a classical glm fails to accurately estimates the maximum likelihood of the given model. Additionally a QR decomposition is performed to check for rank deficiency. If the model is rank deficient and the BR glm fails to estimate it, then predictors are sequentially removed. This feature aims at better estimating network scores when data sparsity is present.

A special care should be taken when intrepreting or even displaing p-values computed with fitabn.mle. Indeed, the full model is already selected using goodness of fit metrics based on the (same) full dataset.

References

Kratzer, G., Furrer, R., 2018. Information-Theoretic Scoring Rules to Learn Additive Bayesian Network Applied to Epidemiology. Preprint; Arxiv: stat.ML/1808.01126.

Further information about abn can be found at: http://www.r-bayesian-networks.org

Examples

Run this code

# NOT RUN {
## use built-in simulated data set

mydat <- ex0.dag.data[,c("b1","b2","b3","g1","b4","p2","p4")];## take a subset of cols

## setup distribution list for each node
mydists <- list(b1="binomial",
              b2="binomial",
              b3="binomial",
              g1="gaussian",
              b4="binomial",
              p2="poisson",
              p4="poisson"
             );
## null model - all independent variables
mydag.empty<-matrix(data=c(
                     0,0,0,0,0,0,0, # 
                     0,0,0,0,0,0,0, #
                     0,0,0,0,0,0,0, # 
                     0,0,0,0,0,0,0, # 
                     0,0,0,0,0,0,0, #
                     0,0,0,0,0,0,0, #
                     0,0,0,0,0,0,0  #
                     ), byrow=TRUE,ncol=7);
colnames(mydag.empty)<-rownames(mydag.empty)<-names(mydat);
## now repeat but include some dependencies
mydag<-mydag.empty;
mydag["b1","b2"]<-1; # b1<-b2 
mydag["b2","p4"]<-1; # b2<-p4
mydag["b2","g1"]<-1; # b2<-g1
mydag["g1","p2"]<-1; # g1<-p2
mydag["b3","g1"]<-1; # b3<-g1
mydag["b4","b1"]<-1; # b4<-b1
mydag["p4","g1"]<-1; # p4<-g1

## now fit the model to calculate its goodness of fit
myres.c <- fitabn.mle(dag.m=mydag,data.df=mydat,data.dists=mydists,centre=TRUE);

print(myres.c$mlik);

plotabn(dag.m = mydag,data.dists = mydists,fitted.values.abn.mle = myres.c$modes)
# }

Run the code above in your browser using DataLab