This function returns multiple proxies for estimating the connection strength of the edges of a possibly discretized Bayesian network's dataset. The returned connection strength measures are: the raw mutual information (mi.raw), the percentage mutual information (mi.raw.pc), the mutual information computed via correlation (mi.corr), the link strength (ls), the percentage link strength (ls.pc) and the statistical distance (stat.dist).
The general concept of entropy is defined for probability distributions. Here the probabilities are estimated from the data using frequency tables, and these estimates are plugged into the definition of entropy, yielding the so-called empirical entropy. A well-known problem of the empirical entropy is that its estimates are biased due to sampling noise; it is also known that this bias decreases as the sample size increases.
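As an illustration of the plug-in approach (a minimal Python sketch, not this function's implementation), the empirical entropy of a discrete sample can be computed directly from a frequency table:

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Plug-in (empirical) entropy in nats: the probabilities are
    replaced by the observed relative frequencies of each symbol."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# For a balanced two-symbol sample the estimate is log(2) ~ 0.693;
# for small samples the plug-in estimate is biased downward, and the
# bias shrinks as the sample size grows.
print(empirical_entropy(["a", "a", "b", "b"]))  # log(2) ~ 0.693
```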
The mutual information is estimated from the observed frequencies through a plug-in estimator based on entropy. In what follows, consider an arc going from node X to node Y, and let Z denote the set of the remaining parents of Y.
The mutual information is defined as I(X, Y) = H(X) + H(Y) - H(X, Y), where H() is the entropy.
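This identity translates directly into a plug-in estimator: estimate each entropy term from frequency tables and combine them. A hedged Python sketch (illustrative only, with made-up data):

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in entropy in nats from observed relative frequencies."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """Plug-in estimate of I(X, Y) = H(X) + H(Y) - H(X, Y).
    The joint entropy H(X, Y) is the entropy of the paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

x = [0, 0, 1, 1]
y = [0, 0, 1, 1]  # y duplicates x, so I(X, Y) = H(X) = log(2)
print(mutual_information(x, y))  # log(2) ~ 0.693
```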
The percentage mutual information is defined as PI(X, Y) = I(X, Y) / H(Y|Z).
The mutual information computed via correlation is defined as MI(X, Y) = -0.5 log(1 - cor(X, Y)^2).
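This formula is the exact mutual information of a bivariate Gaussian pair; for other data it serves as a correlation-based proxy. A small Python sketch (the data values are made up for illustration):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mi_corr(xs, ys):
    """MI(X, Y) = -0.5 * log(1 - cor(X, Y)^2); exact for bivariate
    Gaussian variables, a correlation-based proxy otherwise."""
    r = pearson_r(xs, ys)
    return -0.5 * math.log(1.0 - r * r)

x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]  # strongly but not perfectly correlated with x
print(mi_corr(x, y))
```

Note that the estimate diverges as |cor(X, Y)| approaches 1, which matches the fact that the mutual information of two deterministically related continuous variables is infinite.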
The link strength is defined as LS(X->Y) = H(Y|Z) - H(Y|X,Z).
The percentage link strength is defined as PLS(X->Y) = LS(X->Y) / H(Y|Z).
The statistical distance is defined as SD(X, Y) = 1 - MI(X, Y) / max(H(X), H(Y)).
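The conditional entropies behind these measures follow from the chain rule H(Y|Z) = H(Y, Z) - H(Z), so all of them reduce to plug-in entropies of (joint) frequency tables. A toy Python sketch with made-up discretized data (illustrative only, not this function's code):

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in entropy in nats from observed relative frequencies."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def cond_entropy(ys, zs):
    """H(Y | Z) = H(Y, Z) - H(Z) via the chain rule."""
    return entropy(list(zip(ys, zs))) - entropy(zs)

# Toy data for an arc X -> Y, with Z the remaining parent set of Y.
x = [0, 0, 1, 1, 0, 1, 0, 1]
z = [0, 1, 0, 1, 0, 1, 0, 1]
y = [0, 1, 1, 0, 0, 1, 0, 0]

z_t = [(v,) for v in z]                  # Z as a tuple of parent values
xz = [(a, b) for a, b in zip(x, z)]      # the pair (X, Z)

ls = cond_entropy(y, z_t) - cond_entropy(y, xz)          # LS(X -> Y)
pls = ls / cond_entropy(y, z_t)                          # PLS(X -> Y)
mi = entropy(x) + entropy(y) - entropy(list(zip(x, y)))  # I(X, Y)
sd = 1 - mi / max(entropy(x), entropy(y))                # SD(X, Y)
print(ls, pls, sd)
```

Since conditioning on more variables never increases entropy, LS is non-negative and PLS lies in [0, 1]; likewise MI(X, Y) <= min(H(X), H(Y)) keeps SD in [0, 1].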