Logic Regression is an adaptive regression methodology that attempts
to construct predictors as Boolean combinations of binary covariates.
In most regression problems a model is developed that relates only the
main effects (the predictors, or transformations thereof) to the
response. Although interactions between predictors are sometimes
considered as well, those interactions are usually kept simple
(two- to three-way interactions at most). But often, especially when
all predictors are binary, it is the interaction of many predictors
that drives the differences in response. This issue arises frequently
in the analysis of SNP microarray data and in data mining problems. Given
a set of binary predictors X, we try to create new, better predictors
for the response by considering combinations of those binary
predictors. For example, if the response is binary as well (which is
not required in general), we attempt to find decision rules such as
``if X1, X2, X3, and X4 are true, or if X5 or X6 but not X7 is true,
then the response is more likely to be in class 0''. In other
words, we try to find Boolean statements involving the binary
predictors that enhance the prediction of the response. In more
specific terms: Let X1,...,Xk be binary predictors, and let Y be a
response variable. We try to fit regression models of the form
g(E[Y]) = b0 + b1 L1 + ... + bn Ln, where Lj is a Boolean expression in
the predictors X, such as Lj = [(X2 or X4c) and X7] (X4c denoting the
complement of X4). The above
framework includes many forms of regression, such as linear regression
(g(E[Y])=E[Y]) and logistic regression (g(E[Y])=log(E[Y]/(1-E[Y]))).
For every model type, we define a score function that reflects the
``quality'' of the model under consideration. For example, for linear
regression the score could be the residual sum of squares and for
logistic regression the score could be the deviance. We try to find
the Boolean expressions in the regression model that minimize the
scoring function associated with this model type, estimating the
parameters bj simultaneously with the Boolean expressions Lj. In
general, any type of model can be considered, as long as a scoring
function can be defined. For example, we also implemented the Cox
proportional hazards model, using the partial likelihood as the score.
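As a concrete illustration, here is a minimal sketch of such a fit with
the logreg function. The simulated data and all variable names (X, Y,
fit) are ours, chosen only for illustration:

    library(LogicReg)
    set.seed(13)
    # 20 binary predictors; the signal sits in a single Boolean term
    X <- matrix(rbinom(500 * 20, 1, 0.5), ncol = 20)
    Y <- 2 + 2 * (X[, 1] & !X[, 2]) + rnorm(500)
    # linear logic regression: g(E[Y]) = E[Y], scored by the residual
    # sum of squares; select = 1 asks for a single best model
    fit <- logreg(resp = Y, bin = X, type = 2, select = 1,
                  ntrees = 2, nleaves = 8)
    print(fit)

For logistic regression one would use type = 3 (with a binary Y); the
search itself is identical.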
Since the number of possible Logic Models we can construct for a given
set of predictors is huge, we have to rely on some search algorithms
to help us find the best scoring models. We define the move set by a
set of standard operations, such as splitting and pruning a tree
(terminology similar to that introduced by Breiman et al. (1984) for
CART). We investigated two types of algorithms: a greedy and a
simulated annealing algorithm. While the greedy algorithm is very
fast, it does not always find a good scoring model. The simulated
annealing algorithm usually does, but computationally it is more
expensive. Since we want to be confident of finding good scoring models,
we usually carry out simulated annealing for our case studies.
However, as usual, the best scoring model generally over-fits the
data, and methods to separate signal and noise are needed. To assess
the over-fitting of large models, we offer the option to fit a model
of a specific size. For the model selection itself we developed and
implemented permutation tests and tests using cross-validation. If
sufficient data is available, an analysis using a training and a test
set can also be carried out. These tests are rather complicated, so
we will not go into detail here; instead we refer you to Ruczinski,
Kooperberg, and LeBlanc (2003), cited below.
There are two alternatives to the simulated annealing algorithm. One
is a stepwise greedy selection of models, obtained by setting
select = 6; this yields a sequence of models from size 1 up to a
maximum size. At each step, among all models that are one leaf larger
than the current model, the best-scoring model is selected. Usually
these models are not the best possible, and, if the simulated annealing
chain is long enough, you should expect the models selected using
select = 2 to be better.
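A sketch contrasting the two approaches, reusing X and Y from the
earlier sketch (the annealing settings are illustrative, not
recommendations):

    # greedy stepwise sequence of models (select = 6)
    fit.greedy <- logreg(resp = Y, bin = X, type = 2, select = 6,
                         ntrees = 2, nleaves = 8)
    # best models of each size by simulated annealing (select = 2)
    myanneal <- logreg.anneal.control(start = -1, end = -4, iter = 50000)
    fit.sa <- logreg(resp = Y, bin = X, type = 2, select = 2,
                     ntrees = 2, nleaves = 8, anneal.control = myanneal)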
The second alternative is to run a Markov Chain Monte Carlo (MCMC) algorithm.
This is what is done in Monte Carlo Logic Regression. The algorithm used
is a reversible jump MCMC algorithm, due to Green (1995). Other than
the length of the Markov chain, the only parameter that needs to be set
is a parameter for the geometric prior on model size. Unlike in many
MCMC problems, the goal in Monte Carlo Logic Regression is not to yield
one single best predicting model, but rather to provide summaries of
all models visited. These are exactly the elements that are shown above
as the output when select = 7.
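A sketch of such a run, again reusing X and Y (the chain lengths and
the prior parameter are illustrative only):

    # Monte Carlo Logic Regression via reversible jump MCMC (select = 7);
    # hyperpars sets the geometric prior on model size
    mymc <- logreg.mc.control(nburn = 1000, niter = 25000,
                              hyperpars = log(2))
    fit.mc <- logreg(resp = Y, bin = X, type = 2, select = 7,
                     ntrees = 2, nleaves = 8, mc.control = mymc)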
MONITORING
The help file for logreg.anneal.control contains more
information on how to monitor the simulated annealing optimization for
logreg. Here is some general information.
Find the best scoring model of any size (select = 1)
During the iterations the following information is printed out:
log-temp | current score | best score | acc / | rej / | sing | current parameters |
-1.000 | 1.494 | 1.494 | 0( 0) | 0 | 0 | 2.88 -1.99 0.00 |
-1.120 | 1.150 | 1.043 | 655( 54) | 220 | 71 | 3.63 0.15 -1.82 |
-1.240 | 1.226 | 1.043 | 555( 49) | 316 | 80 | 3.83 0.05 -1.71 |
... | | | | | | |
-2.320 | 0.988 | 0.980 | 147( 36) | 759 | 58 | 3.00 -2.11 1.11 |
-2.440 | 0.982 | 0.980 | 25( 31) | 884 | 60 | 2.89 -2.12 1.24 |
-2.560 | 0.988 | 0.979 | 35( 61) | 850 | 51 | 3.00 -2.11 1.11 |
... | | | | | | |
-3.760 | 0.964 | 0.964 | 2( 22) | 961 | 15 | 2.57 -2.15 1.55 |
-3.880 | 0.964 | 0.964 | 0( 17) | 961 | 22 | 2.57 -2.15 1.55 |
-4.000 | 0.964 | 0.964 | 0( 13) | 970 | 17 | 2.57 -2.15 1.55 |
log-temp:
logarithm (base 10) of the temperature at the last iteration before the print out.
current score:
the score after the last iteration.
best score:
the single lowest score seen at any iteration.
acc:
the number of proposed moves accepted since the last print out that
changed the model; within parentheses, the number of accepted moves
whose score was identical to the score before the move.
rej:
the number of proposed moves that gave numerically acceptable results,
but were rejected by the simulated annealing algorithm since the last
print out.
sing:
the number of proposed moves that were rejected because they gave
numerically unacceptable results, for example because they yielded a
singular system.
current parameters:
the values of the coefficients (first for the intercept, then for the
linear (separate) components, then for the logic trees).
This information can be used to judge the convergence of the simulated
annealing algorithm, as described in the help file of
logreg.anneal.control
. Typically we want (i) the number of
acceptances to be high in the beginning, (ii) the number of
acceptances with different scores to be low at the end, and (iii) the
number of iterations when the fraction of acceptances is moderate to
be as large as possible.
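The frequency of these print outs, as well as the annealing schedule
itself, is set through logreg.anneal.control. A sketch with
illustrative values, reusing X and Y from above:

    # cool from temperature 10^-1 down to 10^-4 over 500000 iterations,
    # printing a monitoring line every 10000 iterations
    myanneal <- logreg.anneal.control(start = -1, end = -4,
                                      iter = 500000, update = 10000)
    fit <- logreg(resp = Y, bin = X, type = 2, select = 1,
                  ntrees = 2, anneal.control = myanneal)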
Find the best scoring models for various sizes
(select = 2)
During the iterations the same information as for finding the best
scoring model of any size is printed, once for each model size considered.
Carry out cross-validation for model selection
(select = 3)
Information about the simulated annealing as described above can be
printed out. Otherwise, during the cross-validation process lines like
Step 5 of 10 [ 2 trees; 4 leaves] The CV score is 1.127 1.120 1.052 1.122
are printed. The first of the four scores is the training-set score
for the current cross-validation fold, the second is the average of
all training-set scores processed so far for this model size, the
third is the test-set score for the current fold, and the fourth is
the average of all test-set scores processed so far for this model
size. Typically we would prefer the model with the lowest test-set
score, averaged over all cross-validation steps.
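A sketch of such a cross-validation run, reusing X and Y (kfold sets
the number of folds):

    # 10-fold cross-validation over model sizes (select = 3)
    fit.cv <- logreg(resp = Y, bin = X, type = 2, select = 3,
                     ntrees = 2, nleaves = 8, kfold = 10)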
Carry out a permutation test to check for signal in the data
(select = 4)
Information about the simulated annealing as described above can be
printed out. Otherwise, first the score of the model of size 0
(typically only fitting an intercept) and the score of the best model
are printed out. Then, during the permutation test, lines like
Permutation number 21 out of 50 has score 1.47777
are printed. Each score is the result of fitting a logic tree model
on data where the response has been permuted. Typically we would
believe that there is signal in the data if most permutations have
worse (higher) scores than the best model, but we may believe that
there is substantial over-fitting going on if these permutation scores
are much better (lower) than the score of the model of size 0.
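A sketch of this null-model permutation test, reusing X and Y (nrep
sets the number of permutations):

    # refit the best model on 25 data sets with a permuted response
    fit.perm <- logreg(resp = Y, bin = X, type = 2, select = 4,
                       ntrees = 2, nleaves = 8, nrep = 25)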
Carry out a permutation test for model selection
(select = 5)
To run this option, an object of class logreg that was fitted with
select = 2 needs to be in place. Information
about the simulated annealing as described above can be printed
out. Otherwise, lines like
Permutation number 8 out of 25 has score 1.00767 model size 3 with 2 tree(s)
are printed. We can compare these scores to the score of the model of
the same size and to that of the best model. If the permutation scores
are about the same as the one for the best model, we would conclude
that the ``true'' model size may be the one being tested, while if the
scores are much worse, the true model size may be larger. The
comparison with the model of the same size again suggests how much
over-fitting may be going on.
plot.logreg
generates informative histograms.
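A sketch of this two-stage use, reusing X and Y (the select = 2 fit is
handed over through the oldfit argument):

    # first the sequence of models of different sizes (select = 2)
    fit.seq <- logreg(resp = Y, bin = X, type = 2, select = 2,
                      ntrees = 2, nleaves = 8)
    # then the conditional permutation test for model selection (select = 5)
    fit.cond <- logreg(resp = Y, bin = X, select = 5, oldfit = fit.seq,
                       nrep = 25)
    plot(fit.cond)   # the histograms mentioned above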
Greedy stepwise selection of Logic Regression models
(select = 6)
The scores of the best greedy models of each size are printed.
Monte Carlo Logic Regression
(select = 7)
A status line is printed every so many iterations. This information
is probably not very useful, other than that it helps you figure out
how far along the run is.
PARAMETERS
As Logic Regression is written in Fortran 77, some parameters had to be
hard-coded. The default values of these parameters are:
maximum number of columns in the input file: 1000
maximum number of leaves in a logic tree: 128
maximum number of logic trees: 5
maximum number of separate parameters: 50
maximum number of total parameters (separate + trees): 55
If these parameters are not large enough (an error message will let you
know this), you need to reset them in slogic.f and recompile.
In particular, the statements defining these parameters are
PARAMETER (LGCn2MAX = 1000)
PARAMETER (LGCnknMAX = 128)
PARAMETER (LGCntrMAX = 5)
PARAMETER (LGCnsepMAX = 50)
PARAMETER (LGCbetaMAX = 55)
Unfortunately, these parameter definitions occur several times in the
code, so you will need to search until you have found and changed them
all.