This function performs Bayesian variable selection and effect fusion for categorical predictors in linear and logistic regression models. Effect fusion aims at the question which categories of an ordinal or nominal predictor have a similar effect on the response and therefore can be fused to obtain a sparser representation of the model. Effect fusion and variable selection can be obtained either with a prior that has an interpretation as spike and slab prior on the level effect differences or with a sparse finite mixture prior on the level effects. The regression coefficients are estimated with a flat uninformative prior after model selection or by using model averaged results. For posterior inference, a MCMC sampling scheme is used that involves only Gibbs sampling steps. The sampling schemes for linear and logistic regression are almost identical as in the case of logistic regression a data augmentation strategy (Polson et al. (2013)) is used that requires only one additional step to sample from the Polya-Gamma distribution.
effectFusion(
y,
X,
types,
method,
prior = list(),
mcmc = list(),
mcmcRefit = list(),
family = "gaussian",
modelSelection = "binder",
returnBurnin = FALSE
)
a vector of the response observations (continuous if family =
'gaussian'
and 0/1 if family =
'binomial'
)
a data frame with covariates, each column representing one covariate. Ordinal and nominal covariates should
be of class factor
a character vector to specify the type of each covariate; 'c' indicates continuous or metric predictors, 'o' ordinal predictors and 'n' nominal predictors
controls the main prior structure that is used for effect fusion. For the prior
that has an interpretation as spike and slab prior on the level effect differences choose method =
'SpikeSlab'
and for the sparse finite mixture prior on the level effects choose method =
'FinMix'
. See
details for a description of the two approaches and their advantages and drawbacks. For comparison purposes
it is also possible to fit a full model without performing any effect fusion (method =
NULL
)
an (optional) list of prior settings and hyper-parameters for the prior (see details).
The specification of this list depends on the chosen method
and the selected family
an (optional) list of MCMC sampling options (see details)
an (optional) list of MCMC sampling options for the refit of the selected model (see details)
indicates whether linear (default, family =
'gaussian'
) or logistic regression (family =
'binomial'
)
should be performed
if modelSelection =
'binder'
the final model is selected by minimising the expected posterior binder's loss
using an algorithm of Lau and Green (2008) for the spike and slab model and an algorithm of Rastelli and Friel (2016)
for the finite mixture approach. Alternatively, modelSelection =
'pam'
can be specified for the sparse finite mixture
model. In that case, the final model is selected by using pam clustering and the silhouette coefficient (see Malsiner-Walli et al., 2018 for details).
If modelSelection
is NULL
no final model selection is performed and parameter estimates
are model averaged results. modelSelection =
'binder'
is the default value. modelSelection =
'pam'
is only available for method =
'FinMix'
. If method =
'SpikeSlab'
and modelSelection =
'pam'
,
modelSelection
is automatically set to 'binder'
. For the finite mixture approach
we recommend to use modelSelection =
'binder'
, as this algorithm provides - in contrast to pam clustering and the
the silhouette coefficient - the opportunity to exclude whole covariates.
The function returns an object of class fusion
with methods dic
,
model
, print
, summary
and plot
.
An object of class fusion
is a named list containing the following elements:
fit
beta
regression coefficients \(\beta_0\) (intercept) and \(\beta\)
delta
indicator variable \(\delta\) for slab component when method =
'SpikeSlab'
. The differences of
the level effects are assigned either to the spike (delta = 0
) or the slab component (delta = 1
). If an
effect difference is assigned to the spike component, the difference is almost zero and the corresponding level effects
should be fused.
tau2
variance \(\tau^2\) of slab component when method =
'SpikeSlab'
. If no hyperprior on \(\tau^2\) is specified, tau2
contains the fixed values for \(\tau^2\).
S
latent allocation variable \(S\) for mixture components when method =
'FinMix'
eta
mixture component weights \(\eta\) when method =
'FinMix'
eta0
weights of components located at zero \(\eta_0\) when method =
'FinMix'
mu
mixture component means \(\mu\) when method =
'FinMix'
sgma2
error variance \(\sigma^2\) of the model (only for family =
'gaussian'
)
fit_burnin
refit
beta
sgma2
X_dummy_fused
model
method
see arguments
family
see arguments
data
model
a named list containing information on the full, initial model
categories
number of categories for categorical predictors
diff
number of pairwise level effect differences
n_cont
number of metric predictors
n_ord
number of ordinal predictors
n_nom
number of nominal predictors
prior
see details for prior
mcmc
see details for mcmc
mcmcRefit
see details for mcmcRefit
modelSelection
see arguments
returnBurnin
see arguments
numbCoef
number of estimated regression coefficients (based on the refitted model if effect fusion and final model selection is performed, otherwise based on model averaged results or the full model, respectively)
call
function call
This function provides identification of categories (of ordinal and nominal predictors) with the same effect on the response and their automatic fusion.
Two different prior versions for effect fusion and variable selection are available. The first prior version allows a priori for almost perfect as well as almost zero dependence between level effects. This prior has also an interpretation as independent spike and slab prior on all pairwise differences of level effects and correction for the linear dependence of the effect differences. Even though the prior is mainly designed for fusion of level effects, excluding some categories from the model as well as the whole covariate (variable selection) can also be easily accomplished. Excluding a category from the model corresponds to fusion of this category to the baseline and excluding the whole covariate consequently to fusion of all categories to the baseline.
The second prior is a modification of the usual spike and slab prior for the regression coefficients by combining
a spike at zero with a finite location mixture of normal components. It enables detection of categories with similar
effects on the response by clustering the regression effects. Categories with effects that are allocated to the same
cluster are fused. Due to the specification with one component located at zero also automatic exclusion of
levels and whole covariates without any effect on the outcome is provided. However, when using modelSelection =
'pam'
,
it is not possible to exclude whole covariates as the Silhouette coefficient does not allow for one cluster solutions. Therefore,
we recommend to use modelSelection =
'binder'
.
In settings with large numbers of categories, we recommend to use the sparse finite mixture prior for computational reasons. It is important to note that the sparse finite mixture prior on the level effects does not take into account the ordering information of ordinal predictors and treats them like nominal predictors, whereas in the spike and slab case fusion is restricted to adjacent categories for ordinal predictors.
Metric predictors can be included in the model as well and variable selection will be performed also for these predictors.
If modelSelection =
NULL
, no final model selection is performed and model averaged results are returned. When modelSelection =
'binder'
the final
model is selected by minimizing the expected posterior Binder loss for each covariate separately.
Additionally, there is a second option for the finite mixture prior (modelSelection =
'pam'
) which performs model selection
by identifying the optimal partition of the effects using PAM clustering and the silhouette coefficient.
For comparison purposes it is also possible to fit a full model instead of performing effect fusion (method =
NULL
).
All other functions provided in this package, such as dic
or summary
, do also work for the full model.
Details for the model specification (see arguments):
prior
r
variance ratio of slab to spike component; default to 50000 if family =
'gaussian'
and
5000000 if family =
'binomial'
. r
should be chosen not too small but still small enough to avoid stickiness of MCMC. We
recommend a value of at least 20000.
g0
shape parameter of inverse gamma distribution on \(\tau^2\) when tau2_fix =
NULL
and method =
'SpikeSlab'
; default
to 5. The default value is a standard choice in variable selection where the tails of spike and slab component are not too thin
to cause mixing problems in MCMC.
G0
scale parameter of inverse gamma distribution on \(\tau^2\) when tau2_fix =
NULL
and method =
'SpikeSlab'
; default
to 25. G0
controls to some extend the sparsity of the model. Smaller values for G0
help to detect also small
level effect differences, but result in less fusion of categories.
tau2_fix
If tau2_fix =
NULL
, an inverse gamma hyper-prior is specified on \(\tau^2\).
However, the value of the slab variance can also be fixed for each covariate instead of using a hyperprior.
tau2_fix
is only of interest if method =
'SpikeSlab'
.
Default to NULL
. Similar to the scale parameter G0
, the fixed variance of the slab
component tau2_fix
can control to some extend the sparsity.
e0
parameter of Dirichlet hyper-prior on mixture weights when method =
'FinMix'
;
default to 0.01. e0
should be chosen smaller than 1 in order to encourage empty components. Small values such as
0.01 help to concentrate the model space on sparse solutions.
p
prior parameter to control mixture component variances when method =
'FinMix'
; values between 100 and 100000
led to good results in simulation studies; default to 100 if family =
'gaussian'
and 1000 if family =
'binomial'
. We recommend to try different values for this
prior parameter and compare the models using dic
. When hyperprior =
FALSE
, larger values of p
lead to less sparsity and it should be chosen
not smaller than 100. If a hyper-prior on the mixture component variances is used (hyperprior =
TRUE
), p
has almost no effect on the sparsity of the model and it should again be not smaller than 100.
hyperprior
logical value if inverse gamma hyper-prior on component variance should be specified when method =
'FinMix'
;
default to FALSE. The hyper-prior leads to robust results concerning the specification of p
but also to very sparse solutions.
s0
hyper-parameter (shape) of inverse gamma distribution on error variance, used for both
versions of method
, but only for family =
'gaussian'
; default to 0.
S0
hyper-parameter (scale) of inverse gamma distribution on error variance, used for both
versions of method
, but only for family =
'gaussian'
; default to 0.
mcmc
M
number of MCMC iterations after the burn-in phase; default to 20000 for effect fusion models and 3000 for full models.
burnin
number of MCMC iterations discarded as burn-in; default to 5000 for effect fusion models and 1000 for full models.
startsel
number of MCMC iterations drawn from the model without performing effect fusion; default to 1000 for effect fusion models and 0 for full models.
mcmcRefit
M_refit
number of MCMC iterations after the burn-in phase for the refit of the selected model; default to 3000.
burnin_refit
number of MCMC iterations discarded as burn-in for the refit of the selected model; default to 1000.
Pauger, D., and Wagner, H. (2019). Bayesian Effect Fusion for Categorical Predictors. Bayesian Analysis, 14(2), 341-369. 10.1214/18-BA1096
Malsiner-Walli, G., Pauger, D., and Wagner, H. (2018). Effect Fusion Using Model-Based Clustering. Statistical Modelling, 18(2), 175-196. 10.1177/1471082X17739058
Polson, N.G., Scott, J.G., and Windle, J. (2013). Bayesian Inference for Logistic Models Using Polya-Gamma Latent Variables. Journal of the American Statistical Association, 108(504), 1339-1349. 10.1080/01621459.2013.829001
# NOT RUN {
# ----------- Load simulated data set 'sim1' for linear regression
data(sim1)
y = sim1$y
X = sim1$X
types = sim1$types
# ----------- Bayesian effect fusion for simulated data set with spike and slab prior
m1 <- effectFusion(y, X, types, method = 'SpikeSlab')
# print, summarize and plot results
print(m1)
summary(m1)
plot(m1)
# evaluate model and model criteria
model(m1)
dic(m1)
# ----------- Use finite mixture prior for comparison
m2 <- effectFusion(y, X, types, method = 'FinMix')
# summarize and plot results
print(m2)
summary(m2)
plot(m2)
model(m2)
dic(m2)
# change prior parameter specification
m3 <- effectFusion(y, X, types, prior= list(p = 10^3), method = 'FinMix')
plot(m3)
# ------------ Use model averaged coefficient estimates
m4 <- effectFusion(y, X, types, method = 'SpikeSlab', modelSelection = NULL)
summary(m4)
# ------------ Estimate full model for comparison purposes
m5 <- effectFusion(y, X, types, method = NULL)
summary(m5)
plot(m5)
dic(m5)
# ----------- Load simulated data set 'sim3' for logistic regression
data(sim3)
y = sim3$y
X = sim3$X
types = sim3$types
# ----------- Bayesian effect fusion for simulated data set with finite mixture prior
m6 <- effectFusion(y, X, types, method = 'FinMix', prior = list(p = 10^4), family = 'binomial')
# look at the results
print(m6)
summary(m6)
plot(m6)
model(m6)
dic(m6)
# ----------- Use spike and slab prior for comparison
m7 <- effectFusion(y, X, types, method = 'SpikeSlab', family = 'binomial', returnBurnin = TRUE)
# summarize and evaluate results
print(m7)
summary(m7)
plot(m7)
model(m7)
dic(m7)
# }
Run the code above in your browser using DataLab