
vcrpart (version 1.0-6)

tvcglm: Coefficient-wise tree-based varying coefficient regression based on generalized linear models

Description

The tvcglm function implements the tree-based varying coefficient regression algorithm for generalized linear models introduced by Burgin and Ritschard (2017). The algorithm approximates varying coefficients by piecewise constant functions using recursive partitioning, i.e., it estimates the selected coefficients individually by strata of the value space of the partitioning variables. The special feature of the algorithm is that it builds an individual partition for each varying coefficient, which enhances the flexibility of model specification and allows partitioning variables to be selected individually by coefficient.

Usage

tvcglm(formula, data, family, 
       weights, subset, offset, na.action = na.omit, 
       control = tvcglm_control(), ...)

tvcglm_control(minsize = 30, mindev = 2.0,
               maxnomsplit = 5, maxordsplit = 9, maxnumsplit = 9,
               cv = TRUE, folds = folds_control("kfold", 5),
               prune = cv, fast = TRUE, center = fast,
               maxstep = 1e3, verbose = FALSE, ...)

Value

An object of class tvcm.

Arguments

formula

a symbolic description of the model to fit, e.g.,

y ~ vc(z1, z2, z3) + vc(z1, z2, by = x1) + vc(z2, z3, by = x2)

where the vc terms specify the varying fixed coefficients. The unnamed arguments within vc terms are interpreted as partitioning variables (i.e., moderators). The by argument specifies the associated predictor variable. If no predictor variable is specified (see, e.g., the first term in the above example formula), the vc term is interpreted as a varying intercept, i.e., a nonparametric estimate of the direct effect of the partitioning variables. For details, see vcrpart-formula. Note that the global intercept may be removed by a -1 term, according to the desired interpretation of the model.
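For illustration, a few further formula sketches (the variables y, x1, x2, z1 and z2 are hypothetical):

## varying intercept only: direct effect of the moderators z1 and z2,
## with the global intercept removed
y ~ -1 + vc(z1, z2)

## varying slope for x1, plus a global (non-varying) effect of x2
y ~ vc(z1, z2, by = x1) + x2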

family

the model family. An object of class family.

data

a data frame containing the variables in the model.

weights

an optional numeric vector of weights to be used in the fitting process.

subset

an optional logical or integer vector specifying a subset of 'data' to be used in the fitting process.

offset

this can be used to specify an a priori known component to be included in the linear predictor during fitting.

na.action

a function that indicates what should happen if the data contain NAs. The default na.action = na.omit is listwise deletion, i.e., observations with missing values on any variable are dropped. See na.action.

control

a list with control parameters as returned by tvcglm_control, or by tvcm_control for advanced users.

minsize

numeric (vector). The minimum sum of weights in terminal nodes.

mindev

numeric scalar. The minimum training error reduction a split must achieve to be considered as a new split. The main role of this parameter is to save computing time by early stopping. It may be set lower for few partitioning variables and higher for many partitioning variables.

maxnomsplit, maxordsplit, maxnumsplit

integer scalars for split candidate reduction. See tvcm_control.

cv

logical scalar. Whether or not the cp parameter should be cross-validated. If TRUE, cvloss is called.

folds

a list of parameters to create folds, as produced by folds_control. These parameters are used for cross-validation.

prune

logical scalar. Whether or not the initial tree should be pruned by the estimated cp parameter from cross-validation. Cannot be TRUE if cv = FALSE.

fast

logical scalar. Whether the approximative model should be used to search for the next split. The approximative search model uses only the observations of the node to split and incorporates the fitted values of the current model as offsets. This reduces the estimation to the coefficients of the added split. If FALSE, the accurate search model is used.

center

logical scalar. Whether the predictor variables of update models during the grid search should be centered. Note that TRUE will not modify the predictors of the fitted model.

maxstep

integer. The maximum number of iterations, i.e., the number of splits to be processed.

verbose

logical. Should information about the fitting process be printed to the screen?

...

additional arguments passed to the fitting function fit or to tvcm_control.
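Several of these control parameters can be combined in a single call; as a sketch, with illustrative values rather than recommendations:

## larger minimum node size, stricter stopping rule, and 10-fold
## cross-validation of the cp parameter with subsequent pruning
ctrl <- tvcglm_control(minsize = 50, mindev = 4.0,
                       folds = folds_control("kfold", 10),
                       cv = TRUE, prune = TRUE, verbose = TRUE)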

Author

Reto Burgin

Details

tvcglm proceeds in two stages. The first stage, called the partitioning stage, builds overly fine partitions for each vc term; the second stage, called the pruning stage, selects the best-sized partitions by collapsing inner nodes. For details on the pruning stage, see tvcm-assessment. The partitioning stage iterates the following steps:

  1. Fit the current generalized linear model

    y ~ Node1:x1 + ... + NodeK:xK

    with glm, where Nodek is a categorical variable with terminal node labels for the k-th varying coefficient.

  2. Search for the globally best split among the candidate splits by an exhaustive -2 log-likelihood training error search that cycles through all possible splits.

  3. If the -2 log-likelihood training error reduction of the best split is smaller than mindev, or if no candidate split satisfies the minimum node size minsize, stop the algorithm.

  4. Otherwise, incorporate the best split and repeat the procedure.

In each iteration, the partitioning stage selects the split that yields the largest -2 log-likelihood training error reduction relative to the current model. The default stopping parameters are minsize = 30 (a minimum node size of 30) and mindev = 2 (the training error reduction of the best split must exceed two for the algorithm to continue).
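To illustrate step 1, the current model is an ordinary glm in which each varying coefficient is interacted with its node factor. A minimal toy sketch with made-up data, where Node1 and Node2 stand for the current partitions of a varying intercept and a varying slope:

set.seed(1)
d <- data.frame(y  = rbinom(100, 1, 0.5),
                x1 = rnorm(100),
                Node1 = factor(sample(c("A", "B"), 100, replace = TRUE)),
                Node2 = factor(sample(c("A", "B"), 100, replace = TRUE)))

## nodewise intercepts (Node1) and nodewise slopes for x1 (Node2)
fit <- glm(y ~ -1 + Node1 + Node2:x1, family = binomial(), data = d)
coef(fit)  ## one intercept per Node1 level, one x1 slope per Node2 level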

The algorithm implements a number of split point reduction methods to decrease the computational complexity. See the arguments maxnomsplit, maxordsplit and maxnumsplit.
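For instance, one might coarsen or refine the candidate grids as follows (illustrative values):

## fewer candidate splits for nominal moderators, more cutpoints for
## ordinal and numeric moderators
ctrl2 <- tvcglm_control(maxnomsplit = 3, maxordsplit = 15, maxnumsplit = 15)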

The algorithm can be seen as an extension of CART (Breiman et al., 1984) and PartReg (Wang and Hastie, 2014), with the new feature that partitioning can be processed coefficient-wise.

References

Breiman, L., J. H. Friedman, R. A. Olshen and C. J. Stone (1984). Classification and Regression Trees. New York, USA: Wadsworth.

Wang, J. C. and T. Hastie (2014). Boosted Varying-Coefficient Regression Models for Product Demand Prediction. Journal of Computational and Graphical Statistics, 23(2), 361--382.

Burgin, R. and G. Ritschard (2017). Coefficient-Wise Tree-Based Varying Coefficient Regression with vcrpart. Journal of Statistical Software, 80(6), 1--33.

See Also

tvcm_control, tvcm-methods, tvcm-plot, tvcm-assessment, fvcglm, glm

Examples

## ------------------------------------------------------------------- #  
## Example: Moderated effect of education on poverty
##
## The algorithm is used to find out whether the effect of high
## education 'EduHigh' on poverty 'Poor' is moderated by the civil
## status 'CivStat'. We specify two 'vc' terms in the logistic
## regression model for 'Poor': a first that accounts for the direct
## effect of 'CivStat' and a second that accounts for the moderation of
## 'CivStat' on the relation between 'EduHigh' and 'Poor'. We use here
## the two-stage procedure with a partitioning and a pruning stage as
## described in Burgin and Ritschard (2017). 
## ------------------------------------------------------------------- #

data(poverty)
poverty$EduHigh <- 1 * (poverty$Edu == "high")

## fit the model
model.Pov <-
  tvcglm(Poor ~ -1 +  vc(CivStat) + vc(CivStat, by = EduHigh) + NChild, 
         family = binomial(), data = poverty, subset = 1:200,
         control = tvcm_control(verbose = TRUE, papply = lapply,
           folds = folds_control(K = 1, type = "subsampling", seed = 7)))

## diagnosis
plot(model.Pov, "cv")
plot(model.Pov, "coef")
summary(model.Pov)
splitpath(model.Pov, steps = 1:3)
prunepath(model.Pov, steps = 1)
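
## further standard 'tvcm' methods one might try on the fitted object
## (a sketch; see tvcm-methods for the available methods)
coef(model.Pov)
head(predict(model.Pov, type = "response"))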
