## Simulate data.
set.seed(123)
d <- GAMart(n = 27000, sd = -1)
## Write data to disc.
tf <- tempdir()
write.table(d, file.path(tf, "d.raw"), quote = FALSE, row.names = FALSE, sep = ",")
## Model formula.
f <- list(
  y ~ s(x1, k = 40) + s(x2, k = 40) + s(x3, k = 40) + te(lon, lat, k = 10),
  sigma ~ s(x1, k = 40) + s(x2, k = 40) + s(x3, k = 40) + te(lon, lat, k = 10)
)
## Specify 50 batches with 1000 observations each.
batch_ids <- c("nobs" = 1000, "nbatch" = 50)
## Note, can also be a list of indices, e.g.
## batch_ids <- lapply(1:50, function(i) { sample(1:nrow(d), size = 1000) })
## Different flavors:
## (1) Using the "out-of-sample" AIC for smoothing
## variance estimation. An update is only accepted
## if the "out-of-sample" log-likelihood is
## increased. If data is a file path, the data set is
## read into R using package ff, and model and
## design matrices are processed with ff. This may
## take some time depending on the size of the data.
set.seed(1)
b1 <- bamlss(f, data = file.path(tf, "d.raw"),
  sampler = FALSE, optimizer = opt_bbfit,
  batch_ids = batch_ids, nu = 0.1, aic = TRUE, eps_loglik = -Inf,
  always = FALSE)
## Plot estimated effects.
## plot(b1)
## Plot coefficient paths for x3 in mu.
## pathplot(b1, name = "mu.s.s(x3).b")
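## A quick sanity check of the fitted model, e.g., via the
## standard bamlss summary method (optional sketch):
## summary(b1)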
## (2) Same, but always update; this mimics classic SGD.
## Note, for prediction only the last iteration is
## used in this case. To use more iterations, use opt_bbfitp();
## then the iterations are stored as an "mcmc" object and we can
## predict using the burnin argument, e.g.,
## p <- predict(b2, model = "mu", burnin = 35)
## (a sketch of this variant follows below).
set.seed(2)
b2 <- bamlss(f, data = file.path(tf, "d.raw"),
  sampler = FALSE, optimizer = opt_bbfit,
  batch_ids = batch_ids, nu = 0.1, aic = TRUE, eps_loglik = -Inf,
  always = TRUE)
## Plot coefficient paths for x3 in mu.
## pathplot(b2, name = "mu.s.s(x3).b")
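## Sketch of the opt_bbfitp() variant mentioned above, which
## stores the iterates as "mcmc" objects. Note, this assumes
## opt_bbfitp() accepts the same batchwise arguments as
## opt_bbfit():
## set.seed(3)
## b2p <- bamlss(f, data = file.path(tf, "d.raw"),
##   sampler = FALSE, optimizer = opt_bbfitp,
##   batch_ids = batch_ids, nu = 0.1, aic = TRUE, always = TRUE)
## p <- predict(b2p, model = "mu", burnin = 35)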
## (3) Boosting-type flavor: only update the model term with
## the largest contribution to the "out-of-sample"
## log-likelihood. In this case, if edf = 0 during
## runtime of the algorithm, no model term has an additional
## contribution and the algorithm converges. This
## behavior is controlled by argument eps_loglik: the
## higher eps_loglik, the more restrictive the
## updating step becomes.
## Initialize intercepts.
set.seed(0)
batch_ids <- lapply(1:400, function(i) { sample(1:nrow(d), size = 1000) })
b0 <- bamlss(y ~ 1, data = d, sampler = FALSE, optimizer = opt_bbfitp,
  batch_ids = batch_ids)
## Compute starting values: remove the first
## 200 iterates as burn-in and compute the mean of the
## remaining iterates.
start <- coef(b0, FUN = mean, burnin = 200)
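## Optionally inspect the starting values, a named
## vector of intercepts (sketch):
## print(start)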
## Start boosting; only update a model term if the change
## in the "out-of-sample" log-likelihood is at least
## eps_loglik = 0.001.
b3 <- bamlss(f, data = d, sampler = FALSE, optimizer = opt_bbfit,
  batch_ids = batch_ids, nu = 0.1, aic = TRUE, eps_loglik = 0.001,
  select = TRUE, always = FALSE, start = start)
## Plot log-likelihood contributions.
## contribplot(b3)
## In this case the algorithm did not converge;
## we need more iterations/batches.
## Note, prediction uses the last iterate.
p3 <- predict(b3, model = "mu")
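## Quick in-sample check of the predictions against the
## response y simulated by GAMart() (sketch):
mean((d$y - p3)^2)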
## (4) Use slice sampling under the "out-of-sample"
## log-likelihood for estimation of the smoothing
## variances. In this case model terms are always
## updated and parameter paths behave like an MCMC
## chain. Therefore, use opt_bbfitp(), which stores
## parameter paths as "mcmc" objects that we can
## inspect using traceplots. Note, nu = 1 if
## slice = TRUE.
set.seed(4)
b4 <- bamlss(f, data = d, sampler = FALSE, optimizer = opt_bbfitp,
  batch_ids = batch_ids, aic = TRUE, slice = TRUE)
## plot(b4)
## Plot parameter updates/samples.
## plot(b4, which = "samples")
## Predict with burnin and compute the mean
## prediction over the remaining iterates.
p4 <- predict(b4, model = "mu", burnin = 30, FUN = mean)
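## Same check for the slice sampling fit, plus a comparison
## of the flavor (3) and (4) predictions (sketch):
mean((d$y - p4)^2)
## plot(p3 ~ p4)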