
Laurae (version 0.0.0.9001)

Lextravagenza: Laurae's Extravagenza machine learning model

Description

This function trains a machine learning model that uses xgboost with a dynamic tree depth, but ignores the gradient boosting enhancements of xgboost. It outperforms xgboost in nearly every scenario where the number of boosting iterations is small. When the number of boosting iterations is large (for instance, 100), this model performs worse than typical gradient boosted tree implementations. It does not work on multiclass problems.

Usage

Lextravagenza(train, valid, test, maximize = FALSE, personal_rounds = 100,
  personal_depth = 1:10, personal_eta = 0.2, auto_stop = 10,
  base_margin = 0.5, seed = 0, ...)

Arguments

train
Type: xgb.DMatrix. The training data. It will be used for training the models.
valid
Type: xgb.DMatrix. The validation data. It will be used to select the tree depth at each boosting iteration (to assess generalization).
test
Type: xgb.DMatrix. The testing data. It will be used for early stopping.
maximize
Type: boolean. Whether to maximize or minimize the loss function. Defaults to FALSE.
personal_rounds
Type: integer. The number of boosting iterations. Defaults to 100.
personal_depth
Type: vector of integers. The possible depth values during boosting of trees. Defaults to 1:10, which means a depth between 1 and 10, therefore c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).
personal_eta
Type: numeric. The shrinkage (learning rate). Lower values mean lower overfitting. Defaults to 0.20.
auto_stop
Type: integer. The early stopping value. When the metric does not improve for auto_stop iterations, the training is interrupted and the model returned to the user. Defaults to 10.
base_margin
Type: numeric. The base prediction value. For binary classification, it is recommended to set it to the proportion of observations with label 1 (the number of label-1 observations divided by the total number of observations), although this is not mandatory; a sketch of this computation follows the argument list. Defaults to 0.5.
seed
Type: integer. Random seed used for training. Defaults to 0.
...
Other arguments to pass to xgb.train. Examples: nthread = 1, eta = 0.4...
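
For binary classification, the recommended base_margin is simply the proportion of observations with label 1 in the training data. A minimal sketch of that computation, assuming dtrain is an xgb.DMatrix built as in the Examples below:

library(xgboost)

labels <- getinfo(dtrain, "label")                        # extract the 0/1 labels from the xgb.DMatrix
recommended_margin <- sum(labels == 1) / length(labels)   # number of label 1 / number of observations
# pass base_margin = recommended_margin to Lextravagenza() instead of the default 0.5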

Value

A list with the model (model), the parameters (eta, base_margin), the best training iteration for generalization (best_iter), the depth evolution over the number of iterations (depth_tree), the validation score (valid_loss), and the test score (test_loss).

Details

Dynamic depth allows training boosted trees whose depth can change at every iteration, so the model fits the data better early on, but also overfits it more quickly. Because the validation set is used as feedback during training (to choose the depth), a second held-out set (the test set) is needed for early stopping, which is an uncommon setup in machine learning; a sketch of such a three-way split is given below.

The Extravagenza model does not leverage the gradient and hessian to optimize the learning appropriately, hence it overfits faster while using no knowledge of previous trainings other than the last tree. Do not use this method when you need to train large trees: without the previous gradients/hessians, generalization is poor (although still better than most non-ensemble models). Typically, where an xgboost model needs only 75 iterations, the Extravagenza model requires about 200 iterations to (potentially) outperform it. For example, on the House Prices data set using RMSE, you can try to beat xgboost.

In addition, you will need a recent xgboost build (including at least pull request 1964) if you want to train without spamming the console, because verbose = 0 previously did not record the evaluation metric.
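
Because training requires both a validation set (depth tuner) and a test set (early stopper), the data must be split three ways. A minimal sketch of such a split, assuming a hypothetical feature matrix X and label vector y:

library(xgboost)

set.seed(0)
n <- nrow(X)
idx <- sample(n)                                          # shuffle observation indices

train_idx <- idx[1:floor(0.6 * n)]                        # 60% training
valid_idx <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]     # 20% validation (depth tuner)
test_idx  <- idx[(floor(0.8 * n) + 1):n]                  # 20% test (early stopper)

dtrain <- xgb.DMatrix(X[train_idx, ], label = y[train_idx])
dval   <- xgb.DMatrix(X[valid_idx, ], label = y[valid_idx])
dtest  <- xgb.DMatrix(X[test_idx, ],  label = y[test_idx])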

Examples

## Not run: ------------------------------------
# library(Laurae)
# library(xgboost)
# data(agaricus.train, package='xgboost')
# data(agaricus.test, package='xgboost')
# dtrain <- xgb.DMatrix(agaricus.train$data[1:5000, ], label = agaricus.train$label[1:5000])
# dval <- xgb.DMatrix(agaricus.train$data[5001:6513, ], label = agaricus.train$label[5001:6513])
# dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
# Lex_model <- Lextravagenza(train = dtrain, # Train data
#                            valid = dval, # Validation data = depth tuner
#                            test = dtest, # Test data = early stopper
#                            maximize = FALSE, # Not maximizing RMSE
#                            personal_rounds = 50, # Boosting for 50 iterations
#                            personal_depth = 1:8, # Dynamic depth between 1 and 8
#                            personal_eta = 0.40, # Shrinkage of boosting to 0.40
#                            auto_stop = 5, # Early stopping of 5 iterations
#                            base_margin = 0.5, # Start with 0.5 probabilities
#                            seed = 0, # Random seed
#                            nthread = 1, # 1 thread for training
#                            eta = 0.40, # xgboost shrinkage of 0.40 (avoid fast overfit)
#                            booster = "gbtree", # train trees, can't work with GLM
#                            objective = "binary:logistic", # classification, binary
#                            eval_metric = "rmse" # RMSE metric to optimize
# )
# 
# str(Lex_model, max.level = 1) # Get list of the model structure
# 
# predictedValues <- pred.Lextravagenza(Lex_model, dtest, nrounds = Lex_model$best_iter)
# all.equal(sqrt(mean((predictedValues - agaricus.test$label)^2)),
#           Lex_model$test[Lex_model$best_iter])
# 
# # Get depth evolution vs number of boosting iterations
# plot(x = 1:length(Lex_model$depth),
#      y = Lex_model$depth,
#      main = "Depth vs iterations",
#      xlab = "Iterations",
#      ylab = "Depth")
# 
# # Get validation evolution vs number of boosting iterations
# plot(x = 1:length(Lex_model$valid),
#      y = Lex_model$valid,
#      main = "Validation loss vs iterations",
#      xlab = "Iterations",
#      ylab = "Validation loss")
# 
# # Get testing evolution vs number of boosting iterations
# plot(x = 1:length(Lex_model$test),
#      y = Lex_model$test,
#      main = "Testing loss vs iterations",
#      xlab = "Iterations",
#      ylab = "Testing loss")
## ---------------------------------------------
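
As suggested in Details, you can compare the result against a plain xgboost baseline trained on the same split. A minimal sketch under the same settings as the example above (the exact parameter values are illustrative, not prescribed by the package):

library(xgboost)

xgb_baseline <- xgb.train(params = list(booster = "gbtree",
                                        objective = "binary:logistic",
                                        eval_metric = "rmse",
                                        eta = 0.40,
                                        nthread = 1),
                          data = dtrain,
                          nrounds = 50,
                          watchlist = list(valid = dval, test = dtest),
                          early_stopping_rounds = 5,
                          maximize = FALSE,
                          verbose = 0)

# Test RMSE of the baseline, to compare with Lex_model$test[Lex_model$best_iter]
sqrt(mean((predict(xgb_baseline, dtest) - agaricus.test$label)^2))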
