Laurae (version 0.0.0.9001)

xgb.opt.depth: xgboost depth automated optimizer

Description

This function optimizes the depth of xgboost with the gbtree/dart booster while holding the other parameters constant. Output is intentionally pushed to the global environment, specifically to Laurae.xgb.opt.depth.df, Laurae.xgb.opt.depth.iter, and Laurae.xgb.opt.depth.best, so that a manual interruption does not lose data. Verbosity is automatic and cannot be turned off; if you need this function without verbosity, compile the package after removing the verbose messages. In addition, a sink is forced: make sure to run sink() if you (or xgboost) interrupt the execution of the function prematurely, otherwise no further messages will be printed to your R console.
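
For example, if the run is interrupted, a minimal recovery sketch (assuming a search had already started) is:

sink()                      # restore console output after the forced sink
Laurae.xgb.opt.depth.df     # partial depth log gathered so far
Laurae.xgb.opt.depth.best   # best depth found before the interruption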

Usage

xgb.opt.depth(initial = 8, min_depth = 1, max_depth = 25, patience = 2,
  sd_effect = 0.001, worst_score = 0, learner = NA, better = max_better)

Arguments

initial
The initial search depth. The search starts at this depth, along with depths initial - 2 and initial + 2. Defaults to 8.
min_depth
The minimum accepted depth. If it is reached, the computation stops. Defaults to 1.
max_depth
The maximum accepted depth. If it is reached, the computation stops. Defaults to 25.
patience
How many iterations are allowed without improvement, excluding the initialization (the first three computations). A larger value means more patience before stopping when the scored metric no longer improves. Defaults to 2.
sd_effect
How much the standard deviation contributes to the score used to determine the best depth (the example callback computes score = mean + sd * sd_effect). Defaults to 0.001.
worst_score
The worst possible score of the metric used, as a finite, non-NA numeric value. Defaults to 0.
learner
The learner function. It fetches everything it needs from the global environment. Defaults to my_learner; see the examples below for how to write such a function.
better
Whether to optimize for the minimum or the maximum value of the performance metric. Defaults to max_better for maximization of the scored metric; use min_better for minimization (see the sketch below).
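
As a reference, a minimal sketch of the two comparison helpers applied to the per-depth scores (max_better matches the example below; min_better is written here by symmetry and is an assumption, not package code):

max_better <- function(cp) max(cp, na.rm = TRUE)  # best = highest score
min_better <- function(cp) min(cp, na.rm = TRUE)  # best = lowest score (assumed counterpart)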

Value

Three objects forced into the global environment: "Laurae.xgb.opt.depth.df", the depth log (data.frame); "Laurae.xgb.opt.depth.iter", the iteration log (list); and "Laurae.xgb.opt.depth.best", the best depth found (numeric of length 1).
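
A minimal sketch of reading these objects back after a run (column names taken from the example callback below):

Laurae.xgb.opt.depth.best                             # best depth found (numeric of length 1)
Laurae.xgb.opt.depth.df[Laurae.xgb.opt.depth.best, ]  # mean, sd, nrounds and score at that depth
Laurae.xgb.opt.depth.iter                             # iteration log (Depth, Score, Best)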

Examples

# Please check the xgb.opt.utils.R file in GitHub.
## Not run: ------------------------------------

library(xgboost)

# Comparison helper: the best score is the highest score observed so far.
max_better <- function(cp) {
  return(max(cp, na.rm = TRUE))
}

# Learner: cross-validates xgboost at the requested depth and returns
# c(mean, sd, nrounds) of the best iteration. It assumes dtrain, folded and
# mcc_eval_nofail_cv already exist in the global environment, and appends
# its log to Laurae/log.txt.
my_learner <- function(depth) {
  sink(file = "Laurae/log.txt", append = TRUE, split = FALSE)
  cat("\n\n\nDepth ", depth, "\n\n", sep = "")
  global_depth <<- depth
  gc()
  set.seed(11111)
  temp_model <- xgb.cv(data = dtrain,
                       nthread = 12,
                       folds = folded,
                       nrounds = 100000,
                       max_depth = depth,
                       eta = 0.05,
                       #gamma = 0.1,
                       subsample = 1.0,
                       colsample_bytree = 1.0,
                       booster = "gbtree",
                       #eval_metric = "auc",
                       eval_metric = mcc_eval_nofail_cv,
                       maximize = TRUE,
                       early_stopping_rounds = 25,
                       objective = "binary:logistic",
                       verbose = TRUE
                       #base_score = 0.005811208
  )
  sink()
  i <<- 0
  return(c(temp_model$evaluation_log[[4]][temp_model$best_iteration],
           temp_model$evaluation_log[[5]][temp_model$best_iteration],
           temp_model$best_iteration))
}

# Callback: stores mean/sd/nrounds for the explored depth, computes its score
# as mean + sd * sd_effect, and tracks the best depth found so far.
xgb.opt.depth.callback <- function(i, learner, better, sd_effect) {
  depth <- Laurae.xgb.opt.depth.iter[i, "Depth"]
  cat("\nExploring depth ", sprintf("%02d", depth), ": ")
  Laurae.xgb.opt.depth.df[depth, c("mean", "sd", "nrounds")] <<- learner(depth)
  Laurae.xgb.opt.depth.df[depth, "score"] <<-
    Laurae.xgb.opt.depth.df[depth, "mean"] +
    (Laurae.xgb.opt.depth.df[depth, "sd"] * sd_effect)
  Laurae.xgb.opt.depth.iter[i, "Score"] <<- Laurae.xgb.opt.depth.df[depth, "score"]
  Laurae.xgb.opt.depth.iter[i, "Best"] <<- better(Laurae.xgb.opt.depth.df[, "score"])
  Laurae.xgb.opt.depth.best <<- which(
    Laurae.xgb.opt.depth.df[, "score"] == Laurae.xgb.opt.depth.iter[i, "Best"])[1]
  cat("[",
      sprintf("%05d", Laurae.xgb.opt.depth.df[depth, "nrounds"]),
      "] ",
      sprintf("%.08f", Laurae.xgb.opt.depth.df[depth, "mean"]),
      ifelse(is.na(Laurae.xgb.opt.depth.df[depth, "mean"]),
             "",
             paste("+", sprintf("%.08f", Laurae.xgb.opt.depth.df[depth, "sd"]), sep = "")),
      " (Score: ",
      sprintf("%.08f", Laurae.xgb.opt.depth.df[depth, "score"]),
      ifelse(Laurae.xgb.opt.depth.iter[i, "Best"] == Laurae.xgb.opt.depth.iter[i, "Score"],
             " <<<)",
             "    )"),
      " - best is: ",
      Laurae.xgb.opt.depth.best,
      " - ",
      format(Sys.time(), "%a %b %d %Y %X"),
      sep = "")
}

xgb.opt.depth(initial = 10, min_depth = 1, max_depth = 20, patience = 2, sd_effect = 0,
              worst_score = 0, learner = my_learner, better = max_better)

## ---------------------------------------------
