
Laurae (version 0.0.0.9001)

lgbm.train: LightGBM Model Training

Description

This function trains a LightGBM model. It is recommended to provide x_train and x_val as data.table objects and to use the development version of data.table. To install it, run in your R console: install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table"). Creating the train and test files can be more than 1,000x faster than with write.table in certain cases. To store evaluation metrics throughout training, you MUST run this function with verbose = FALSE.
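The speed difference mentioned above can be checked on your own machine. The following is a minimal sketch, assuming the speed-up comes from data.table's fwrite (how lgbm.train writes its files internally is not stated here, so treat that as an assumption). The data is synthetic and the comparison is skipped when data.table is not installed.

```r
# Synthetic data, only to have something to write
set.seed(1)
x <- data.frame(a = rnorm(1e5), b = runif(1e5))

# Baseline: base R write.table to a temporary CSV file
csv_base <- tempfile(fileext = ".csv")
time_base <- system.time(
  write.table(x, csv_base, sep = ",", row.names = FALSE)
)
print(time_base[["elapsed"]])

# Comparison: data.table::fwrite, only when data.table is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  csv_dt <- tempfile(fileext = ".csv")
  time_dt <- system.time(
    data.table::fwrite(data.table::as.data.table(x), csv_dt)
  )
  print(time_dt[["elapsed"]])
}
```

Timings depend heavily on data size, disk, and the data.table version, so no particular ratio should be expected from this small example.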

Usage

lgbm.train(y_train, x_train, bias_train = NA, y_val = NA, x_val = NA,
  x_test = NA, SVMLight = is(x_train, "dgCMatrix"), data_has_label = TRUE,
  NA_value = "na", lgbm_path = "path/to/LightGBM.exe",
  workingdir = getwd(), train_name = paste0("lgbm_train", ifelse(SVMLight,
  ".svm", ".csv")), val_name = paste0("lgbm_val", ifelse(SVMLight, ".svm",
  ".csv")), test_name = paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")),
  init_score = ifelse(is.na(bias_train), NA, paste(train_name, ".weight", sep
  = "")), files_exist = FALSE, save_binary = FALSE,
  train_conf = "lgbm_train.conf", pred_conf = "lgbm_pred.conf",
  test_conf = "lgbm_test.conf", validation = ifelse(is.na(y_val), FALSE,
  TRUE), predictions = FALSE, predict_leaf_index = FALSE,
  output_preds = "lgbm_predict_result.txt",
  test_preds = "lgbm_predict_test.txt", verbose = TRUE,
  log_name = "lgbm_log.txt", full_quiet = FALSE, full_console = FALSE,
  importance = FALSE, output_model = "lgbm_model.txt", input_model = NA,
  num_threads = 2, histogram_pool_size = -1, is_sparse = TRUE,
  two_round = FALSE, application = "regression", learning_rate = 0.1,
  num_iterations = 10, early_stopping_rounds = NA, num_leaves = 127,
  min_data_in_leaf = 100, min_sum_hessian_in_leaf = 10, max_bin = 255,
  feature_fraction = 1, feature_fraction_seed = 2, bagging_fraction = 1,
  bagging_freq = 0, bagging_seed = 3, is_sigmoid = TRUE, sigmoid = 1,
  is_unbalance = FALSE, max_position = 20, label_gain = c(0, 1, 3, 7, 15,
  31, 63), metric = "l2", metric_freq = 1, is_training_metric = FALSE,
  ndcg_at = c(1, 2, 3, 4, 5), tree_learner = "serial",
  is_pre_partition = FALSE, data_random_seed = 1, num_machines = 1,
  local_listen_port = 12400, time_out = 120, machine_list_file = "")
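Several defaults in the signature above are computed from other arguments. The sketch below re-evaluates those default expressions by hand, in base R, for two hypothetical situations: a dense input without bias, and a sparse input with a single bias value and validation labels. resolve_defaults is an illustrative helper, not part of Laurae, and it takes scalar inputs to keep the output readable.

```r
# Illustrative helper (not part of Laurae): evaluates the default
# expressions shown in the Usage block for the given inputs.
resolve_defaults <- function(SVMLight = FALSE, bias_train = NA, y_val = NA) {
  ext <- ifelse(SVMLight, ".svm", ".csv")
  train_name <- paste0("lgbm_train", ext)
  list(
    train_name = train_name,
    val_name   = paste0("lgbm_val", ext),
    test_name  = paste0("lgbm_test", ext),
    # NA unless a bias value is supplied
    init_score = ifelse(is.na(bias_train), NA,
                        paste(train_name, ".weight", sep = "")),
    # FALSE unless validation labels are supplied
    validation = ifelse(is.na(y_val), FALSE, TRUE)
  )
}

dense  <- resolve_defaults()
sparse <- resolve_defaults(SVMLight = TRUE, bias_train = 0.5, y_val = 1)

str(dense)
str(sparse)
```

Passing any of these arguments explicitly to lgbm.train overrides the computed default.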

Arguments

y_train
Type: vector. The training labels.
x_train
Type: data.table (preferred), data.frame, or dgCMatrix (with SVMLight = TRUE). The training features. Not providing a data.frame results in at least 3x memory usage.
bias_train
Type: numeric or vector of numerics. The initial weights of the training data. If a numeric is provided, then the weights are identical for all the training samples. Otherwise, use the vector as weights. Defaults to NA.
y_val
Type: vector. The validation labels. Defaults to NA. Unused when validation is FALSE.
x_val
Type: data.table (preferred), data.frame, or dgCMatrix (with SVMLight = TRUE). The validation features. Defaults to NA. Unused when validation is FALSE.
x_test
Type: data.table (preferred), data.frame, or dgCMatrix (with SVMLight = TRUE). The testing features, if necessary.
SVMLight
Type: boolean. Whether the input is a dgCMatrix to be output to SVMLight format. Setting this to TRUE enforces you must provide labels separately (in y_train) and headers will be ignored. This is default behavior of SVMLight format. Defaults to is(x_train, "dgCMatrix").
data_has_label
Type: boolean. Whether the training and validation data have labels or not. Do not modify this. Defaults to TRUE.
NA_value
Type: numeric or character. The value that replaces NAs. Use "na" to mark values as missing. Using anything else is not recommended, not even a numeric value out of bounds (like -999 if all your values are greater than -999). Change from the default "na" only if missing values have a real numeric meaning. Defaults to "na".
lgbm_path
Type: character. The full path to the LightGBM executable, including the executable name and file extension. Defaults to 'path/to/LightGBM.exe'.
workingdir
Type: character. The working directory used for LightGBM. Defaults to getwd().
train_name
Type: character. The name of the training data file for the model. Defaults to paste0('lgbm_train', ifelse(SVMLight, '.svm', '.csv')).
val_name
Type: character. The name of the validation data file for the model. Defaults to paste0('lgbm_val', ifelse(SVMLight, '.svm', '.csv')).
test_name
Type: character. The name of the testing data file for the model. Defaults to paste0('lgbm_test', ifelse(SVMLight, '.svm', '.csv')).
init_score
Type: string. The file name of initial (bias) training scores to start training LightGBM, which contains bias_train values. Defaults to ifelse(is.na(bias_train), NA, paste(train_name, ".weight", sep = "")), which means NA if bias_train is left default, else appends ".weight" extension to train_name name.
files_exist
Type: boolean. Whether the training (and testing) files already exist. When FALSE, the files are written and any existing files with the same names are overwritten. Defaults to FALSE.
save_binary
Type: boolean. Whether data should be saved as binary files for faster loading. The binary file name is taken automatically from train_name, with the extension ".bin" added. Defaults to FALSE.
train_conf
Type: character. The name of the training configuration file for the model. Defaults to 'lgbm_train.conf'.
pred_conf
Type: character. The name of the prediction configuration file for the model. Defaults to 'lgbm_pred.conf'.
test_conf
Type: character. The name of the testing prediction configuration file for the model. Defaults to 'lgbm_test.conf'.
validation
Type: boolean. Whether LightGBM performs validation during the training, by outputting metrics for the validation data. Defaults to ifelse(is.na(y_val), FALSE, TRUE), which means if y_val is the default value (unfilled), validation is FALSE else TRUE. Multi-validation data is not supported yet.
predictions
Type: boolean. Should LightGBM compute predictions after training the model? Defaults to FALSE.
predict_leaf_index
Type: boolean. When predictions is TRUE, should LightGBM predict leaf indexes? Defaults to FALSE. Strongly recommended to keep it FALSE unless you know what you are doing.
output_preds
Type: character. The file name of the prediction results for the model. Defaults to 'lgbm_predict_result.txt'. Original name is output_result.
test_preds
Type: character. The file name of the prediction results for the testing data. Defaults to 'lgbm_predict_test.txt'.
verbose
Type: boolean/integer. Whether to print a lot of debug messages in the console or not. 0 is FALSE and 1 is TRUE. Defaults to TRUE. When set to FALSE, the model log is written to the file given by log_name, which is what makes the metric information retrievable.
log_name
Type: character. The logging (sink) file to output (like 'log.txt'). Defaults to 'lgbm_log.txt'.
full_quiet
Type: boolean. Whether file writing is quiet or not. When set to TRUE, the default printing is diverted to 'diverted_verbose.txt'. Combined with verbose = FALSE, the function is fully quiet. Defaults to FALSE.
full_console
Type: boolean. Whether a dedicated console should be visible. Defaults to FALSE.
importance
Type: boolean. Should LightGBM perform feature importance? Defaults to FALSE.
output_model
Type: character. The file name of output model. Defaults to 'lgbm_model.txt'.
input_model
Type: character. The file name of the input model. If defined, LightGBM will resume training from that file. You MUST use a different output_model file name if you define input_model. Otherwise, you overwrite your model (and if training stops immediately because the model cannot learn any further, you would LOSE your model). Defaults to NA.
num_threads
Type: integer. The number of threads to run for LightGBM. It is recommended not to set it higher than the number of physical cores in your computer. In virtualized environments, it can be better to set it to the maximum number of threads allocated to the virtual machine (especially VirtualBox). Defaults to 2.
histogram_pool_size
Type: integer. The maximum cache size (in MB) allocated for LightGBM histogram sketching. Values below 0 (like -1) mean no limit. Defaults to -1.
is_sparse
Type: boolean. Whether sparse optimization is enabled. Do not set this to FALSE unless you know what you are doing, as the model is likely to underperform. Defaults to TRUE.
two_round
Type: boolean. LightGBM maps the data file to memory and loads features from memory to maximize speed. If the data is too large to fit in memory, use TRUE. Defaults to FALSE.
application
Type: character. The label application to learn. Must be either 'regression', 'binary', or 'lambdarank'. Defaults to 'regression'.
learning_rate
Type: numeric. The shrinkage rate applied to each iteration. Lower values slow down overfitting, while higher values speed it up. Defaults to 0.1.
num_iterations
Type: integer. The number of boosting iterations LightGBM will perform. Defaults to 10.
early_stopping_rounds
Type: integer. The number of boosting iterations in which the validation metric fails to improve on the best value before LightGBM stops automatically. Defaults to NA.
num_leaves
Type: integer. The number of leaves in one tree. Roughly, a recommended value is 2^n - 1, n being the theoretical depth if each tree were identical (the default corresponds to n = 7). Lower values lower tree complexity, while higher values increase tree complexity. Defaults to 127.
min_data_in_leaf
Type: integer. Minimum number of data in one leaf. Higher values potentially decrease overfitting. Defaults to 100.
min_sum_hessian_in_leaf
Type: numeric. Minimum sum of hessians in one leaf to allow a split. Higher values potentially decrease overfitting. Defaults to 10.0.
max_bin
Type: integer. The maximum number of bins created per feature. Lower values potentially decrease overfitting. Defaults to 255.
feature_fraction
Type: numeric (0, 1). Column subsampling percentage. For instance, 0.5 means selecting 50% of features randomly for each iteration. Lower values potentially decrease overfitting, while training faster. Defaults to 1.0.
feature_fraction_seed
Type: integer. Random starting seed for the column subsampling (feature_fraction). Defaults to 2.
bagging_fraction
Type: numeric (0, 1). Row subsampling percentage. For instance, 0.5 means selecting 50% of rows randomly for each iteration. Lower values potentially decrease overfitting, while training faster. Defaults to 1.0. Unused when bagging_freq is 0.
bagging_freq
Type: integer. The frequency of row subsampling (bagging_fraction). Lower values potentially decrease overfitting, while training faster. Defaults to 0.
bagging_seed
Type: integer. Random starting seed for the row subsampling (bagging_fraction). Defaults to 3.
is_sigmoid
Type: boolean. Whether to use a sigmoid transformation of raw predictions. Defaults to TRUE.
sigmoid
Type: numeric. "The sigmoid parameter". Defaults to 1.0.
is_unbalance
Type: boolean. For binary classification, setting this to TRUE might be useful when the training data is unbalanced. Defaults to FALSE.
max_position
Type: integer. For lambdarank, optimize NDCG for that specific value. Defaults to 20.
label_gain
Type: vector of integers. For lambdarank, relevant gain for labels. Defaults to c(0, 1, 3, 7, 15, 31, 63).
metric
Type: character, or vector of characters. The metric to optimize. There are 6 available: 'l1' (absolute loss), 'l2' (squared loss), 'ndcg' (NDCG), 'auc' (AUC), 'binary_logloss' (logarithmic loss), and 'binary_error' (classification error rate, i.e. 1 - accuracy). Defaults to 'l2'. Use a vector of characters to pass multiple metrics.
metric_freq
Type: integer. The frequency to report the metric(s). Defaults to 1.
is_training_metric
Type: boolean. Whether to report the training metric in addition to the validation metric. Defaults to FALSE.
ndcg_at
Type: vector of integers. Evaluate NDCG metric at these values. Defaults to c(1, 2, 3, 4, 5).
tree_learner
Type: character. The type of learner to use: 'serial' (single machine tree learner), 'feature' (feature parallel tree learner), or 'data' (data parallel tree learner). Defaults to 'serial'.
is_pre_partition
Type: boolean. Whether data is pre-partitioned for parallel learning. Defaults to FALSE.
data_random_seed
Type: integer. Random starting seed for the parallel learner. Defaults to 1.
num_machines
Type: integer. When using parallel learning, the number of machines to use. Defaults to 1.
local_listen_port
Type: integer. The TCP listening port for the local machines. Allow this port in the firewall before training. Defaults to 12400.
time_out
Type: integer. The socket time-out in minutes. Defaults to 120.
machine_list_file
Type: character. The file that contains the machine list for parallel learning. Each line in that file must correspond to one IP and one port for one machine, separated by a space instead of a colon (:). Defaults to ''.
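Some of the arguments above only accept a fixed set of values, and catching a typo before LightGBM is launched saves a round trip through the executable. The following is a minimal sketch in base R. check_lgbm_params and leaves_for_depth are hypothetical helpers, not part of Laurae. The allowed values are copied from the argument descriptions, and the (0, 1] range for the two fractions is an assumption based on their default of 1.

```r
# Illustrative helpers (not part of Laurae).

# Rule of thumb for num_leaves: 2^n - 1 for a theoretical depth n.
# For n = 7 this gives 127, the default value of num_leaves.
leaves_for_depth <- function(depth) 2^depth - 1

# Values documented in the argument list
allowed <- list(
  application  = c("regression", "binary", "lambdarank"),
  metric       = c("l1", "l2", "ndcg", "auc", "binary_logloss", "binary_error"),
  tree_learner = c("serial", "feature", "data")
)

# Returns a character vector of problems; empty when nothing is wrong.
check_lgbm_params <- function(params) {
  problems <- character(0)
  for (name in names(allowed)) {
    value <- params[[name]]
    if (!is.null(value) && !all(value %in% allowed[[name]])) {
      bad_values <- paste(setdiff(value, allowed[[name]]), collapse = "', '")
      problems <- c(problems,
                    sprintf("%s: unsupported value '%s'", name, bad_values))
    }
  }
  # Assumed valid range for the subsampling fractions: (0, 1]
  for (name in c("feature_fraction", "bagging_fraction")) {
    value <- params[[name]]
    if (!is.null(value) && (value <= 0 || value > 1)) {
      problems <- c(problems, sprintf("%s: must be in (0, 1]", name))
    }
  }
  problems
}

good <- list(application = "binary", metric = c("auc", "binary_logloss"),
             feature_fraction = 0.8, num_leaves = leaves_for_depth(7))
bad  <- list(application = "multiclass", bagging_fraction = 1.5)

print(check_lgbm_params(good))
print(check_lgbm_params(bad))
```

The first call reports nothing, while the second reports one problem for application and one for bagging_fraction.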

Value

A list with the following elements: the stored trained model (Model); the path (Path) of the trained model; the name (Name) of the trained model file; the LightGBM path (lgbm) which trained the model; the training file name (Train); the validation file name (Valid) and the testing file name (Test), both returned even if no such data was provided; the validation predictions (Validation) if predictions is set to TRUE with a validation set; the testing predictions (Testing) if predictions is set to TRUE with a testing set; the name of the log file (Log), the log file content (LogContent), the metrics (Metrics), and the best iteration (Best) if verbose is set to FALSE; and the column names (Columns) and the feature importance (FeatureImp) if importance is set to TRUE. Returns a character variable instead if LightGBM is not found under lgbm_path.
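Because the return value is either a list or a character variable, code that consumes it should check the type first. The sketch below shows one way to do that in base R. summarise_lgbm_result is a hypothetical helper, the two inputs are hand-built stand-ins with placeholder values rather than real lgbm.train output, and joining Path and Name into a file path is an assumption about how the two elements relate.

```r
# Illustrative helper (not part of Laurae): inspects whatever lgbm.train
# returned and reports what is available.
summarise_lgbm_result <- function(result) {
  if (is.character(result)) {
    # Documented failure mode: LightGBM was not found under lgbm_path
    return(paste("Training did not run:", result))
  }
  # Elements that are only present for certain argument settings
  optional <- c("Validation", "Testing", "Log", "LogContent", "Metrics",
                "Best", "Columns", "FeatureImp")
  present <- intersect(optional, names(result))
  paste0("Model file: ", file.path(result$Path, result$Name),
         "; optional elements present: ",
         if (length(present) > 0) paste(present, collapse = ", ") else "none")
}

# Hand-built stand-ins for the two documented return shapes
mock_ok <- list(Model = "placeholder", Path = "C:/LightGBM/temp",
                Name = "lgbm_model.txt", Metrics = data.frame(), Best = 1)
mock_missing <- "placeholder message"  # the real message text is not documented here

print(summarise_lgbm_result(mock_ok))
print(summarise_lgbm_result(mock_missing))
```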

Details

The most important parameters are lgbm_path and workingdir: they set up where LightGBM is and where temporary files are going to be stored. lgbm_path is the full path to the LightGBM executable, including the executable name and file extension (like C:/Laurae/LightGBM/windows/x64/Release/LightGBM.exe). workingdir is the working directory for LightGBM's temporary files. The function creates many files there that LightGBM needs (defined by output_model, output_preds, train_conf, train_name, val_name, pred_conf). train_conf, train_name, and val_name define respectively the configuration file name, the train file name, and the validation file name. The files are created under these names when files_exist is set to FALSE.

Once these variables are filled in appropriately, provide y_train and x_train. If you need model validation, also provide y_val and x_val. y is your label (a vector), while x is your data.table (preferred), a data.frame, or a matrix. Everything else, from hyperparameters to verbosity control, is up to you.

To get the metric table, you MUST use verbose = FALSE. It cannot be fetched otherwise: using sink() does not work. If for some reason you lose the ability to print in the console, run sink() in the console several times until R reports that there is no sink left to remove.
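The setup described above can be prepared and checked before calling lgbm.train. The following is a minimal sketch in base R. The executable path is the example path from this section, the working directory is a temporary folder chosen for illustration, and the file names are the defaults from the Usage block for a non-sparse input.

```r
# Hypothetical locations; replace with your own.
lgbm_path  <- "C:/Laurae/LightGBM/windows/x64/Release/LightGBM.exe"
workingdir <- file.path(tempdir(), "lgbm_work")

# Create the working directory if it is not there yet
dir.create(workingdir, showWarnings = FALSE, recursive = TRUE)

# lgbm_path must point at the executable itself, not at its folder
if (!file.exists(lgbm_path)) {
  message("LightGBM executable not found at: ", lgbm_path)
}

# Files that a default call would place in workingdir
expected_files <- file.path(
  workingdir,
  c("lgbm_train.csv", "lgbm_val.csv", "lgbm_train.conf",
    "lgbm_pred.conf", "lgbm_model.txt", "lgbm_predict_result.txt")
)
print(basename(expected_files))
```

Using a dedicated working directory keeps these files apart from the rest of a project and makes them easy to delete after training.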

Examples

## Not run: ------------------------------------
# # Simple LightGBM model.
# 
# library(Laurae)
# library(stringi)
# library(Matrix)
# library(sparsity)
# library(data.table)
# 
# remove(list = ls()) # WARNING: CLEANS EVERYTHING IN THE ENVIRONMENT
# setwd("C:/LightGBM/temp") # DIRECTORY FOR TEMP FILES
# 
# DT <- data.table(Split1 = c(rep(0, 50), rep(1, 50)),
#                  Split2 = rep(c(rep(0, 25), rep(0.5, 25)), 2))
# DT$Split3 <- rep(c(rep(0, 10), rep(0.25, 15)), 4)
# DT$Split4 <- rep(c(rep(0, 5), rep(0.1, 5), rep(0, 5), rep(0.1, 10)), 4)
# DT$Split5 <- rep(c(rep(0, 5), rep(0.05, 5), rep(0, 10), rep(0.05, 5)), 4)
# label <- as.numeric((DT$Split2 == 0) & (DT$Split1 == 0) & (DT$Split3 == 0))
# 
# trained <- lgbm.train(y_train = label,
#                       x_train = DT,
#                       bias_train = NA,
#                       application = "binary",
#                       num_iterations = 1,
#                       early_stopping_rounds = 1,
#                       learning_rate = 5,
#                       num_leaves = 16,
#                       min_data_in_leaf = 1,
#                       min_sum_hessian_in_leaf = 1,
#                       tree_learner = "serial",
#                       num_threads = 1,
#                       lgbm_path = "C:/LightGBM/windows/x64/Release/lightgbm.exe",
#                       workingdir = getwd(),
#                       validation = FALSE,
#                       files_exist = FALSE,
#                       verbose = TRUE,
#                       is_training_metric = TRUE,
#                       save_binary = TRUE,
#                       metric = "binary_logloss")
## ---------------------------------------------
