install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table")
.
The speed increase to create the train and test files can exceed 1,000x over write.table in certain cases.
To store evaluation metrics throughout the training, you MUST run this function with verbose = FALSE
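A minimal sketch of that workflow (not part of the original page; it assumes the Laurae package is installed, a toy dataset, and a LightGBM executable compiled at the path shown):

library(Laurae)
library(data.table)

DT <- data.table(x1 = rnorm(100), x2 = rnorm(100))  # toy features
label <- as.numeric(DT$x1 + DT$x2 > 0)              # toy labels

trained <- lgbm.train(y_train = label,
                      x_train = DT,
                      application = "binary",
                      metric = "binary_logloss",
                      is_training_metric = TRUE,
                      lgbm_path = "C:/LightGBM/windows/x64/Release/lightgbm.exe",
                      workingdir = getwd(),
                      verbose = FALSE)  # MUST be FALSE to store evaluation metrics

trained$Metrics  # metric table parsed back from the log (see Value below)
trained$Best     # best iteration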
Usage

lgbm.train(y_train, x_train, bias_train = NA, y_val = NA, x_val = NA,
x_test = NA, SVMLight = is(x_train, "dgCMatrix"), data_has_label = TRUE,
NA_value = "na", lgbm_path = "path/to/LightGBM.exe",
workingdir = getwd(), train_name = paste0("lgbm_train", ifelse(SVMLight,
".svm", ".csv")), val_name = paste0("lgbm_val", ifelse(SVMLight, ".svm",
".csv")), test_name = paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")),
init_score = ifelse(is.na(bias_train), NA, paste(train_name, ".weight", sep
= "")), files_exist = FALSE, save_binary = FALSE,
train_conf = "lgbm_train.conf", pred_conf = "lgbm_pred.conf",
test_conf = "lgbm_test.conf", validation = ifelse(is.na(y_val), FALSE,
TRUE), predictions = FALSE, predict_leaf_index = FALSE,
output_preds = "lgbm_predict_result.txt",
test_preds = "lgbm_predict_test.txt", verbose = TRUE,
log_name = "lgbm_log.txt", full_quiet = FALSE, full_console = FALSE,
importance = FALSE, output_model = "lgbm_model.txt", input_model = NA,
num_threads = 2, histogram_pool_size = -1, is_sparse = TRUE,
two_round = FALSE, application = "regression", learning_rate = 0.1,
num_iterations = 10, early_stopping_rounds = NA, num_leaves = 127,
min_data_in_leaf = 100, min_sum_hessian_in_leaf = 10, max_bin = 255,
feature_fraction = 1, feature_fraction_seed = 2, bagging_fraction = 1,
bagging_freq = 0, bagging_seed = 3, is_sigmoid = TRUE, sigmoid = 1,
is_unbalance = FALSE, max_position = 20, label_gain = c(0, 1, 3, 7, 15,
31, 63), metric = "l2", metric_freq = 1, is_training_metric = FALSE,
ndcg_at = c(1, 2, 3, 4, 5), tree_learner = "serial",
is_pre_partition = FALSE, data_random_seed = 1, num_machines = 1,
local_listen_port = 12400, time_out = 120, machine_list_file = "")
Arguments

y_train: The training labels, as a vector.

x_train: The training features, as a data.table (preferred), data.frame, or dgCMatrix (with SVMLight = TRUE; see the sketch after this list). Not providing a data.frame results in at least 3x memory usage.

bias_train: The initial (bias) scores for the training data, written to the init_score file. Defaults to NA.

y_val: The validation labels. Defaults to NA. Unused when validation is FALSE.

x_val: The validation features, in the same formats as x_train (a dgCMatrix requires SVMLight = TRUE). Defaults to NA. Unused when validation is FALSE.

x_test: The testing features, if necessary, in the same formats as x_train.

SVMLight: Whether the data is in SVMLight format. TRUE enforces that you provide labels separately (in y_train) and that headers are ignored, which is the default behavior of the SVMLight format. Defaults to is(x_train, "dgCMatrix").

data_has_label: Whether the written data files contain the label. Defaults to TRUE.

NA_value: The value used to replace missing values. Keep "na" if you want to specify "missing". It is not recommended to use something else, even a numeric value out of bounds (like -999 if all your values are greater than -999). You should change the default "na" only if your missing values have a real numeric meaning. Defaults to "na".

lgbm_path: The full path to the LightGBM executable, including the executable name and file extension. Defaults to 'path/to/LightGBM.exe'.

workingdir: The working directory where the LightGBM temporary files are stored. Defaults to getwd().

train_name: The name of the training data file. Defaults to paste0('lgbm_train', ifelse(SVMLight, '.svm', '.csv')).

val_name: The name of the validation data file. Defaults to paste0('lgbm_val', ifelse(SVMLight, '.svm', '.csv')).

test_name: The name of the testing data file. Defaults to paste0('lgbm_test', ifelse(SVMLight, '.svm', '.csv')).

init_score: The name of the file holding the bias_train values. Defaults to ifelse(is.na(bias_train), NA, paste(train_name, ".weight", sep = "")), which means NA if bias_train is left default, else the ".weight" extension appended to the train_name name.

files_exist: Whether the data files already exist in workingdir and should be reused instead of rewritten. Defaults to FALSE.

save_binary: Whether the data should also be saved as a binary file for faster loading; the binary file name takes train_name and adds the extension ".bin". Defaults to FALSE.

train_conf: The name of the training configuration file. Defaults to 'lgbm_train.conf'.

pred_conf: The name of the validation prediction configuration file. Defaults to 'lgbm_pred.conf'.

test_conf: The name of the testing prediction configuration file. Defaults to 'lgbm_test.conf'.

validation: Whether validation is performed. Defaults to ifelse(is.na(y_val), FALSE, TRUE), which means validation is FALSE if y_val is left at its default (unfilled), else TRUE. Multi-validation data is not supported yet.

predictions: Whether predictions on the validation and testing sets should be made and returned. Defaults to FALSE.

predict_leaf_index: When predictions is TRUE, should LightGBM predict leaf indexes? Largely recommended to keep it FALSE unless you know what you are doing. Defaults to FALSE.

output_preds: The file name of the validation predictions. The original LightGBM name is output_result. Defaults to 'lgbm_predict_result.txt'.

test_preds: The file name of the testing predictions. Defaults to 'lgbm_predict_test.txt'.

verbose: Whether training output is printed to the console. When set to FALSE, the model log is written to log_name, which is what allows metric information to be fetched from the log. Defaults to TRUE.

log_name: The file name of the training log when verbose = FALSE. Defaults to 'lgbm_log.txt'.

full_quiet: When TRUE, the default printing is diverted to 'diverted_verbose.txt'. Combined with verbose = FALSE, the function is fully quiet. Defaults to FALSE.

full_console: Whether a dedicated console should be visible for LightGBM. Defaults to FALSE.

importance: Whether feature importance should be computed. Defaults to FALSE.

output_model: The file name of the output model. Defaults to 'lgbm_model.txt'.

input_model: The file name of the input model to continue training from. You MUST use a different output_model file name if you define input_model; otherwise, you are overwriting your model (and if your model cannot learn by stopping immediately at the beginning, you would LOSE your model). Defaults to NA.

num_threads: The number of threads LightGBM uses. Defaults to 2. In virtualized environments, it can be better to set it to the maximum amount of threads allocated to the virtual machine (especially VirtualBox).

histogram_pool_size: The maximum cache size, in MB, for the historical histogram. Values at or below 0 (like -1) mean no limit. Defaults to -1.

is_sparse: Whether sparse optimization is enabled. Do not set this to FALSE unless you want to see your model underperform, or you know what you are going to do. Defaults to TRUE.

two_round: Whether to load the data in two rounds, which is slower but uses less memory. Defaults to FALSE.

application: The objective to learn: 'regression', 'binary', or 'lambdarank'. Defaults to 'regression'.

learning_rate: The shrinkage rate applied at each boosting iteration. Defaults to 0.1.

num_iterations: The number of boosting iterations. Defaults to 10.

early_stopping_rounds: The number of iterations without improvement of the validation metric before training stops early. Defaults to NA.

num_leaves: The number of leaves in one tree. Roughly, a recommended value is 2^n - 1, n being the theoretical depth if each tree were identical. Lower values lower tree complexity, while higher values increase it. Defaults to 127.

min_data_in_leaf: The minimum number of observations per leaf. Defaults to 100.

min_sum_hessian_in_leaf: The minimum sum of hessians per leaf. Defaults to 10.0.

max_bin: The maximum number of bins used to discretize features. Defaults to 255.

feature_fraction: The fraction of features sampled at each iteration. Defaults to 1.0.

feature_fraction_seed: The random seed for column subsampling (feature_fraction). Defaults to 2.

bagging_fraction: The fraction of rows sampled for bagging. Defaults to 1.0. Unused when bagging_freq is 0.

bagging_freq: The frequency of row subsampling (bagging_fraction). Lower values potentially decrease overfitting while training faster. Defaults to 0.

bagging_seed: The random seed for row subsampling (bagging_fraction). Defaults to 3.

is_sigmoid: Whether to apply a sigmoid transformation to raw predictions. Defaults to TRUE.

sigmoid: The parameter of the sigmoid transformation. Defaults to 1.0.

is_unbalance: For binary classification, set to TRUE if the training data is unbalanced. Defaults to FALSE.

max_position: For lambdarank, the NDCG position to optimize. Defaults to 20.

label_gain: For lambdarank, the gain assigned to each relevance label. Defaults to c(0, 1, 3, 7, 15, 31, 63).

metric: The metric(s) to optimize: 'l1' (absolute loss), 'l2' (squared loss), 'ndcg' (NDCG), 'auc' (AUC), 'binary_logloss' (logarithmic loss), or 'binary_error' (accuracy). Use a vector of characters to pass multiple metrics. Defaults to 'l2'.

metric_freq: The frequency, in iterations, at which metrics are reported. Defaults to 1.

is_training_metric: Whether the metric is also reported on the training data. Defaults to FALSE.

ndcg_at: The positions at which NDCG is evaluated. Defaults to c(1, 2, 3, 4, 5).

tree_learner: The type of tree learner: 'serial' (single machine tree learner), 'feature' (feature parallel tree learner), or 'data' (data parallel tree learner). Defaults to 'serial'.

is_pre_partition: For parallel learning, whether the data is already partitioned across machines. Defaults to FALSE.

data_random_seed: The random seed for data partitioning in parallel learning. Defaults to 1.

num_machines: The number of machines used for parallel learning. Defaults to 1.

local_listen_port: The TCP port the local machine listens on for parallel learning. Defaults to 12400.

time_out: The socket time-out, in minutes, for parallel learning. Defaults to 120.

machine_list_file: The file listing the machines for parallel learning, one per line with IP and port separated by :. Defaults to ''.
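The sketch referenced from x_train above (a hedged illustration, not from the original page; it assumes only the Matrix package and the documented defaults):

library(Matrix)

x_sparse <- sparseMatrix(i = 1:3, j = 1:3, x = c(1, 0.5, 0.25),
                         dims = c(100, 3))  # a sparse dgCMatrix
is(x_sparse, "dgCMatrix")                   # TRUE

# Passed as x_train, SVMLight therefore defaults to TRUE, so the temporary
# training file is written as "lgbm_train.svm" rather than "lgbm_train.csv"
# (see the train_name default), and labels must be supplied separately
# through y_train.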
Value

A list with the trained model (Model), the path (Path) of the trained model, the name (Name) of the trained model file, the LightGBM path (lgbm) which trained the model, the training file name (Train), the validation file name even if there were none provided (Valid), the testing file name even if there were none provided (Test), the validation predictions (Validation) if predictions is set to TRUE with a validation set, the testing predictions (Testing) if predictions is set to TRUE with a testing set, the name of the log file (Log) if verbose is set to FALSE, the log file content (LogContent) if verbose is set to FALSE, the metrics (Metrics) if verbose is set to FALSE, the best iteration (Best) if verbose is set to FALSE, the column names (Columns) if importance is set to TRUE, and the feature importance (FeatureImp) if importance is set to TRUE. Returns a character variable if LightGBM is not found under lgbm_path.

Details

The most important parameters are lgbm_path and workingdir: they set up where LightGBM is and where the temporary files are going to be stored. lgbm_path is the full path to the LightGBM executable, including the executable name and file extension (like C:/Laurae/LightGBM/windows/x64/Release/LightGBM.exe). workingdir is the working directory for the temporary LightGBM files: the function creates the files needed to make LightGBM work (defined by output_model, output_preds, train_conf, train_name, val_name, pred_conf). train_conf, train_name, and val_name define respectively the configuration file name, the train file name, and the validation file name. They are created under these names when files_exist is set to FALSE.

Once you have filled these variables (and if they are appropriate), fill y_train and x_train. If you need model validation, also fill y_val and x_val. y is your label (a vector), while x is your data.table (preferred), data.frame, or matrix. You are then free to choose everything else, from hyperparameters to verbosity control.

To get the metric table, you MUST use verbose = FALSE; it cannot be fetched otherwise, and sink() does not work. If for some reason you lose the ability to print to the console, run sink() in the console several times until you get an error.
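That last piece of advice can be automated; this one-liner is a small base-R sketch (sink.number() is base R, not part of this package):

while (sink.number() > 0) sink()  # close every open diversion until none remain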
Examples

## Not run: ------------------------------------
# # Simple LightGBM model.
#
# library(Laurae)
# library(stringi)
# library(Matrix)
# library(sparsity)
# library(data.table)
#
# remove(list = ls()) # WARNING: CLEANS EVERYTHING IN THE ENVIRONMENT
# setwd("C:/LightGBM/temp") # DIRECTORY FOR TEMP FILES
#
# DT <- data.table(Split1 = c(rep(0, 50), rep(1, 50)),
# Split2 = rep(c(rep(0, 25), rep(0.5, 25)), 2))
# DT$Split3 <- rep(c(rep(0, 10), rep(0.25, 15)), 4)
# DT$Split4 <- rep(c(rep(0, 5), rep(0.1, 5), rep(0, 5), rep(0.1, 10)), 4)
# DT$Split5 <- rep(c(rep(0, 5), rep(0.05, 5), rep(0, 10), rep(0.05, 5)), 4)
# label <- as.numeric((DT$Split2 == 0) & (DT$Split1 == 0) & (DT$Split3 == 0))
#
# trained <- lgbm.train(y_train = label,
# x_train = DT,
# bias_train = NA,
# application = "binary",
# num_iterations = 1,
# early_stopping_rounds = 1,
# learning_rate = 5,
# num_leaves = 16,
# min_data_in_leaf = 1,
# min_sum_hessian_in_leaf = 1,
# tree_learner = "serial",
# num_threads = 1,
# lgbm_path = "C:/LightGBM/windows/x64/Release/lightgbm.exe",
# workingdir = getwd(),
# validation = FALSE,
# files_exist = FALSE,
# verbose = TRUE,
# is_training_metric = TRUE,
# save_binary = TRUE,
# metric = "binary_logloss")
## ---------------------------------------------