install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table")
.
The speed increase to create the train and test files can exceed 1,000x over write.table in certain cases.
To store evaluation metrics throughout the training, you MUST run this function with verbose = FALSE
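If you are unsure whether your installed data.table already provides the fast file writer, a quick check such as the following may avoid an unnecessary source install. This is a minimal sketch, not part of the package; the assumption that fwrite is the relevant function is mine, since the documentation only compares against write.table.

# Sketch: reinstall data.table from the Rdatatable repository only if fwrite is missing (assumption).
has_fwrite <- requireNamespace("data.table", quietly = TRUE) &&
  exists("fwrite", where = asNamespace("data.table"))
if (!has_fwrite) {
  install.packages("data.table", type = "source",
                   repos = "http://Rdatatable.github.io/data.table")
}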
Usage

lgbm.cv(y_train, x_train, bias_train = NA, x_test = NA,
SVMLight = is(x_train, "dgCMatrix"), data_has_label = TRUE,
NA_value = "nan", lgbm_path = "path/to/LightGBM.exe",
workingdir = getwd(), train_name = paste0("lgbm_train", ifelse(SVMLight,
".svm", ".csv")), val_name = paste0("lgbm_val", ifelse(SVMLight, ".svm",
".csv")), test_name = paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")),
init_score = ifelse(is.na(bias_train), NA, paste(train_name, ".weight", sep
= "")), files_exist = FALSE, save_binary = FALSE,
train_conf = "lgbm_train.conf", pred_conf = "lgbm_pred.conf",
test_conf = "lgbm_test.conf", validation = TRUE, unicity = FALSE,
folds = 5, folds_weight = NA, stratified = TRUE, fold_seed = 0,
fold_cleaning = 50, predictions = TRUE, predict_leaf_index = FALSE,
separate_val = TRUE, separate_tests = TRUE,
output_preds = "lgbm_predict.txt", test_preds = "lgbm_predict_test.txt",
verbose = TRUE, log_name = "lgbm_log.txt", full_quiet = FALSE,
full_console = FALSE, importance = FALSE,
output_model = "lgbm_model.txt", input_model = NA, num_threads = 2,
histogram_pool_size = -1, is_sparse = TRUE, two_round = FALSE,
application = "regression", learning_rate = 0.1, num_iterations = 10,
early_stopping_rounds = NA, num_leaves = 127, min_data_in_leaf = 100,
min_sum_hessian_in_leaf = 10, max_bin = 255, feature_fraction = 1,
feature_fraction_seed = 2, bagging_fraction = 1, bagging_freq = 0,
bagging_seed = 3, is_sigmoid = TRUE, sigmoid = 1,
is_unbalance = FALSE, max_position = 20, label_gain = c(0, 1, 3, 7, 15,
31, 63), metric = "l2", metric_freq = 1, is_training_metric = FALSE,
ndcg_at = c(1, 2, 3, 4, 5), tree_learner = "serial",
is_pre_partition = FALSE, data_random_seed = 1, num_machines = 1,
local_listen_port = 12400, time_out = 120, machine_list_file = "")
Arguments

y_train: The training labels (a vector).
x_train: The training features: a data.table (preferred), data.frame, matrix, or dgCMatrix (with SVMLight = TRUE). Not providing a data.frame or a matrix results in at least 3x memory usage.
bias_train: The initial (bias) scores for the training data, written to the init_score file. Defaults to NA.
x_test: The testing features, if necessary, in the same formats as x_train (with SVMLight = TRUE for dgCMatrix). Not providing a data.frame or a matrix results in at least 3x memory usage. Defaults to NA. Predictions are averaged. Must be unlabeled.
SVMLight: Setting this to TRUE enforces that you must provide labels separately (in y_train) and that headers will be ignored, which is the default behavior of the SVMLight format. Defaults to is(x_train, "dgCMatrix").
data_has_label: Whether the data has labels. Defaults to TRUE.
NA_value: The value used to represent missing values. Use "na" if you want to specify "missing". It is not recommended to use something else, even something like a numeric value out of bounds (like -999 if all your values are greater than -999). You should change from the default "na" if the missing values have a real numeric meaning. Defaults to "na".
lgbm_path: The full path to the LightGBM executable, including the executable name and file extension (see Details). Defaults to 'path/to/LightGBM.exe'.
workingdir: The working directory for the temporary files LightGBM needs (see Details). Defaults to getwd().
train_name: The name of the training data file written for LightGBM. Defaults to paste0('lgbm_train', ifelse(SVMLight, '.svm', '.csv')).
val_name: The name of the validation data file written for LightGBM. Defaults to paste0('lgbm_val', ifelse(SVMLight, '.svm', '.csv')).
test_name: The name of the testing data file written for LightGBM. Defaults to paste0('lgbm_test', ifelse(SVMLight, '.svm', '.csv')).
init_score: The name of the file containing the bias_train values. Defaults to ifelse(is.na(bias_train), NA, paste(train_name, ".weight", sep = "")), which means NA if bias_train is left at its default, else the ".weight" extension is appended to the train_name name.
files_exist: Whether the data files already exist in workingdir; when FALSE, they are (re)created (see Details). Defaults to FALSE.
save_binary: Whether the data should also be saved as a LightGBM binary file; the name is taken automatically from train_name with the extension ".bin" added. Defaults to FALSE.
train_conf: The name of the training configuration file (see Details). Defaults to 'lgbm_train.conf'.
pred_conf: The name of the prediction configuration file. Defaults to 'lgbm_pred.conf'.
test_conf: The name of the testing prediction configuration file. Defaults to 'lgbm_test.conf'.
validation: Whether validation is performed during training. Defaults to TRUE. Multi-validation data is not supported yet.
unicity: Whether to create separate files (if TRUE) or to write over the same file (if FALSE); see Details. Defaults to FALSE.
folds: The fold specification. If a single integer is supplied, performs a folds-fold cross-validation. If a vector of two integers is supplied, performs a folds[1]-fold cross-validation repeated folds[2] times. If a vector of integers (longer than 2) is provided, each integer value should refer to the fold of the corresponding observation, and the vector must have the same length as the training data. Otherwise (if a list is provided), each element of the list must refer to a fold, and they will be treated sequentially. Defaults to 5. See the sketch after this argument list.
folds_weight: The weight given to each fold when averaging predictions. When left to the default (NA), the weights are automatically set to rep(1/length(folds)) for an average (this does not mix well with folds of different sizes). When the folds are automatically created by supplying folds a vector of two integers, the weights are automatically computed. Defaults to NA.
stratified: Whether the folds should be stratified. Defaults to TRUE.
fold_seed: The seed used to create the folds. If a vector of seeds is supplied, repeat n uses the n-th seed. Otherwise (if an integer is supplied), it starts each fold with the provided seed, and adds 1 to the seed for every repeat. Defaults to 0.
fold_cleaning: Defaults to 50.
predictions: Whether the folds' predictions should be returned (see Value). Defaults to TRUE.
predict_leaf_index: When predictions is TRUE, should LightGBM predict leaf indexes? Defaults to FALSE. It is nearly mandatory to keep it FALSE unless you know what you are doing, as you should then use separate_folds to avoid mixing nonsensical predictions.
separate_val: Defaults to TRUE.
separate_tests: Defaults to TRUE.
output_preds: The name of the prediction output file. Defaults to 'lgbm_predict.txt'. The original LightGBM name is output_result.
test_preds: The name of the testing prediction output file. Defaults to 'lgbm_predict_test.txt'.
verbose: Whether LightGBM's output is printed to the console. Defaults to TRUE. When set to FALSE, the model log is written to log_name, which is what allows metric information to be retrieved from the log_name file.
log_name: The name of the log file. Defaults to 'lgbm_log.txt'.
full_quiet: When TRUE, the default printing is diverted to 'diverted_verbose.txt'. Combined with verbose = FALSE, the function is fully quiet. Defaults to FALSE.
full_console: Defaults to FALSE.
importance: Whether aggregated feature importance should be computed and returned (see Value). Defaults to FALSE.
output_model: The name of the file the model is saved to. Defaults to 'lgbm_model.txt'.
input_model: A model file to resume training from. Change the output_model file name if you define input_model; otherwise you are overwriting your model (and if your model cannot learn by stopping immediately at the beginning, you would LOSE your model). If defined, LightGBM will resume training from that file. Defaults to NA. Unused yet.
num_threads: The number of threads to run LightGBM with. Defaults to 2. In virtualized environments, it can be better to set it to the maximum amount of threads allocated to the virtual machine (especially VirtualBox).
histogram_pool_size: The memory pool size for histograms, in MB. A value at or below 0 (like -1) means no limit. Defaults to -1.
is_sparse: Whether sparse optimization is enabled. Do not set this to FALSE unless you want to see your model underperform or you know what you are going to do. Defaults to TRUE.
two_round: Defaults to FALSE.
application: The objective to train: 'regression', 'binary', or 'lambdarank'. Defaults to 'regression'.
learning_rate: The shrinkage applied at each iteration. Defaults to 0.1.
num_iterations: The number of boosting iterations. Defaults to 10.
early_stopping_rounds: The number of iterations without improvement before training stops early. Defaults to NA.
num_leaves: The number of leaves in one tree, roughly 2^n - 1 with n being the theoretical depth if each tree were identical. Lower values lower tree complexity, while higher values increase tree complexity. Defaults to 127.
min_data_in_leaf: The minimum number of observations per leaf. Defaults to 100.
min_sum_hessian_in_leaf: The minimum sum of hessians per leaf. Defaults to 10.0.
max_bin: The maximum number of bins used to bucket feature values. Defaults to 255.
feature_fraction: The fraction of features sampled at each iteration. Defaults to 1.0.
feature_fraction_seed: The random seed for feature sampling (related to feature_fraction). Defaults to 2.
bagging_fraction: The fraction of observations sampled for bagging. Defaults to 1.0. Unused when bagging_freq is 0.
bagging_freq: The frequency of bagging (related to bagging_fraction). Lower values potentially decrease overfitting, while training faster. Defaults to 0.
bagging_seed: The random seed for bagging (related to bagging_fraction). Defaults to 3.
is_sigmoid: Defaults to TRUE.
sigmoid: The sigmoid parameter. Defaults to 1.0.
is_unbalance: Whether the (binary) training data is unbalanced. Defaults to FALSE.
max_position: The position at which NDCG is optimized (lambdarank). Defaults to 20.
label_gain: The gain attributed to each label (lambdarank). Defaults to c(0, 1, 3, 7, 15, 31, 63).
metric: The evaluation metric(s): 'l1' (absolute loss), 'l2' (squared loss), 'ndcg' (NDCG), 'auc' (AUC), 'binary_logloss' (logarithmic loss), and 'binary_error' (accuracy). Defaults to 'l2'. Use a vector of characters to pass multiple metrics.
metric_freq: The frequency (in iterations) at which metrics are output. Defaults to 1.
is_training_metric: Whether the metric is also computed on the training data. Defaults to FALSE.
ndcg_at: The positions at which NDCG is evaluated. Defaults to c(1, 2, 3, 4, 5).
tree_learner: The tree learner: 'serial' (single machine tree learner), 'feature' (feature parallel tree learner), or 'data' (data parallel tree learner). Defaults to 'serial'.
is_pre_partition: Defaults to FALSE.
data_random_seed: The random seed used for data partitioning (parallel learning). Defaults to 1.
num_machines: The number of machines used for parallel learning. Defaults to 1.
local_listen_port: The TCP port to listen on (parallel learning). Defaults to 12400.
time_out: The socket time-out, in minutes (parallel learning). Defaults to 120.
machine_list_file: The file listing the machines used for parallel learning (entries separated by :). Defaults to ''.
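To make the folds specifications concrete, here is a minimal sketch of the four accepted forms described above. The values are illustrative, and the assumption that each list element holds the held-out row indices of one fold is mine, not taken from the package.

# Sketch: four ways to pass the folds argument (illustrative values only).
n <- 100                                     # assume 100 training rows

folds_simple   <- 5                          # plain 5-fold cross-validation
folds_repeated <- c(5, 3)                    # 5-fold cross-validation repeated 3 times
folds_by_row   <- rep(1:5, length.out = n)   # per-row fold assignment, same length as the data

set.seed(1)
folds_manual <- split(sample(n), rep(1:5, length.out = n))  # list: one element of row indices per fold

# Any of these can then be supplied as folds = ... in lgbm.cv().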
Value

Cross-validated predictions (Validation) are returned if predictions is set to TRUE, weighted averaged testing predictions (Testing) are returned if predictions is set to TRUE and a testing set was supplied, and the fold weights (Weights) are returned if predictions is set to TRUE. Also, aggregated feature importance is provided if importance is set to TRUE.
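Assuming the fitted object is stored as trained (as in the example below) and that the returned list uses the names above, a quick inspection could look like this sketch; the exact element names are an assumption, so check str(trained) first.

# Sketch: inspect the cross-validation output (element names are assumed, verify with str()).
str(trained, max.level = 1)
head(trained$Validation)   # cross-validated predictions, if predictions = TRUE
head(trained$Testing)      # weighted averaged testing predictions, if x_test was supplied
trained$Weights            # fold weights used for the averaging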
Details

The most important variables are lgbm_path and workingdir: they set up where LightGBM is and where the temporary files are going to be stored. lgbm_path is the full path to the LightGBM executable, and includes the executable name and file extension (like C:/Laurae/LightGBM/windows/x64/Release/LightGBM.exe). workingdir is the working directory for the temporary files for LightGBM; a lot of files necessary to make LightGBM work are created there (defined by output_model, output_preds, train_conf, train_name, val_name, pred_conf). train_conf, train_name, and val_name define, respectively, the configuration file name, the train file name, and the validation file name. They are created under these names when files_exist is set to FALSE. unicity defines whether to create separate files (if TRUE) or to save space by writing over the same file (if FALSE). Predicting does not work with FALSE. Files take the names you provided (or the default ones), with "_X" added to the file name before the file extension if unicity = FALSE.

Once you have filled these variables (and if they were appropriate), you should fill y_train and x_train. If you need model validation, fill also y_val and x_val. y is your label (a vector), while x is your data.table (preferred), data.frame, or matrix. You are then free to choose everything else, from hyperparameters to verbosity control.

To get the metric tables, you MUST use verbose = FALSE; they cannot be fetched otherwise, and sink() does not work. If for some reason you lose the ability to print in the console, run sink() in the console several times until you get an error.

Examples

## Not run: ------------------------------------
# # 5-fold cross-validated LightGBM, on very simple data.
#
# library(Laurae)
# library(stringi)
# library(Matrix)
# library(sparsity)
# library(data.table)
#
# remove(list = ls()) # WARNING: CLEANS EVERYTHING IN THE ENVIRONMENT
# setwd("C:/LightGBM/temp") # DIRECTORY FOR TEMP FILES
#
# DT <- data.table(Split1 = c(rep(0, 50), rep(1, 50)),
# Split2 = rep(c(rep(0, 25), rep(0.5, 25)), 2))
# DT$Split3 <- rep(c(rep(0, 10), rep(0.25, 15)), 4)
# DT$Split4 <- rep(c(rep(0, 5), rep(0.1, 5), rep(0, 5), rep(0.1, 10)), 4)
# DT$Split5 <- rep(c(rep(0, 5), rep(0.05, 5), rep(0, 10), rep(0.05, 5)), 4)
# label <- as.numeric((DT$Split2 == 0) & (DT$Split1 == 0) & (DT$Split3 == 0))
#
# trained <- lgbm.cv(y_train = label,
# x_train = DT,
# bias_train = NA,
# folds = 5,
# unicity = TRUE,
# application = "binary",
# num_iterations = 1,
# early_stopping_rounds = 1,
# learning_rate = 5,
# num_leaves = 16,
# min_data_in_leaf = 1,
# min_sum_hessian_in_leaf = 1,
# tree_learner = "serial",
# num_threads = 1,
# lgbm_path = "C:/LightGBM/windows/x64/Release/lightgbm.exe",
# workingdir = getwd(),
# validation = FALSE,
# files_exist = FALSE,
# verbose = TRUE,
# is_training_metric = TRUE,
# save_binary = TRUE,
# metric = "binary_logloss")
#
# str(trained)
## ---------------------------------------------
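The example above uses verbose = TRUE, so the per-iteration metric tables are not stored. A hedged follow-up, assuming the defaults workingdir = getwd() and log_name = "lgbm_log.txt", reruns quietly and reads the raw log; parsing it into a table, or locating parsed metrics in the returned object, is left to the reader.

# Sketch: rerun with verbose = FALSE so metric information lands in the log file.
trained_quiet <- lgbm.cv(y_train = label,
                         x_train = DT,
                         folds = 5,
                         application = "binary",
                         num_iterations = 10,
                         metric = "binary_logloss",
                         lgbm_path = "C:/LightGBM/windows/x64/Release/lightgbm.exe",
                         workingdir = getwd(),
                         verbose = FALSE)   # required to store evaluation metrics

log_lines <- readLines(file.path(getwd(), "lgbm_log.txt"))
head(log_lines)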