
Laurae (version 0.0.0.9001)

lgbm.cv.prep: LightGBM Cross-Validated Model Preparation

Description

This function allows you to prepare the cross-validation of a LightGBM model. It is recommended to have your x_train and x_test sets as data.table (or data.frame), and to use the data.table development version. To install the data.table development version, run in your R console: install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table"). SVMLight conversion requires Laurae's sparsity package, which can be installed using devtools::install_github("Laurae2/sparsity"). The SVMLight format uses the .svm file extension. This function does not handle weights or groups.
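For convenience, the installation commands mentioned above can be run directly as below (a minimal sketch; it assumes the devtools package is already installed):

# data.table development version, built from source
install.packages("data.table", type = "source",
                 repos = "http://Rdatatable.github.io/data.table")

# Laurae's sparsity package, needed for SVMLight (.svm) conversion
devtools::install_github("Laurae2/sparsity")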

Usage

lgbm.cv.prep(y_train, x_train, x_test = NA,
  SVMLight = is(x_train, "dgCMatrix"), data_has_label = FALSE,
  NA_value = "nan", workingdir = getwd(),
  train_all = FALSE, test_all = FALSE, cv_all = TRUE,
  train_name = paste0("lgbm_train", ifelse(SVMLight, ".svm", ".csv")),
  val_name = paste0("lgbm_val", ifelse(SVMLight, ".svm", ".csv")),
  test_name = paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")),
  verbose = TRUE, folds = 5, folds_weight = NA, stratified = TRUE,
  fold_seed = 0, fold_cleaning = 50)

Arguments

y_train
Type: vector. The training labels.
x_train
Type: data.table or dgCMatrix (with SVMLight = TRUE). The training features.
x_test
Type: data.table or dgCMatrix (with SVMLight = TRUE). The testing features, if necessary. Not providing a data.frame or a matrix results in at least 3x memory usage. Defaults to NA.
SVMLight
Type: boolean. Whether the input is a dgCMatrix to be written in SVMLight format. Setting this to TRUE requires you to provide labels separately (in y_train), and headers are ignored; this is the default behavior of the SVMLight format. Defaults to is(x_train, "dgCMatrix").
data_has_label
Type: boolean. Whether the data has labels or not. Do not modify this. Defaults to FALSE.
NA_value
Type: numeric or character. The value used to replace NAs. Use "na" if you want to mark values as missing. It is not recommended to use anything else, even a numeric value out of bounds (such as -999 when all your values are greater than -999), unless the missing values have a real numeric meaning. Defaults to "nan".
workingdir
Type: character. The working directory used for LightGBM. Defaults to getwd().
train_all
Type: boolean. Whether the full train data should be exported to the requested format for usage with lgbm.train. Defaults to FALSE.
test_all
Type: boolean. Whether the full test data should be exported to the requested format for usage with lgbm.train. Defaults to FALSE.
cv_all
Type: boolean. Whether the full cross-validation data should be exported to the requested format for usage with lgbm.cv. Defaults to TRUE.
train_name
Type: character. The name of the default training data file for the model. Defaults to paste0('lgbm_train', ifelse(SVMLight, '.svm', '.csv')).
val_name
Type: character. The name of the default validation data file for the model. Defaults to paste0('lgbm_val', ifelse(SVMLight, '.svm', '.csv')).
test_name
Type: character. The name of the testing data file for the model. Defaults to paste0('lgbm_test', ifelse(SVMLight, '.svm', '.csv')).
verbose
Type: boolean. Whether progress is printed when data is written with fwrite. Defaults to TRUE.
folds
Type: integer, vector of two integers, vector of integers, or list. If an integer is supplied, performs a folds-fold cross-validation. If a vector of two integers is supplied, performs a folds[1]-fold cross-validation repeated folds[2] times. If a vector of more than two integers is supplied, it must have the same length as the training data, and each value gives the fold the corresponding observation belongs to. If a list is supplied, each element must contain the indices of one fold, and the folds are treated sequentially. See the sketch after this Arguments list. Defaults to 5.
folds_weight
Type: vector of numerics. The weights assigned to each fold. If no weight is supplied (NA), each fold is given the weight 1/length(folds), i.e. a plain average (which does not mix well with folds of different sizes). When the folds are created automatically by supplying folds as a vector of two integers, the weights are computed automatically. Defaults to NA.
stratified
Type: boolean. Whether the folds should be stratified (keep the same label proportions) or not. Defaults to TRUE.
fold_seed
Type: integer or vector of integers. The seed for the random number generator. If a vector of integers is provided, it should be at least as long as the number of folds. Otherwise (if a single integer is supplied), each fold starts with the provided seed, and 1 is added to the seed for every repeat. Defaults to 0.
fold_cleaning
Type: integer. When using cross-validation, data must be subsampled. This parameter controls the trade-off between RAM usage and speed: the lower the value, the more aggressively memory usage is kept low at the expense of speed. Defaults to 50.
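As a sketch of the accepted forms of folds and folds_weight described above (targets and data stand in for hypothetical label and feature objects; the remaining arguments keep their defaults):

# 5-fold cross-validation (single integer)
lgbm.cv.prep(y_train = targets, x_train = data, folds = 5)

# 3-fold cross-validation repeated 2 times; folds_weight is computed automatically
lgbm.cv.prep(y_train = targets, x_train = data, folds = c(3, 2))

# Explicit fold assignment: one fold id per training observation
lgbm.cv.prep(y_train = targets, x_train = data,
             folds = sample(1:3, size = length(targets), replace = TRUE))

# List of folds: each element holds the row indices of one fold,
# with equal weights supplied manually
lgbm.cv.prep(y_train = targets, x_train = data,
             folds = list(1:500, 501:1000, 1001:1500),
             folds_weight = c(1/3, 1/3, 1/3))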

Value

If cv_all = TRUE, a list containing the folds and folds_weight elements. All files are output and ready to be used with lgbm.cv with files_exist = TRUE. If train_all = TRUE, the exported training data is likewise ready to be used with lgbm.train with files_exist = TRUE. Returns "Success" if cv_all = FALSE and the code does not error mid-way.

Examples

## Not run: ------------------------------------
# Prepare files for cross-validation.
# data.prep <- lgbm.cv.prep(y_train = targets,
#                           x_train = data[1:1500, ],
#                           workingdir = file.path(getwd(), "temp"),
#                           train_name = 'lgbm_train.csv',
#                           val_name = 'lgbm_val.csv',
#                           folds = 3)
## ---------------------------------------------
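As a follow-up sketch (not part of the original example), the files prepared above could then be consumed by lgbm.cv with files_exist = TRUE. The train_conf file name and the reuse of the returned fold list are assumptions about your setup, not values prescribed by this page.

# Reuse the exported files; files_exist = TRUE tells lgbm.cv not to rewrite them.
# folds = data.prep[["folds"]] assumes lgbm.cv accepts the fold list returned
# by lgbm.cv.prep (cv_all = TRUE); 'lgbm_train.conf' is a hypothetical
# LightGBM training configuration file.
# trained.cv <- lgbm.cv(y_train = targets,
#                       x_train = data[1:1500, ],
#                       workingdir = file.path(getwd(), "temp"),
#                       train_conf = 'lgbm_train.conf',
#                       train_name = 'lgbm_train.csv',
#                       val_name = 'lgbm_val.csv',
#                       folds = data.prep[["folds"]],
#                       files_exist = TRUE)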
