Requires the development version of data.table, which can be installed with:

install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table")

SVMLight conversion requires Laurae's sparsity package, which can be installed using devtools::install_github("Laurae2/sparsity"). The SVMLight format extension used is .svm.
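Since SVMLight mode expects a dgCMatrix, a sparse input can be built with the Matrix package. A minimal sketch (the data values here are made up):

```r
# Minimal sketch: build a dgCMatrix suitable for SVMLight = TRUE.
library(Matrix)

# Mostly-zero toy data (made-up values)
x_dense <- matrix(c(1, 0, 0, 0, 2, 0, 0, 0, 3), nrow = 3)
x_train <- Matrix(x_dense, sparse = TRUE)

class(x_train)            # "dgCMatrix"
is(x_train, "dgCMatrix")  # TRUE, so SVMLight would default to TRUE
```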
Does not handle weights or groups.

lgbm.cv.prep(y_train, x_train, x_test = NA, SVMLight = is(x_train, "dgCMatrix"),
  data_has_label = FALSE, NA_value = "nan", workingdir = getwd(),
  train_all = FALSE, test_all = FALSE, cv_all = TRUE,
  train_name = paste0("lgbm_train", ifelse(SVMLight, ".svm", ".csv")),
  val_name = paste0("lgbm_val", ifelse(SVMLight, ".svm", ".csv")),
  test_name = paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")),
  verbose = TRUE, folds = 5, folds_weight = NA, stratified = TRUE,
  fold_seed = 0, fold_cleaning = 50)
y_train: The training labels.

x_train: The training features (a data.table or data.frame, or a dgCMatrix with SVMLight = TRUE).

x_test: The testing features, if necessary (a data.table or data.frame, or a dgCMatrix with SVMLight = TRUE). Not providing a data.frame or a matrix results in at least 3x memory usage. Defaults to NA.

SVMLight: Whether to use the SVMLight format. Defaults to is(x_train, "dgCMatrix").

data_has_label: Whether the data already contains the labels. TRUE enforces that you must provide labels separately (in y_train) and that headers will be ignored; this is the default behavior of the SVMLight format. Defaults to FALSE.

NA_value: The value used to represent missing values; use "na" if you want to specify "missing". It is not recommended to use anything else, even something like a numeric value out of bounds (like -999 if all your values are greater than -999). You should change the default "na" only if missing values have a real numeric meaning. Defaults to "na".

workingdir: The working directory in which the files are written. Defaults to getwd().

train_all: Whether to output the full training data, ready to use with lgbm.train. Defaults to FALSE.

test_all: Whether to output the full testing data, ready to use with lgbm.train. Defaults to FALSE.

cv_all: Whether to output the per-fold files, ready to use with lgbm.cv. Defaults to TRUE.

train_name: The name of the training file. Defaults to paste0("lgbm_train", ifelse(SVMLight, ".svm", ".csv")).

val_name: The name of the validation file. Defaults to paste0("lgbm_val", ifelse(SVMLight, ".svm", ".csv")).

test_name: The name of the testing file. Defaults to paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")).

verbose: Whether progress is printed as fwrite outputs the data. Defaults to TRUE.

folds: The fold specification. If a single integer is supplied, performs a folds-fold cross-validation. If a vector of two integers is supplied, performs a folds[1]-fold cross-validation repeated folds[2] times. If a vector of more than two integers is supplied, each integer should refer to a fold, and the vector must have the same length as the training data. Otherwise (if a list is supplied), each element of the list must refer to a fold, and they are treated sequentially. Defaults to 5.

folds_weight: The weight of each fold. When not supplied (NA), the weights are automatically set to rep(1/length(folds), length(folds)) for an average (which does not mix well with folds of different sizes). When the folds are created automatically by supplying a vector of two integers to folds, the weights are computed automatically. Defaults to NA.

stratified: Whether the folds should be stratified. Defaults to TRUE.

fold_seed: The seed used to create the folds. If an integer is supplied, each fold starts with the provided seed, and 1 is added to the seed for every repeat. Defaults to 0.

fold_cleaning: Defaults to 50.
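The folds argument accepts several shapes. A short sketch in plain R of the accepted forms, plus an illustration of how an integer specification could be expanded into explicit fold indices (this expansion is only an illustration, not the package's internal code):

```r
# The shapes 'folds' may take (illustrative values, 100 training rows):
folds_int    <- 5              # 5-fold cross-validation
folds_repeat <- c(5, 2)        # 5-fold cross-validation repeated 2 times
folds_vector <- rep(1:5, 20)   # one fold id per training row
folds_list   <- split(1:100, rep(1:5, 20))  # explicit row indices per fold

# Illustrative expansion of the integer spec into explicit folds
# (not the package's internal code):
set.seed(0)                    # mirrors fold_seed = 0
n <- 100
fold_id <- sample(rep_len(seq_len(folds_int), n))
folds_list2 <- split(seq_len(n), fold_id)
lengths(folds_list2)           # 20 rows per fold
```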
Returns the folds and folds_weight elements in a list if cv_all = TRUE. All files are output and ready to use for lgbm.cv with files_exist = TRUE. If using train_all, the output is ready to be used with lgbm.train and files_exist = TRUE. Returns "Success" if cv_all = FALSE and the code does not error mid-way.

## Not run: ------------------------------------
# Prepare files for cross-validation.
# trained.cv <- lgbm.cv(y_train = targets,
# x_train = data[1:1500, ],
# workingdir = file.path(getwd(), "temp"),
# train_conf = 'lgbm_train.conf',
# train_name = 'lgbm_train.csv',
# val_name = 'lgbm_val.csv',
# folds = 3)
## ---------------------------------------------
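When cv_all = TRUE, the returned folds and folds_weight can be used to aggregate per-fold metrics. A sketch with made-up scores (the averaging shown is the plain weighted mean implied by the folds_weight description, not code from the package):

```r
# Made-up per-fold validation scores for a 5-fold run
fold_scores <- c(0.71, 0.69, 0.73, 0.70, 0.72)

# Default weighting described for folds_weight: an equal-weight average
folds_weight <- rep(1 / length(fold_scores), length(fold_scores))

# Weighted cross-validation score
cv_score <- sum(fold_scores * folds_weight)
cv_score   # equals mean(fold_scores) when the weights are equal
```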