CRTreeForest_pred: Complete-Random Tree Forest Predictor implementation in R

Description

This function predicts from a Complete-Random Tree Forest model built with xgboost. The prediction work itself is delegated to CRTreeForest_pred_internals.

Usage

CRTreeForest_pred(model, data, folds = NULL, prediction = FALSE,
  multi_class = NULL, data_start = NULL, return_list = TRUE,
  work_dir = NULL)

Arguments

model

Type: list. A model trained by CRTreeForest.

data

Type: data.table. The data to predict on. If you pass the training data, it is predicted as if it were out-of-fold data and the predictions will overfit, so use the train_preds list from the model instead.

folds

Type: list. The cross-validation folds, as a list, when predicting on the training data. Otherwise, leave NULL. Defaults to NULL.
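
As a rough illustration of what a fold list can look like, the base R sketch below splits row indices into five folds. It assumes folds is a list of integer row-index vectors, one per fold, which is what the Laurae::kfold call in the Examples appears to produce. make_folds is a hypothetical helper, not part of the package, and unlike a stratified splitter it ignores the labels.

```r
# Hypothetical helper: split row indices 1..n into k roughly equal folds.
# Assumption: the folds argument expects a list of integer index vectors.
make_folds <- function(n, k, seed = 0) {
  set.seed(seed)                       # make the shuffle reproducible
  shuffled <- sample(seq_len(n))       # random permutation of row indices
  fold_id <- rep_len(seq_len(k), n)    # assign fold ids 1..k in rotation
  unname(split(shuffled, fold_id))     # list of k index vectors
}

folds_sketch <- make_folds(n = 100, k = 5)
lengths(folds_sketch)  # size of each fold
```

Every row index appears in exactly one fold, so each row can be predicted once by models that did not see it.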

prediction

Type: logical. Whether the predictions of the forest ensemble are averaged. Set it to FALSE for debugging / feature engineering. Setting it to TRUE overrides return_list. Defaults to FALSE.
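
To make the averaging concrete, here is a minimal base R sketch with made-up numbers. It assumes each forest contributes one vector of predicted probabilities per row (the binary case). forest_preds and its values are hypothetical, and the actual averaging happens inside the package.

```r
# Hypothetical per-forest predictions for 4 rows (binary probabilities).
forest_preds <- list(
  forest_1 = c(0.10, 0.80, 0.40, 0.90),
  forest_2 = c(0.20, 0.60, 0.50, 0.70),
  forest_3 = c(0.30, 0.70, 0.60, 0.80)
)

# Column-bind the forests, then average across forests for each row.
pred_matrix <- do.call(cbind, forest_preds)
ensemble_mean <- rowMeans(pred_matrix)
ensemble_mean
```

Keeping the unaveraged columns (prediction = FALSE) is what makes them usable as features for a later model.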

multi_class

Type: numeric. The number of classes. Set to 2 for binary classification or for regression. Set to NULL to let the function try to guess it from the model. Defaults to NULL.

data_start

Type: vector of numeric. The initial prediction labels. Set to NULL if you do not know what you are doing. Defaults to NULL.

return_list

Type: logical. Whether lists should be returned instead of concatenated frames for predictions. Defaults to TRUE.
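
The difference between the two output shapes can be sketched in base R as follows. This assumes one prediction column per forest, as in a binary problem. The names and values are hypothetical, and a plain data.frame stands in here for the data.table the function returns.

```r
# Hypothetical per-forest predictions, kept as a list (return_list = TRUE style).
preds_list <- list(
  forest_1 = c(0.10, 0.80, 0.40),
  forest_2 = c(0.20, 0.60, 0.50)
)

# Concatenated form: one column per forest, one row per observation.
preds_frame <- as.data.frame(preds_list)
dim(preds_frame)    # rows x forests
names(preds_frame)  # column names taken from the list names
```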

work_dir

Type: character, without a trailing slash (ex: "dev/tools/save_in_this_folder"). The working directory where models are stored, if using external model files as memory. Defaults to NULL, which means models are in memory. The function attempts to detect the working directory automatically from the model if it is available.
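
Since the path must not end with a slash, a small base R sketch like the one below can clean a user-supplied path first. normalize_work_dir and the file name are hypothetical and only illustrate the idea; they are not part of the package.

```r
# Remove any trailing slashes so the path matches the documented form.
normalize_work_dir <- function(path) {
  sub("/+$", "", path)
}

work_dir_sketch <- normalize_work_dir("dev/tools/save_in_this_folder/")
work_dir_sketch

# Hypothetical file name, only to show how a path would be joined.
file.path(work_dir_sketch, "example_model.bin")
```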

Value

A data.table or a list (depending on return_list and prediction) containing the predictions made on data by model.

Details

For implementation details of Cascade Forest / Complete-Random Tree Forest / Multi-Grained Scanning / Deep Forest, see this comment by Laurae: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-283942390

Examples

## Not run: ------------------------------------
# # Load libraries
# library(data.table)
# library(Matrix)
# library(xgboost)
# 
# # Create data
# data(agaricus.train, package = "lightgbm")
# data(agaricus.test, package = "lightgbm")
# agaricus_data_train <- data.table(as.matrix(agaricus.train$data))
# agaricus_data_test <- data.table(as.matrix(agaricus.test$data))
# agaricus_label_train <- agaricus.train$label
# agaricus_label_test <- agaricus.test$label
# folds <- Laurae::kfold(agaricus_label_train, 5)
# 
# # Train a model (binary classification)
# model <- CRTreeForest(training_data = agaricus_data_train, # Training data
#                       validation_data = agaricus_data_test, # Validation data
#                       training_labels = agaricus_label_train, # Training labels
#                       validation_labels = agaricus_label_test, # Validation labels
#                       folds = folds, # Folds for cross-validation
#                       nthread = 1, # Change this to use more threads
#                       lr = 1, # Do not touch this unless you are expert
#                       training_start = NULL, # Do not touch this unless you are expert
#                       validation_start = NULL, # Do not touch this unless you are expert
#                       n_forest = 5, # Number of forest models
#                       n_trees = 10, # Number of trees per forest
#                       random_forest = 2, # We want only 2 random forests
#                       seed = 0,
#                       objective = "binary:logistic",
#                       eval_metric = Laurae::df_logloss,
#                       return_list = TRUE, # Set this to FALSE for a data.table output
#                       multi_class = 2, # Modify this for multiclass problems
#                       verbose = " ")
# 
# # Predict from model
# new_preds <- CRTreeForest_pred(model, agaricus_data_test, return_list = FALSE)
# 
# # We can check whether we have equal predictions, it's all TRUE!
# all.equal(model$train_preds, CRTreeForest_pred(model, agaricus_data_train, folds = folds))
# all.equal(model$valid_preds, CRTreeForest_pred(model, agaricus_data_test))
# all.equal(model$train_means, CRTreeForest_pred(model,
#                                                agaricus_data_train,
#                                                folds = folds,
#                                                return_list = FALSE,
#                                                prediction = TRUE))
# all.equal(model$valid_means, CRTreeForest_pred(model,
#                                                agaricus_data_test,
#                                                return_list = FALSE,
#                                                prediction = TRUE))
# 
# # Create a fake multiclass problem
# agaricus_label_train[1:100] <- 2
# 
# # Train a model (multiclass classification)
# model <- CRTreeForest(training_data = agaricus_data_train, # Training data
#                       validation_data = agaricus_data_test, # Validation data
#                       training_labels = agaricus_label_train, # Training labels
#                       validation_labels = agaricus_label_test, # Validation labels
#                       folds = folds, # Folds for cross-validation
#                       nthread = 1, # Change this to use more threads
#                       lr = 1, # Do not touch this unless you are expert
#                       training_start = NULL, # Do not touch this unless you are expert
#                       validation_start = NULL, # Do not touch this unless you are expert
#                       n_forest = 5, # Number of forest models
#                       n_trees = 10, # Number of trees per forest
#                       random_forest = 2, # We want only 2 random forests
#                       seed = 0,
#                       objective = "multi:softprob",
#                       eval_metric = Laurae::df_logloss,
#                       return_list = TRUE, # Set this to FALSE for a data.table output
#                       multi_class = 3, # Modify this for multiclass problems
#                       verbose = " ")
# 
# # Predict from model for multiclass problems
# new_preds <- CRTreeForest_pred(model, agaricus_data_test, return_list = FALSE)
# 
# # We can check whether we have equal predictions, it's all TRUE!
# all.equal(model$train_preds, CRTreeForest_pred(model, agaricus_data_train, folds = folds))
# all.equal(model$valid_preds, CRTreeForest_pred(model, agaricus_data_test))
# all.equal(model$train_means, CRTreeForest_pred(model,
#                                                agaricus_data_train,
#                                                folds = folds,
#                                                return_list = FALSE,
#                                                prediction = TRUE))
# all.equal(model$valid_means, CRTreeForest_pred(model,
#                                                agaricus_data_test,
#                                                return_list = FALSE,
#                                                prediction = TRUE))
## ---------------------------------------------
