
Laurae (version

LauraeML: Laurae's Machine Learning (Automated modeling, Automated stacking)


This function attempts to perform automated modeling (fitting machine learning models and selecting features). It is optimized for maximum speed, so the user must do a fair amount of preparation before calling it.


LauraeML(data, label, folds, seed = 0, models = NULL, parallelized = NULL,
  optimize = TRUE, no_train = FALSE, logging = NULL, maximize = TRUE,
  features = 0.5, hyperparams = NULL, n_tries = 50, n_iters = 50,
  early_stop = 5, elites = 0.1, feature_smoothing = 1,
  converge_cont = 0.1, converge_disc = 0.1)


data: Type: data.table (mandatory). The data features. Support for the dgCMatrix format is planned but not yet available.
label: Type: vector (numeric). The labels. For classes, number them starting from 0.
folds: Type: list of numerics. A list in which each element contains the observation rows of one fold. It is passed to your modeling functions.
seed: Type: numeric. The seed for random number generation. Defaults to 0.
models: Type: list of functions. Each function takes the arguments x (numeric vector of hyperparameters), y (numeric vector of feature flags, where the n-th element refers to the n-th feature, with 0 meaning not selected and 1 meaning selected), data (the data data.table), and folds (the folds list). It must transform the data according to the features used, perform validation properly, and return the cross-validated score to optimize. If you do not want to do cross-validation, you are free to skip it, as no check is performed inside the model functions. The number of models trained is available as iters, which is overwritten in the global environment (and which you should increment in the model functions if you intend to use it). The best score is available as hi_score, which is also overwritten in the global environment (you can make use of it in your model functions).
parallelized: Type: parallel socket cluster (makeCluster or similar). When specified, data is split into a list before being fed to the modeling functions (one list per fold, containing the training data first and the testing data second), at the expense of drastically increasing memory usage. Defaults to NULL, to lower memory usage. Specify it if you want pure speed and have enough available RAM to hold the dataset multiple times (length(folds) times).
optimize: Type: boolean. Whether to perform optimization or take everything as is (no optimization of any parameters). Defaults to TRUE, which means an attempt to optimize hyperparameters and/or features.
no_train: Type: boolean. When optimize is FALSE and you only need to create the list used later to train all models, set this to TRUE. Otherwise, leave it alone. Defaults to FALSE.
logging: Type: character. The log file output. The logging must be done in the variable mobile$temp_params. The first column is the ID of the model optimization iteration, the second column is the score of that iteration, the following columns hold the hyperparameters used, and the last columns hold the features used. For a model of index i in models, the log has (n_iters + 1) * n_tries rows and length(hyperparams[[i]][[1]]) + ncol(data) + 2 columns. Defaults to NULL, which means no logging.
maximize: Type: boolean. Whether to maximize (TRUE) or minimize (FALSE) the metric returned by the model functions. Defaults to TRUE.
features: Type: numeric. The approximate proportion of features that should be selected. This parameter is effectively ignored when you underestimate the number of features you really need. Defaults to 0.50, which means an attempt to use only half of the features.
hyperparams: Type: list of lists of numeric vectors. Contains the hyperparameter intervals to optimize, one entry per model function. Each entry must contain 4 vectors with one value per hyperparameter: the means (first), the standard deviations (second), the minimums (third), and the maximums (fourth) allowed. It is still used to fetch the hyperparameters when optimize = FALSE. In that case, pass a single vector per model (containing the hyperparameters used for that model).
n_tries: Type: numeric. The number of tries allowed per optimization iteration of each model. To get the total number of models trained, multiply it by n_iters + 1. Defaults to 50, which means 2550 models trained by default. Unused when optimize = FALSE.
n_iters: Type: numeric. The number of iterations allowed for the optimization of each model. To get the total number of models trained, add 1 to n_iters and multiply by n_tries. Defaults to 50, which means 2550 models trained by default. Unused when optimize = FALSE.
early_stop: Type: numeric. The number of optimization iterations allowed without any improvement of the metric returned by the model functions. Defaults to 5, which means stopping after 6 optimization iterations without improvement of the metric. Unused when optimize = FALSE.
elites: Type: numeric. The proportion of best results taken in each optimization iteration to use as a baseline. The higher the number, the slower the convergence (but the more stable the iteration updates). Must be between 0 and 1. The product of n_tries and elites must be a whole number. Defaults to 0.1.
feature_smoothing: Type: numeric. The smoothing factor applied to feature selection so that strong features are not picked too fast. Must be between 0 and 1. Defaults to 1, which means no smoothing is applied. A lower value decreases the convergence speed.
converge_cont: Type: numeric. The convergence threshold for continuous variables, applied to the largest standard deviation among the hyperparameters. If the standard deviations of all hyperparameters fall below converge_cont during optimization, the optimizer is considered to have converged. Defaults to 0.1.
converge_disc: Type: numeric. The convergence threshold for discrete variables, applied to the single-class probabilities of the features. If the maximum probability (of either 0 or 1) of every feature falls below converge_disc during optimization, the optimizer is considered to have converged. Defaults to 0.1.
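
The contract for the entries of models can be illustrated with a minimal sketch. It assumes the four-argument form described above (x, y, data, folds). The helper name make_lm_model, the use of lm as the learner, the label captured through a closure, and the convention that each element of folds holds the held-out rows are all hypothetical choices made for illustration and are not taken from the package.

```r
# Hypothetical factory: returns a model function with the (x, y, data, folds)
# signature described above. `label` is captured by closure because the
# four-argument form does not carry it.
make_lm_model <- function(label) {
  function(x, y, data, folds) {
    # y is a 0/1 mask over the columns of data; keep the selected ones
    selected <- which(y == 1)
    if (length(selected) == 0) {
      return(Inf)  # nothing selected: worst possible score when minimizing
    }
    feats <- as.data.frame(data)[, selected, drop = FALSE]
    scores <- numeric(length(folds))
    for (i in seq_along(folds)) {
      test_rows <- folds[[i]]  # assumption: each element holds held-out rows
      train <- cbind(feats[-test_rows, , drop = FALSE],
                     .target = label[-test_rows])
      fit <- lm(.target ~ ., data = train)
      pred <- predict(fit, newdata = feats[test_rows, , drop = FALSE])
      scores[i] <- sqrt(mean((pred - label[test_rows])^2))
    }
    mean(scores)  # cross-validated RMSE, to be used with maximize = FALSE
  }
}

# Toy usage on synthetic data (x is unused: lm has no hyperparameters here)
set.seed(0)
toy <- data.frame(a = rnorm(100), b = rnorm(100), noise = rnorm(100))
toy_label <- 2 * toy$a - toy$b + rnorm(100, sd = 0.1)
toy_folds <- list(1:50, 51:100)
model_fn <- make_lm_model(toy_label)
rmse_good <- model_fn(x = numeric(0), y = c(1, 1, 0), data = toy, folds = toy_folds)
rmse_bad  <- model_fn(x = numeric(0), y = c(0, 0, 1), data = toy, folds = toy_folds)
```

Selecting the two informative columns should give a much lower RMSE than selecting only the noise column, which is the kind of signal the feature search relies on.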


The score of the models along with their hyperparameters.


This is a mega function.
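
The argument names (a mean and standard deviation per hyperparameter, n_tries, n_iters, elites, feature_smoothing, converge_cont) suggest a cross-entropy-style search, although this page does not confirm which optimizer is used. Under that assumption, one plausible loop can be sketched as follows. The function ce_sketch and all of its internals are hypothetical and are not the package's implementation. early_stop and converge_disc are left out to keep the sketch short.

```r
# Hypothetical cross-entropy-style search; NOT the package's implementation.
# score_fn(x, y) receives a hyperparameter vector x and a 0/1 feature mask y.
ce_sketch <- function(score_fn, mu, sigma, lower, upper, n_features,
                      features = 0.5, n_tries = 50, n_iters = 50,
                      elites = 0.1, feature_smoothing = 1,
                      converge_cont = 0.1, maximize = TRUE, seed = 0) {
  set.seed(seed)
  n_elite <- round(n_tries * elites)
  stopifnot(isTRUE(all.equal(n_elite, n_tries * elites)), n_elite >= 2)
  k <- length(mu)
  prob <- rep(features, n_features)  # per-feature selection probability
  best <- if (maximize) -Inf else Inf
  for (iter in seq_len(n_iters + 1)) {
    # Sample n_tries candidates: one row per try, one column per parameter
    xs <- matrix(rnorm(n_tries * k, rep(mu, each = n_tries),
                       rep(sigma, each = n_tries)), nrow = n_tries)
    xs <- pmin(pmax(xs, rep(lower, each = n_tries)), rep(upper, each = n_tries))
    ys <- matrix(rbinom(n_tries * n_features, 1, rep(prob, each = n_tries)),
                 nrow = n_tries)
    scores <- vapply(seq_len(n_tries),
                     function(i) score_fn(xs[i, ], ys[i, ]), numeric(1))
    best <- if (maximize) max(best, scores) else min(best, scores)
    # Keep the best fraction of candidates and refit the sampling distributions
    elite <- order(scores, decreasing = maximize)[seq_len(n_elite)]
    mu <- colMeans(xs[elite, , drop = FALSE])
    sigma <- apply(xs[elite, , drop = FALSE], 2, stats::sd)
    prob <- feature_smoothing * colMeans(ys[elite, , drop = FALSE]) +
      (1 - feature_smoothing) * prob
    if (all(sigma < converge_cont)) break  # continuous part has converged
  }
  list(best = best, mean = mu, sd = sigma, prob = prob)
}

# Toy objective: best at x = 3 with feature 1 selected and feature 2 dropped
toy_score <- function(x, y) -(x - 3)^2 + y[1] - y[2]
res <- ce_sketch(toy_score, mu = 0, sigma = 2, lower = -10, upper = 10,
                 n_features = 2, n_tries = 50, n_iters = 30, elites = 0.2,
                 converge_cont = 0.01)
```

With feature_smoothing = 1 the feature probabilities are replaced outright by the elite frequencies. Lower values blend in the previous probabilities, which matches the description of slower convergence.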


## Not run: ------------------------------------
# mega_model <- LauraeML(
#   data = data,
#   label = targets,
#   folds = list(1:1460, 1461:2919),
#   seed = 0,
#   models = list(lgb = LauraeML_lgbreg,
#                 xgb = LauraeML_gblinear),
#   parallelized = FALSE,
#   optimize = TRUE,
#   no_train = FALSE,
#   logging = NULL,
#   maximize = FALSE, # FALSE on RMSE, fast example of doing the worst
#   features = 0.50,
#   hyperparams = list(
#     lgb = list(Mean = c(5, 5, 1, 0.7, 0.7, 0.5, 0.5),
#                Sd = c(3, 3, 1, 0.2, 0.2, 0.5, 0.5),
#                Min = c(1, 1, 0, 0.1, 0.1, 0, 0),
#                Max = c(15, 50, 50, 1, 1, 50, 50)),
#     xgb = list(Mean = c(1, 1, 1),
#                Sd = c(1, 1, 1),
#                Min = c(0, 0, 0),
#                Max = c(2, 2, 2))
#   ),
#   n_tries = 10, # Set this big, preferably 10 * number of features
#   n_iters = 1, # Set this big to like 50
#   early_stop = 2,
#   elites = 0.4,
#   feature_smoothing = 1,
#   converge_cont = 0.5,
#   converge_disc = 0.25
# )
## ---------------------------------------------
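
Before launching a run, it can help to check the training budget and the elites constraint implied by the argument descriptions. The sketch below applies those formulas to the settings of the example above. The helper log_cols and the feature count of 100 are hypothetical and used only for illustration.

```r
# Settings taken from the example call above
n_tries <- 10
n_iters <- 1
elites  <- 0.4

# Total models trained per model function: (n_iters + 1) * n_tries
n_models <- (n_iters + 1) * n_tries      # 20
default_models <- (50 + 1) * 50          # 2550 with the default settings

# n_tries * elites must be a whole number
n_elite <- n_tries * elites              # 4
elite_ok <- isTRUE(all.equal(n_elite, round(n_elite)))

# Log width: hyperparameters + features + 2 (iteration ID and score).
# n_features = 100 is a made-up column count, used only for illustration.
log_cols <- function(n_hyper, n_features) n_hyper + n_features + 2
lgb_log_cols <- log_cols(n_hyper = 7, n_features = 100)  # 109
```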
