CivisML Gradient Boosting Regressor
Usage

civis_ml_gradient_boosting_regressor(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  loss = c("ls", "lad", "huber", "quantile"),
  learning_rate = 0.1,
  n_estimators = 500,
  subsample = 1,
  criterion = c("friedman_mse", "mse", "mae"),
  min_samples_split = 2,
  min_samples_leaf = 1,
  min_weight_fraction_leaf = 0,
  max_depth = 2,
  min_impurity_split = 1e-07,
  random_state = 42,
  max_features = "sqrt",
  alpha = 0.9,
  max_leaf_nodes = NULL,
  presort = c("auto", TRUE, FALSE),
  fit_params = NULL,
  cross_validation_parameters = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)
Value

A civis_ml object, a list containing the following elements:

job: job metadata from scripts_get_custom.

run: run metadata from scripts_get_custom_runs.

outputs: CivisML metadata from scripts_list_custom_runs_outputs, containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.

metrics: parsed CivisML output from metrics.json, containing metadata from validation. A list containing the following elements:
  run: list, metadata about the run.
  data: list, metadata about the training data.
  model: list, the fitted scikit-learn model with CV results.
  metrics: list, validation metrics (accuracy, confusion, ROC, AUC, etc.).
  warnings: list.
  data_platform: list, training data location.

model_info: parsed CivisML output from model_info.json, containing metadata from training. A list containing the following elements:
  run: list, metadata about the run.
  data: list, metadata about the training data.
  model: list, the fitted scikit-learn model.
  metrics: empty list.
  warnings: list.
  data_platform: list, training data location.
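A minimal sketch of inspecting the returned list, assuming a fitted model m as in the Examples section and the element names described above:

```r
# Assuming `m` is the civis_ml object returned by
# civis_ml_gradient_boosting_regressor() (see Examples below).

# Validation metrics parsed from metrics.json:
val_metrics <- m$metrics$metrics

# Run and training-data metadata parsed from model_info.json:
run_meta  <- m$model_info$run
data_meta <- m$model_info$data
```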
Arguments

x: See the Data Sources section below.

dependent_variable: The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.

primary_key: Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In predict.civis_ml, the primary_key of the training task is used by default (primary_key = NA). Use primary_key = NULL to explicitly indicate the data have no primary_key.

excluded_columns: Optional, a vector of columns which will be considered ineligible to be independent variables.
loss: The loss function to be optimized. "ls" refers to least squares regression. "lad" (least absolute deviation) is a highly robust loss function based solely on order information of the input variables. "huber" is a combination of the two. "quantile" allows quantile regression (use alpha to specify the quantile).

learning_rate: The learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

n_estimators: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better predictive performance.

subsample: The fraction of samples to be used for fitting individual base learners. If smaller than 1.0, this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.

criterion: The function to measure the quality of a split. The default value criterion = "friedman_mse" is generally the best as it can provide a better approximation in some cases.
min_samples_split: The minimum number of samples required to split an internal node. If an integer, then consider min_samples_split as the minimum number. If a float, then min_samples_split is a percentage and ceiling(min_samples_split * n_samples) is the minimum number of samples for each split.

min_samples_leaf: The minimum number of samples required to be in a leaf node. If an integer, then consider min_samples_leaf as the minimum number. If a float, then min_samples_leaf is a percentage and ceiling(min_samples_leaf * n_samples) is the minimum number of samples for each leaf node.

min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights required to be at a leaf node.

max_depth: Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.

min_impurity_split: Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold; otherwise it is a leaf.

random_state: The seed of the random number generator.
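As a quick illustration of the percentage semantics described above (the training-set size here is hypothetical):

```r
min_samples_leaf <- 0.05  # a float, so interpreted as a percentage of n_samples
n_samples <- 578          # hypothetical training-set size

# Minimum number of samples required in each leaf node:
ceiling(min_samples_leaf * n_samples)  # 29
```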
max_features: The number of features to consider when looking for the best split.
  If an integer, then consider max_features features at each split.
  If a float, then max_features is a percentage and max_features * n_features features are considered at each split.
  If "auto", then max_features = sqrt(n_features).
  If "sqrt", then max_features = sqrt(n_features).
  If "log2", then max_features = log2(n_features).
  If NULL, then max_features = n_features.
alpha: The alpha-quantile of the huber loss function and the quantile loss function. Ignored unless loss = "huber" or loss = "quantile".

max_leaf_nodes: Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If max_leaf_nodes = NULL, the number of leaf nodes is unlimited.

presort: Whether to presort the data to speed up the finding of best splits in fitting.

fit_params: Optional, a mapping from parameter names in the model's fit method to the column names which hold the data, e.g. list(sample_weight = 'survey_weight_column').
cross_validation_parameters: Optional, parameter grid for learner parameters, e.g. list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3)), or "hyperband" for supported models.

oos_scores_table: Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".

oos_scores_db: Optional, the name of the database where the oos_scores_table will be created. If not provided, this will default to database_name.

oos_scores_if_exists: Optional, action to take if oos_scores_table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".
model_name: Optional, the prefix of the Platform modeling jobs. It will have " Train" or " Predict" added to become the Script title.

cpu_requested: Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.

memory_requested: Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.

disk_requested: Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.

notifications: Optional, model status notifications. See scripts_post_custom for further documentation about email and URL notifications.

polling_interval: The number of seconds to wait between checks for job completion.

verbose: Optional, if TRUE, supply debug outputs in Platform logs and make prediction child jobs visible.

civisml_version: Optional, a one-length character vector of the CivisML version. The default is "prod", the latest version in production.
Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

a local data.frame:            civis_ml(x = df, ...)
a local CSV file:              civis_ml(x = "path/to/data.csv", ...)
a file in the Civis Platform:  civis_ml(x = civis_file(1234))
a table in the Civis Platform: civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
Examples

if (FALSE) {
data(ChickWeight)
m <- civis_ml_gradient_boosting_regressor(ChickWeight,
  dependent_variable = "weight",
  learning_rate = .01,
  n_estimators = 100,
  subsample = .5,
  max_depth = 5,
  max_features = NULL)
yhat <- fetch_oos_scores(m)

# Grid Search
cv_params <- list(
  n_estimators = c(100, 200, 500),
  learning_rate = c(.01, .1),
  max_depth = c(2, 3))
m <- civis_ml_gradient_boosting_regressor(ChickWeight,
  dependent_variable = "weight",
  subsample = .5,
  max_features = NULL,
  cross_validation_parameters = cv_params)
pred_info <- predict(m, civis_table("schema.table", "my_database"),
  output_table = "schema.scores_table")
}