CivisML Sparse Logistic
civis_ml_sparse_logistic(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  penalty = c("l2", "l1"),
  dual = FALSE,
  tol = 1e-08,
  C = 499999950,
  fit_intercept = TRUE,
  intercept_scaling = 1,
  class_weight = NULL,
  random_state = 42,
  solver = c("liblinear", "newton-cg", "lbfgs", "sag"),
  max_iter = 100,
  multi_class = c("ovr", "multinomial"),
  fit_params = NULL,
  cross_validation_parameters = NULL,
  calibration = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)
A civis_ml object, a list containing the following elements:
job: job metadata from scripts_get_custom.
run: run metadata from scripts_get_custom_runs.
outputs: CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.
metrics: Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:
  run: list, metadata about the run.
  data: list, metadata about the training data.
  model: list, the fitted scikit-learn model with CV results.
  metrics: list, validation metrics (accuracy, confusion, ROC, AUC, etc).
  warnings: list.
  data_platform: list, training data location.
model_info: Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:
  run: list, metadata about the run.
  data: list, metadata about the training data.
  model: list, the fitted scikit-learn model.
  metrics: empty list.
  warnings: list.
  data_platform: list, training data location.
x: See the Data Sources section below.
dependent_variable: The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
primary_key: Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In predict.civis_ml, the default primary_key = NA uses the primary_key of the training task. Use primary_key = NULL to explicitly indicate that the data have no primary_key.
excluded_columns: Optional, a vector of columns which will be considered ineligible to be independent variables.
penalty: Used to specify the norm used in the penalization. The newton-cg, sag, and lbfgs solvers support only l2 penalties.
dual: Dual or primal formulation. Dual formulation is only implemented for the l2 penalty with the liblinear solver. dual = FALSE should be preferred when n_samples > n_features.
tol: Tolerance for stopping criteria.
C: Inverse of regularization strength, must be a positive float. Smaller values specify stronger regularization.
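As a rough illustration of the inverse relationship, the sketch below converts a few C values into an implied penalty weight 1 / C. The name lambda is introduced here for illustration only and is not a civis_ml parameter; the exact scaling of the penalty inside the solver may differ.

```r
# Illustrative only: treat 1 / C as the implied penalty weight.
# `lambda` is a hypothetical name, not an argument of civis_ml_sparse_logistic.
C_values <- c(0.01, 1, 100, 499999950)
lambda <- 1 / C_values

# Larger C gives a smaller penalty weight, i.e. weaker regularization.
penalty_table <- data.frame(C = C_values, lambda = lambda)
print(penalty_table)
```

With the default C = 499999950, the implied penalty weight is about 2e-09, so the default fit is close to unregularized.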
fit_intercept: Whether a constant (intercept) term should be included in the model.
intercept_scaling: Useful only when solver = "liblinear" and fit_intercept = TRUE. In this case, a constant term with the value intercept_scaling is added to the design matrix.
class_weight: A list with class_label = value pairs, or "balanced". When class_weight = "balanced", the class weights will be inversely proportional to the class frequencies in the input data as:
$$ \frac{\text{n\_samples}}{\text{n\_classes} \times \text{table}(y)} $$
Note, the class weights are multiplied with sample_weight (passed via fit_params) if sample_weight is specified.
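The balanced weighting formula can be worked through in base R. This is a minimal sketch on a made-up label vector y; it only reproduces the arithmetic of the formula and does not call CivisML.

```r
# Toy labels, invented for illustration: three "a" and one "b".
y <- c("a", "a", "a", "b")

class_counts <- table(y)           # a = 3, b = 1
n_samples <- length(y)             # 4
n_classes <- length(class_counts)  # 2

# n_samples / (n_classes * table(y))
balanced <- n_samples / (n_classes * as.vector(class_counts))
names(balanced) <- names(class_counts)

# The same weights written as class_label = value pairs.
class_weight <- as.list(balanced)
print(class_weight)
```

The rarer class "b" receives weight 2, while the more frequent class "a" receives weight 2/3, so each class contributes equally in total.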
random_state: The seed of the random number generator to use when shuffling the data. Used only when solver = "sag" or solver = "liblinear".
solver: Algorithm to use in the optimization problem. For small data, liblinear is a good choice; sag is faster for larger problems. For multiclass problems, only newton-cg, sag, and lbfgs handle multinomial loss; liblinear is limited to one-versus-rest schemes. newton-cg, lbfgs, and sag only handle the l2 penalty. Note that fast convergence of sag is only guaranteed on features with approximately the same scale.
max_iter: The maximum number of iterations taken for the solvers to converge. Useful for the newton-cg, sag, and lbfgs solvers.
multi_class: The scheme for multi-class problems. When ovr, a binary problem is fit for each label. When multinomial, a single model is fit minimizing the multinomial loss. Note, multinomial only works with the newton-cg, sag, and lbfgs solvers.
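The restrictions on penalty, dual, solver, and multi_class described above can be collected into a small local pre-flight check. The helper below, check_logistic_options, is hypothetical and not part of the civis package; it only encodes the rules stated in this page and may not match every check CivisML performs.

```r
# Hypothetical helper (not in the civis package): returns a character vector
# of problems found, or character(0) if the combination looks valid.
check_logistic_options <- function(penalty = "l2", solver = "liblinear",
                                   multi_class = "ovr", dual = FALSE) {
  problems <- character(0)
  l2_only_solvers <- c("newton-cg", "sag", "lbfgs")

  # newton-cg, sag, and lbfgs support only the l2 penalty.
  if (solver %in% l2_only_solvers && penalty != "l2") {
    problems <- c(problems,
                  sprintf("solver '%s' supports only the l2 penalty", solver))
  }
  # multinomial loss is handled only by newton-cg, sag, and lbfgs.
  if (multi_class == "multinomial" && !(solver %in% l2_only_solvers)) {
    problems <- c(problems,
                  sprintf("solver '%s' does not handle multinomial loss", solver))
  }
  # The dual formulation exists only for l2 with liblinear.
  if (dual && !(penalty == "l2" && solver == "liblinear")) {
    problems <- c(problems,
                  "dual = TRUE requires penalty 'l2' with solver 'liblinear'")
  }
  problems
}

check_logistic_options()                               # defaults: no problems
check_logistic_options(penalty = "l1", solver = "sag")
check_logistic_options(multi_class = "multinomial")    # liblinear is ovr only
```

Running a check like this before submitting a job may save a round trip to the Platform when an incompatible combination is requested.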
fit_params: Optional, a mapping from parameter names in the model's fit method to the column names which hold the data, e.g. list(sample_weight = 'survey_weight_column').
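A minimal sketch of such a mapping follows. The column name survey_weight_column comes from the text above, but the weights themselves are randomly generated for illustration, and the final call is guarded because it needs the civis package and Civis Platform credentials.

```r
# Build a training frame with a weight column (weights are made up).
df <- iris
names(df) <- gsub("\\.", "_", names(df))
set.seed(42)
df$survey_weight_column <- runif(nrow(df), min = 0.5, max = 1.5)

# Map the fit() argument name to the column that holds its values.
fit_params <- list(sample_weight = "survey_weight_column")

# Local sanity check: every mapped column must exist in the data.
missing_cols <- setdiff(unlist(fit_params), names(df))
stopifnot(length(missing_cols) == 0)

if (FALSE) {
  # Not run: requires the civis package and Platform credentials.
  m <- civis_ml_sparse_logistic(df, "Species", fit_params = fit_params)
}
```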
cross_validation_parameters: Optional, parameter grid for learner parameters, e.g. list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3)), or "hyperband" for supported models.
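For this model, a grid would more plausibly range over arguments such as C and penalty. The sketch below uses base R's expand.grid to preview the combinations implied by a grid, assuming a standard exhaustive grid search in which every combination is fitted; that assumption is not confirmed by this page.

```r
# A parameter grid using arguments of civis_ml_sparse_logistic.
cv_params <- list(C = c(0.01, 1, 100), penalty = c("l1", "l2"))

# Preview the candidate settings locally (assumes exhaustive grid search).
grid <- expand.grid(cv_params, stringsAsFactors = FALSE)
n_candidates <- nrow(grid)
print(grid)

if (FALSE) {
  # Not run: requires the civis package and Platform credentials.
  # The default liblinear solver supports both l1 and l2 penalties.
  m <- civis_ml_sparse_logistic(df, "Species",
                                cross_validation_parameters = cv_params)
}
```

Because the number of candidates is the product of the lengths of each entry, adding values to a grid can increase training time quickly.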
calibration: Optional, if not NULL, calibrate output probabilities with the selected method, sigmoid or isotonic. Valid only with classification models.
oos_scores_table: Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".
oos_scores_db: Optional, the name of the database where the oos_scores_table will be created. If not provided, this will default to database_name.
oos_scores_if_exists: Optional, action to take if oos_scores_table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".
model_name: Optional, the prefix of the Platform modeling jobs. It will have " Train" or " Predict" added to become the Script title.
cpu_requested: Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.
memory_requested: Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.
disk_requested: Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.
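Since the three resource arguments use different units, a short conversion sketch may help. The variable names cpus, memory_gib, and disk_gb are illustrative and the sizes are arbitrary; only the unit conversions come from the descriptions above.

```r
# Human-friendly resource sizes (arbitrary example values).
cpus <- 2        # whole CPUs wanted
memory_gib <- 8  # memory wanted, in GiB
disk_gb <- 10    # disk wanted, in GB

cpu_requested <- cpus * 1024           # 1024 shares = 1 CPU
memory_requested <- memory_gib * 1024  # 1 GiB = 1024 MiB
disk_requested <- disk_gb              # already in GB

if (FALSE) {
  # Not run: requires the civis package and Platform credentials.
  m <- civis_ml_sparse_logistic(df, "Species",
                                cpu_requested = cpu_requested,
                                memory_requested = memory_requested,
                                disk_requested = disk_requested)
}
```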
notifications: Optional, model status notifications. See scripts_post_custom for further documentation about email and URL notification.
polling_interval: The number of seconds to wait between checks for job completion.
verbose: Optional, if TRUE, supply debug outputs in Platform logs and make prediction child jobs visible.
civisml_version: Optional, a one-length character vector of the CivisML version. The default is "prod", the latest version in production.
For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:
A data.frame: civis_ml(x = df, ...)
A local file: civis_ml(x = "path/to/data.csv", ...)
A file in the Civis Platform: civis_ml(x = civis_file(1234))
A table in the Civis Platform: civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
if (FALSE) {
  df <- iris
  names(df) <- gsub("\\.", "_", names(df))

  m <- civis_ml_sparse_logistic(df, "Species")
  yhat <- fetch_oos_scores(m)

  # Grid search
  cv_params <- list(C = c(0.01, 1, 10, 100, 1000))
  m <- civis_ml_sparse_logistic(df, "Species",
                                cross_validation_parameters = cv_params)

  # Make a prediction job, storing the scores in a Redshift table
  pred_info <- predict(m, newdata = civis_table("schema.table", "my_database"),
                       output_table = "schema.scores_table")
}