civis_ml_sparse_linear_regressor: CivisML Sparse Linear Regression

Description

CivisML Sparse Linear Regression

Usage

civis_ml_sparse_linear_regressor(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  fit_intercept = TRUE,
  normalize = FALSE,
  fit_params = NULL,
  cross_validation_parameters = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)

Value

A civis_ml object, a list containing the following elements:

job

job metadata from scripts_get_custom.

run

run metadata from scripts_get_custom_runs.

outputs

CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.

metrics

Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:

run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model with CV results.
metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc).
warnings list.
data_platform list, training data location.

model_info

Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:

run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model.
metrics empty list.
warnings list.
data_platform list, training data location.
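
The element names listed above can be used to inspect a fitted model with ordinary list access. The following is a minimal sketch: describe_civis_ml is a hypothetical helper (not part of the civis package) that only relies on the top-level names documented in this section, so it accepts any list with that shape.

```r
# Hypothetical helper, not part of the civis package: summarise the
# top-level pieces of a civis_ml object using the names documented above.
describe_civis_ml <- function(m) {
  expected <- c("job", "run", "outputs", "metrics", "model_info")
  absent <- setdiff(expected, names(m))
  if (length(absent) > 0) {
    stop("Missing civis_ml elements: ", paste(absent, collapse = ", "))
  }
  list(
    metrics_fields     = names(m$metrics),     # e.g. run, data, model, ...
    model_info_fields  = names(m$model_info),
    validation_metrics = m$metrics$metrics,    # parsed from metrics.json
    warnings           = m$metrics$warnings
  )
}
```

With a fitted model m, describe_civis_ml(m)$validation_metrics would return the parsed validation metrics, and m$job and m$run hold the job and run metadata.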

Arguments

x: See the Data Sources section below.
dependent_variable: The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
primary_key: Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In predict.civis_ml, the default primary_key = NA means that the primary_key of the training task is used. Use primary_key = NULL to explicitly indicate that the data have no primary_key.
excluded_columns: Optional, a vector of columns which will be considered ineligible to be independent variables.
fit_intercept: Should an intercept term be included in the model? If FALSE, no intercept will be included; in this case the data are expected to be centered already.
normalize: If TRUE, the regressors will be normalized before fitting the model. normalize is ignored when fit_intercept = FALSE.
fit_params: Optional, a mapping from parameter names in the model's fit method to the column names which hold the data, e.g. list(sample_weight = 'survey_weight_column').
cross_validation_parameters: Optional, parameter grid for learner parameters, e.g. list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3)) or "hyperband" for supported models.
oos_scores_table: Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".
oos_scores_db: Optional, the name of the database where the oos_scores_table will be created. If not provided, this will default to database_name.
oos_scores_if_exists: Optional, action to take if oos_scores_table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".
model_name: Optional, the prefix of the Platform modeling jobs. It will have " Train" or " Predict" added to become the Script title.
cpu_requested: Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.
memory_requested: Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.
disk_requested: Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.
notifications: Optional, model status notifications. See scripts_post_custom for further documentation about email and URL notification.
polling_interval: Optional, the number of seconds to wait between checks for job completion.
verbose: Optional, if TRUE, supply debug outputs in Platform logs and make prediction child jobs visible.
civisml_version: Optional, a one-length character vector of the CivisML version. The default is "prod", the latest version in production.
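
Several of these arguments use units that are easy to get wrong (CPU shares, MiB). The sketch below shows one way to assemble the arguments before training. sparse_lm_args is a hypothetical helper, not part of the civis package, and its cpus and memory_gib parameters are illustrative assumptions. The training call itself is wrapped in if (FALSE) because it needs Civis Platform credentials.

```r
# Hypothetical helper, not part of the civis package: build an argument
# list for civis_ml_sparse_linear_regressor with unit conversions applied.
sparse_lm_args <- function(x, dependent_variable, weight_column = NULL,
                           cpus = 1, memory_gib = 2) {
  args <- list(
    x = x,
    dependent_variable = dependent_variable,
    fit_intercept = TRUE,
    cpu_requested = cpus * 1024,          # 1024 shares = 1 CPU
    memory_requested = memory_gib * 1024  # memory is requested in MiB
  )
  if (!is.null(weight_column)) {
    # Map the fit method's sample_weight parameter to a data column.
    args$fit_params <- list(sample_weight = weight_column)
  }
  args
}

args <- sparse_lm_args(ChickWeight, "weight", cpus = 2, memory_gib = 4)

if (FALSE) {
  # Requires the civis package and a Civis Platform API key.
  m <- do.call(civis::civis_ml_sparse_linear_regressor, args)
}
```

Keeping the conversions in one place avoids passing, say, 2 where 2048 CPU shares were intended.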

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame: civis_ml(x = df, ...)
local CSV file: civis_ml(x = "path/to/data.csv", ...)
file in Civis Platform: civis_ml(x = civis_file(1234))
table in Civis Platform: civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))

Examples


if (FALSE) {
  library(civis)

  # Train on a local data.frame and fetch the out-of-sample scores
  data(ChickWeight)
  m <- civis_ml_sparse_linear_regressor(ChickWeight, dependent_variable = "weight")
  yhat <- fetch_oos_scores(m)

  # Make a prediction job, storing the scores in a Redshift table
  pred_info <- predict(m,
                       newdata = civis_table("schema.table", "my_database"),
                       output_table = "schema.scores_table")
}
