new_formula_blueprint: Create a new preprocessing blueprint

Description

These are the base classes for creating new preprocessing blueprints. All blueprints inherit from the one created by new_blueprint(), and the default method-specific blueprints inherit from the other three documented here.

If you want to create your own preprocessing blueprint for a specific method, you will generally subclass one of the method-specific blueprints here. If you want to create a completely new preprocessing blueprint for a totally new preprocessing method (i.e. not the formula, xy, or recipe method), then you should subclass new_blueprint().
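As a quick check of the inheritance described above, the following sketch inspects the class vector of a default formula blueprint. It assumes the hardhat package is installed and relies on default_formula_blueprint(), which is not documented on this page.

```r
library(hardhat)

# default_formula_blueprint() is assumed here as a ready-made formula blueprint.
bp <- default_formula_blueprint()

# The class vector runs from the most specific class to the most general one.
class(bp)

# The default blueprint inherits from the formula blueprint base class,
# which in turn inherits from the blueprint created by new_blueprint().
inherits(bp, "formula_blueprint")
inherits(bp, "hardhat_blueprint")
```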

Usage

new_formula_blueprint(
  mold,
  forge,
  intercept = FALSE,
  allow_novel_levels = FALSE,
  ptypes = NULL,
  formula = NULL,
  indicators = "traditional",
  composition = "tibble",
  ...,
  subclass = character()
)

new_recipe_blueprint(
  mold,
  forge,
  intercept = FALSE,
  allow_novel_levels = FALSE,
  fresh = TRUE,
  bake_dependent_roles = character(),
  composition = "tibble",
  ptypes = NULL,
  recipe = NULL,
  ...,
  subclass = character()
)

new_xy_blueprint(
  mold,
  forge,
  intercept = FALSE,
  allow_novel_levels = FALSE,
  composition = "tibble",
  ptypes = NULL,
  ...,
  subclass = character()
)

new_blueprint(
  mold,
  forge,
  intercept = FALSE,
  allow_novel_levels = FALSE,
  composition = "tibble",
  ptypes = NULL,
  ...,
  subclass = character()
)

Arguments

mold

A named list with two elements, clean and process. See the Mold Functions section of new_blueprint() for details.

forge

A named list with two elements, clean and process. See the Forge Functions section of new_blueprint() for details.

intercept

A logical. Should an intercept be included in the processed data? This information is used by the process function in both the mold and forge function lists.
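A minimal sketch of the effect of intercept, assuming hardhat is installed and using default_formula_blueprint() and the built-in iris data purely for illustration:

```r
library(hardhat)

bp <- default_formula_blueprint(intercept = TRUE)
processed <- mold(Sepal.Length ~ Sepal.Width, iris, blueprint = bp)

# With intercept = TRUE, the processed predictors gain an "(Intercept)" column.
names(processed$predictors)
```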

allow_novel_levels

A logical. Should novel factor levels be allowed at prediction time? This information is used by the clean function in the forge function list, and is passed on to scream().
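The following sketch contrasts the two settings at prediction time. It assumes hardhat is installed; the train and test data frames are made up for illustration, and indicators = "none" is used only so that the factor column stays visible in the output.

```r
library(hardhat)

# Hypothetical data: the level "c" appears only at prediction time.
train <- data.frame(y = c(1, 2, 3), f = factor(c("a", "b", "a")))
test <- data.frame(f = factor(c("a", "c")))

strict_bp <- default_formula_blueprint(indicators = "none")
lenient_bp <- default_formula_blueprint(
  indicators = "none",
  allow_novel_levels = TRUE
)

strict_fit <- mold(y ~ f, train, blueprint = strict_bp)
lenient_fit <- mold(y ~ f, train, blueprint = lenient_bp)

# By default the novel level is expected to trigger a warning and become NA.
strict <- suppressWarnings(forge(test, strict_fit$blueprint))
strict$predictors$f

# With allow_novel_levels = TRUE the novel level is kept.
lenient <- forge(test, lenient_fit$blueprint)
levels(lenient$predictors$f)
```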

ptypes

Either NULL, or a named list with 2 elements, predictors and outcomes, both of which are 0-row tibbles. ptypes is generated automatically at mold() time and is used to validate new_data at prediction time. At mold() time, the information found in blueprint$mold$process()$ptypes is used to set ptypes for the blueprint.
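To see what ends up in ptypes, one can inspect the blueprint returned by mold(). This is a small sketch that assumes hardhat is installed and uses the default formula method on iris:

```r
library(hardhat)

processed <- mold(Sepal.Length ~ Sepal.Width, iris)

# After mold(), the blueprint carries 0-row prototypes of the original columns.
ptypes <- processed$blueprint$ptypes
names(ptypes)
nrow(ptypes$predictors)
nrow(ptypes$outcomes)
```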

formula

Either NULL, or a formula that specifies how the predictors and outcomes should be preprocessed. This argument is set automatically at mold() time.
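The sketch below illustrates that the formula slot is empty on a fresh blueprint and filled in by mold(). It assumes hardhat is installed and uses default_formula_blueprint() for convenience.

```r
library(hardhat)

bp <- default_formula_blueprint()

# Before mold() is called, no formula has been recorded.
is.null(bp$formula)

processed <- mold(Sepal.Length ~ Sepal.Width, iris, blueprint = bp)

# The blueprint returned by mold() now stores the formula.
processed$blueprint$formula
```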

indicators

A single character string. Controls how factors are expanded into dummy variable indicator columns. One of:

"traditional" - The default. Create dummy variables using the traditional model.matrix() infrastructure. Generally this creates K - 1 indicator columns for each factor, where K is the number of levels in that factor.
"none" - Leave factor variables alone. No expansion is done.
"one_hot" - Create dummy variables using a one-hot encoding approach that expands unordered factors into all K indicator columns, rather than K - 1.
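As a rough illustration of two of these options, the sketch below molds the three-level Species factor from iris. It assumes hardhat is installed and uses default_formula_blueprint() to set indicators.

```r
library(hardhat)

# "one_hot": every level of the unordered factor gets its own column.
one_hot <- mold(
  Sepal.Length ~ Species,
  iris,
  blueprint = default_formula_blueprint(indicators = "one_hot")
)
names(one_hot$predictors)

# "none": the factor is passed through without expansion.
none <- mold(
  Sepal.Length ~ Species,
  iris,
  blueprint = default_formula_blueprint(indicators = "none")
)
class(none$predictors$Species)
```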

composition

Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" is chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown.
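A short sketch of requesting matrix output, assuming hardhat is installed. Both predictors are numeric, so the restriction described above is satisfied.

```r
library(hardhat)

bp <- default_formula_blueprint(composition = "matrix")
processed <- mold(
  Sepal.Length ~ Sepal.Width + Petal.Length,
  iris,
  blueprint = bp
)

# The predictors come back as a numeric matrix rather than a tibble.
is.matrix(processed$predictors)
dim(processed$predictors)
```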

...

Name-value pairs for additional elements of blueprints that subclass this blueprint.

subclass

A character vector. The subclasses of this blueprint.

fresh

A logical. Should already-trained operations be re-trained when prep() is called?

bake_dependent_roles

A character vector of recipes column "roles" specifying roles that are required to recipes::bake() new data. Can't be "predictor" or "outcome", as predictors are always required and outcomes are handled by the outcomes argument of forge().

Typically, non-standard roles (such as "id" or "case_weights") are not required to bake() new data. Unless specified by bake_dependent_roles, these non-standard role columns are excluded from the checks done in forge() to validate the column structure of new_data, will not be passed to bake() even if they exist in new_data, and will not be returned in the forge()$extras$roles slot. See the documentation of recipes::add_role() for more information about roles.

recipe

Either NULL, or an unprepped recipe. This argument is set automatically at mold() time.

Value

A preprocessing blueprint, which is a list containing the inputs used as arguments to the function, along with a class specific to the type of blueprint being created.

Mold Functions

blueprint$mold should be a named list with two elements, both of which are functions:

clean: A function that performs initial cleaning of the user's input data to be used in the model.
- Arguments:
  - If this is an xy blueprint, blueprint, x and y.
  - Otherwise, blueprint and data.
- Output: A named list of the following elements:
  - blueprint: The blueprint, returned and potentially updated.
  - If using an xy blueprint:
    - x: The cleaned predictor data.
    - y: The cleaned outcome data.
  - If not using an xy blueprint:
    - data: The cleaned data.
process: A function that performs the actual preprocessing of the data.
- Arguments:
  - If this is an xy blueprint, blueprint, x and y.
  - Otherwise, blueprint and data.
- Output: A named list of 5 elements:
  - blueprint: The blueprint, returned and potentially updated.
  - predictors: A tibble of predictors.
  - outcomes: A tibble of outcomes.
  - ptypes: A named list with 2 elements, predictors and outcomes, where both elements are 0-row tibbles.
  - extras: Varies based on the blueprint. If the blueprint has no extra information, NULL. Otherwise a named list of the extra elements returned by the blueprint.

Both blueprint$mold$clean() and blueprint$mold$process() will be called, in order, from mold().
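The contract above can be sketched with two toy functions for a non-xy blueprint. Everything here is hypothetical: toy_mold_clean(), toy_mold_process(), the outcome column name y, and the empty list standing in for a blueprint are illustrative only and are not part of hardhat. The sketch calls the functions by hand in the order that mold() is described as using. It assumes the tibble package is installed.

```r
library(tibble)

# Hypothetical clean step: coerce the user's data to a tibble.
toy_mold_clean <- function(blueprint, data) {
  list(blueprint = blueprint, data = as_tibble(data))
}

# Hypothetical process step: treat `y` as the outcome and all other
# columns as predictors, and record 0-row prototypes of both.
toy_mold_process <- function(blueprint, data) {
  outcomes <- data[, "y", drop = FALSE]
  predictors <- data[, setdiff(names(data), "y"), drop = FALSE]
  ptypes <- list(predictors = predictors[0, ], outcomes = outcomes[0, ])

  list(
    blueprint = blueprint,
    predictors = predictors,
    outcomes = outcomes,
    ptypes = ptypes,
    extras = NULL
  )
}

toy_mold <- list(clean = toy_mold_clean, process = toy_mold_process)

# A plain list stands in for a real blueprint in this sketch.
blueprint <- list()
raw <- data.frame(x = 1:3, y = c(2, 4, 6))

# clean() runs first, then process() receives its output.
cleaned <- toy_mold$clean(blueprint, raw)
processed <- toy_mold$process(cleaned$blueprint, cleaned$data)
names(processed)
```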

Forge Functions

blueprint$forge should be a named list with two elements, both of which are functions:

clean: A function that performs initial cleaning of new_data.
- Arguments:
  - blueprint, new_data, and outcomes.
- Output: A named list of the following elements:
  - blueprint: The blueprint, returned and potentially updated.
  - predictors: A tibble containing the cleaned predictors.
  - outcomes: A tibble containing the cleaned outcomes.
  - extras: A named list of any extras obtained while cleaning. These are passed on to the process() function for further use.
process: A function that performs the actual preprocessing of the data using the known information in the blueprint.
- Arguments:
  - blueprint, new_data, outcomes, extras.
- Output: A named list of the following elements:
  - blueprint: The blueprint, returned and potentially updated.
  - predictors: A tibble of the predictors.
  - outcomes: A tibble of the outcomes, or NULL.
  - extras: Varies based on the blueprint. If the blueprint has no extra information, NULL. Otherwise a named list of the extra elements returned by the blueprint.

Both blueprint$forge$clean() and blueprint$forge$process() will be called, in order, from forge().
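From the user's side, the result of this pipeline can be inspected by calling forge() on a blueprint returned by mold(). The sketch below assumes hardhat is installed and uses the default formula method on iris. It shows that outcomes is a tibble only when outcomes = TRUE is requested.

```r
library(hardhat)

processed <- mold(Sepal.Length ~ Sepal.Width, iris)
new_data <- head(iris)

# Request the outcomes in addition to the predictors.
with_outcomes <- forge(new_data, processed$blueprint, outcomes = TRUE)
names(with_outcomes)

# By default, outcomes are not processed and the slot is NULL.
without_outcomes <- forge(new_data, processed$blueprint)
is.null(without_outcomes$outcomes)
```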