These are the base classes for creating new preprocessing blueprints. All
blueprints inherit from the one created by new_blueprint(), and the default
method-specific blueprints inherit from the other three here.

If you want to create your own preprocessing blueprint for a specific method,
you will generally subclass one of the method-specific blueprints here. If
you want to create a completely new preprocessing blueprint for a totally new
preprocessing method (i.e. not the formula, xy, or recipe method), then
you should subclass new_blueprint().
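The inheritance described above can be inspected on a default blueprint. This is a minimal sketch, assuming the hardhat package is installed and that default_xy_blueprint() is built on top of new_xy_blueprint(), as the description suggests; the exact class vector may differ between hardhat versions, so none is printed here.

```r
library(hardhat)

# A default, method-specific blueprint for the xy interface.
bp <- default_xy_blueprint()

# The class vector runs from most specific to most general.
bp_classes <- class(bp)

# It should inherit from the xy base class, which in turn
# inherits from the base class created by new_blueprint().
is_xy <- inherits(bp, "xy_blueprint")
is_base <- inherits(bp, "hardhat_blueprint")
```

A custom blueprint would slot into the same chain by passing its own class name through the subclass argument of one of the constructors on this page.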
new_formula_blueprint(
  mold,
  forge,
  intercept = FALSE,
  allow_novel_levels = FALSE,
  ptypes = NULL,
  formula = NULL,
  indicators = TRUE,
  ...,
  subclass = character()
)

new_recipe_blueprint(
  mold,
  forge,
  intercept = FALSE,
  allow_novel_levels = FALSE,
  fresh = FALSE,
  ptypes = NULL,
  recipe = NULL,
  ...,
  subclass = character()
)

new_xy_blueprint(
  mold,
  forge,
  intercept = FALSE,
  allow_novel_levels = FALSE,
  ptypes = NULL,
  ...,
  subclass = character()
)

new_blueprint(
  mold,
  forge,
  intercept = FALSE,
  allow_novel_levels = FALSE,
  ptypes = NULL,
  ...,
  subclass = character()
)
mold: A named list with two elements, clean and process; see the
new_blueprint() section, Mold Functions, for details.

forge: A named list with two elements, clean and process; see the
new_blueprint() section, Forge Functions, for details.

intercept: A logical. Should an intercept be included in the processed data?
This information is used by the process function in the mold and forge
function lists.

allow_novel_levels: A logical. Should novel factor levels be allowed at
prediction time? This information is used by the clean function in the forge
function list, and is passed on to scream().

ptypes: Either NULL, or a named list with 2 elements, predictors and
outcomes, both of which are 0-row tibbles. ptypes is generated automatically
at mold() time and is used to validate new_data at prediction time. At mold()
time, the information found in blueprint$mold$process()$ptypes is used to set
ptypes for the blueprint.

formula: Either NULL, or a formula that specifies how the predictors and
outcomes should be preprocessed. This argument is set automatically at mold()
time.

indicators: A logical. Should factors be expanded into dummy variables?

...: Name-value pairs for additional elements of blueprints that subclass
this blueprint.

subclass: A character vector. The subclasses of this blueprint.

fresh: Should already trained operations be re-trained when prep() is called?

recipe: Either NULL, or an unprepped recipe. This argument is set
automatically at mold() time.
A preprocessing blueprint, which is a list containing the inputs used as arguments to the function, along with a class specific to the type of blueprint being created.
Mold Functions

blueprint$mold should be a named list with two elements, both of which
are functions:

- clean: A function that performs initial cleaning of the user's input
  data to be used in the model.
  - Arguments:
    - If this is an xy blueprint, blueprint, x and y.
    - Otherwise, blueprint and data.
  - Output: A named list of three elements for an xy blueprint, or two
    elements otherwise:
    - blueprint: The blueprint, returned and potentially updated.
    - If using an xy blueprint:
      - x: The cleaned predictor data.
      - y: The cleaned outcome data.
    - If not using an xy blueprint:
      - data: The cleaned data.
- process: A function that performs the actual preprocessing of the data.
  - Arguments:
    - If this is an xy blueprint, blueprint, x and y.
    - Otherwise, blueprint and data.
  - Output: A named list of 5 elements:
    - blueprint: The blueprint, returned and potentially updated.
    - predictors: A tibble of predictors.
    - outcomes: A tibble of outcomes.
    - ptypes: A named list with 2 elements, predictors and outcomes,
      where both elements are 0-row tibbles.
    - extras: Varies based on the blueprint. If the blueprint has no
      extra information, NULL. Otherwise a named list of the
      extra elements returned by the blueprint.

Both blueprint$mold$clean() and blueprint$mold$process() will be called,
in order, from mold().
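The mold contract above can be sketched with plain functions. This is an illustration only, not hardhat's own implementation: toy_mold, toy_blueprint, and the outcome_name field are hypothetical names, the blueprint is stood in for by a plain list, and the tibble package is assumed to be installed.

```r
library(tibble)

# Hypothetical mold function list for a data-based (non-xy) blueprint.
toy_mold <- list(
  clean = function(blueprint, data) {
    # Initial cleaning: coerce the user's input to a tibble.
    list(blueprint = blueprint, data = as_tibble(data))
  },
  process = function(blueprint, data) {
    # `outcome_name` is an assumed field of this toy blueprint.
    outcome_name <- blueprint$outcome_name
    predictors <- data[, setdiff(names(data), outcome_name)]
    outcomes <- data[, outcome_name]
    # 0-row slices keep column names and types but no rows.
    ptypes <- list(predictors = predictors[0, ], outcomes = outcomes[0, ])
    blueprint$ptypes <- ptypes
    list(
      blueprint = blueprint,
      predictors = predictors,
      outcomes = outcomes,
      ptypes = ptypes,
      extras = NULL
    )
  }
)

# Stand-in for a real blueprint object (an assumption of this sketch).
toy_blueprint <- list(outcome_name = "mpg", ptypes = NULL)

# Call clean() and then process(), in the order described for mold().
cleaned <- toy_mold$clean(toy_blueprint, mtcars[, c("mpg", "cyl", "disp")])
molded <- toy_mold$process(cleaned$blueprint, cleaned$data)
```

The updated blueprint returned by process() carries the 0-row ptypes, which is the information later used to validate new_data at prediction time.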
Forge Functions

blueprint$forge should be a named list with two elements, both of which
are functions:

- clean: A function that performs initial cleaning of new_data.
  - Arguments:
    - blueprint, new_data, and outcomes.
  - Output: A named list of the following elements:
    - blueprint: The blueprint, returned and potentially updated.
    - predictors: A tibble containing the cleaned predictors.
    - outcomes: A tibble containing the cleaned outcomes.
    - extras: A named list of any extras obtained while cleaning. These
      are passed on to the process() function for further use.
- process: A function that performs the actual preprocessing of the data
  using the known information in the blueprint.
  - Arguments:
    - blueprint, new_data, outcomes, extras.
  - Output: A named list of the following elements:
    - blueprint: The blueprint, returned and potentially updated.
    - predictors: A tibble of the predictors.
    - outcomes: A tibble of the outcomes, or NULL.
    - extras: Varies based on the blueprint. If the blueprint has no
      extra information, NULL. Otherwise a named list of the
      extra elements returned by the blueprint.

Both blueprint$forge$clean() and blueprint$forge$process() will be called,
in order, from forge().
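A matching sketch for the forge side follows. Again this is illustrative rather than hardhat's implementation: toy_forge and toy_blueprint are hypothetical, the blueprint is a plain list, and two assumptions are made that the text above does not spell out, namely that outcomes is a logical flag and that the cleaned predictors are what get passed to process() as new_data. The tibble package is assumed to be installed.

```r
library(tibble)

# Hypothetical forge function list that validates new_data against ptypes.
toy_forge <- list(
  clean = function(blueprint, new_data, outcomes) {
    new_data <- as_tibble(new_data)
    predictor_names <- names(blueprint$ptypes$predictors)
    # Fail early if a predictor seen at mold() time is absent.
    missing_cols <- setdiff(predictor_names, names(new_data))
    if (length(missing_cols) > 0) {
      stop("Missing predictor columns: ", paste(missing_cols, collapse = ", "))
    }
    # Assumption: `outcomes` is a logical flag requesting outcome columns.
    outcome_data <- NULL
    if (outcomes) {
      outcome_data <- new_data[, names(blueprint$ptypes$outcomes)]
    }
    list(
      blueprint = blueprint,
      predictors = new_data[, predictor_names],
      outcomes = outcome_data,
      extras = NULL
    )
  },
  process = function(blueprint, new_data, outcomes, extras) {
    # No further preprocessing in this toy example.
    list(
      blueprint = blueprint,
      predictors = new_data,
      outcomes = outcomes,
      extras = extras
    )
  }
)

# Stand-in blueprint holding 0-row prototypes (an assumption of this sketch).
toy_blueprint <- list(
  ptypes = list(
    predictors = tibble(cyl = double(), disp = double()),
    outcomes = tibble(mpg = double())
  )
)

# Call clean() and then process(), in the order described for forge().
cleaned <- toy_forge$clean(toy_blueprint, mtcars[1:3, ], outcomes = FALSE)
forged <- toy_forge$process(
  cleaned$blueprint, cleaned$predictors, cleaned$outcomes, cleaned$extras
)
```

Extra columns in new_data are dropped by selecting only the predictor names stored in ptypes, while a missing predictor raises an error before any processing happens.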