Defining recipes
Variables in recipes can have any type of role, including outcome,
predictor, observation ID, case weights, stratification variables, etc.
recipe objects can be created in several ways. If an analysis only
contains outcomes and predictors, the simplest way to create one is to
use a formula (e.g. y ~ x1 + x2) that does not contain inline
functions such as log(x3) (see the first example below).
Alternatively, a recipe object can be created by first specifying
which variables in a data set should be used and then sequentially
defining their roles (see the last example). This alternative is an
excellent choice when the number of variables is very high, as the
formula method is memory-inefficient with many variables.
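
The role-based alternative can be sketched as follows. This is a minimal illustration, not the package's own example: it assumes the built-in mtcars data as a stand-in data set, and the columns and role names chosen here are arbitrary. Without a formula, variables start with no role and are assigned one with update_role().

```r
library(recipes)

# Start from the data alone: no formula, so no roles are assigned yet.
# (mtcars is only a stand-in data set for this sketch.)
rec_roles <- recipe(mtcars)

# Assign roles sequentially.
rec_roles <- rec_roles %>%
  update_role(mpg, new_role = "outcome") %>%
  update_role(disp, hp, wt, new_role = "predictor") %>%
  update_role(cyl, new_role = "stratification variable")  # arbitrary role name

# summary() lists each variable with its type and current role.
role_info <- summary(rec_roles)
role_info
```

Columns that were never given a role (qsec, for instance) keep a missing role.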
There are two different types of operations that can be sequentially
added to a recipe.
Steps can include operations like scaling a variable, creating
dummy variables or interactions, and so on. More computationally
complex actions such as dimension reduction or imputation can also
be specified.
Checks are operations that conduct specific tests of the data.
When the test is satisfied, the data are returned without issue or
modification. Otherwise, an error is thrown.
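
As a rough sketch of how a check behaves, the example below uses check_missing() on the built-in mtcars data (a stand-in data set, with a missing value introduced by hand). It is an illustration under those assumptions, not a recommended validation setup.

```r
library(recipes)

# A recipe with one check: predictors must not contain missing values.
# (mtcars is a stand-in data set for this sketch.)
checked_rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  check_missing(all_predictors())

checked_prep <- prep(checked_rec, training = mtcars)

# Test satisfied: the data come back without modification.
clean_out <- bake(checked_prep, new_data = head(mtcars))

# Test violated: introduce a missing value and capture the error.
bad_data <- head(mtcars)
bad_data$hp[2] <- NA
check_result <- tryCatch(
  bake(checked_prep, new_data = bad_data),
  error = function(e) e
)
inherits(check_result, "error")
```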
If you have defined a recipe and want to see which steps are included,
use the tidy() method on the recipe object.
Note that the data passed to recipe() need not be the complete data
that will be used to train the steps (by prep()). The recipe
only needs to know the names and types of data that will be used. For
large data sets, head() could be used to pass a smaller data set to
save time and memory.
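
One way to see this in practice is sketched below: the recipe is defined from head() of the data, and the steps are then estimated on the full data. The built-in mtcars data is a stand-in here, and with only 32 rows it merely stands for a large data set.

```r
library(recipes)

# Define the recipe from a few rows only: names and types are all it needs.
# (mtcars is a stand-in data set for this sketch.)
small_rec <- recipe(mpg ~ disp + hp + wt, data = head(mtcars)) %>%
  step_normalize(all_numeric_predictors())

# Estimate the steps on the complete training data.
full_prep <- prep(small_rec, training = mtcars)

# The means and standard deviations come from all rows, so the normalized
# full data are centered at zero with unit standard deviation.
normalized <- bake(full_prep, new_data = mtcars)
colMeans(normalized[, c("disp", "hp", "wt")])
```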
Using recipes
Once a recipe is defined, it needs to be estimated before being
applied to data. Most recipe steps have specific quantities that must be
calculated or estimated. For example, step_normalize() needs to
compute the training set's means and standard deviations for the
selected columns, while step_dummy() needs to determine the factor
levels of selected columns in order to make the appropriate indicator
columns.
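
The sketch below illustrates why these quantities have to be estimated first. It assumes the built-in iris data as a stand-in data set. Once the factor levels are recorded during estimation, new data containing a single level still produces the full set of indicator columns.

```r
library(recipes)

# iris is a stand-in data set; Species is a factor with three levels.
dummy_rec <- recipe(Sepal.Length ~ Species + Sepal.Width, data = iris) %>%
  step_normalize(Sepal.Width) %>%
  step_dummy(Species)

# prep() records the mean/sd of Sepal.Width and the levels of Species.
dummy_prep <- prep(dummy_rec, training = iris)

# New data that contain (and only know about) one species.
setosa_only <- iris[1:3, ]
setosa_only$Species <- droplevels(setosa_only$Species)

# Both indicator columns are still created, using the training levels.
dummy_out <- bake(dummy_prep, new_data = setosa_only)
names(dummy_out)
```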
The two most common applications of recipes are modeling and stand-alone
preprocessing. How the recipe is estimated depends on how it is being
used.
Modeling
The best way to use a recipe for modeling is via the workflows
package. This bundles a model and preprocessor (e.g. a recipe) together
and gives the user a fluent way to train the model/recipe and make
predictions.
library(dplyr)
library(workflows)
library(recipes)
library(parsnip)
data(biomass, package = "modeldata")
# split data
biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]
# With only predictors and outcomes, use a formula:
rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
              data = biomass_tr)
# Now add preprocessing steps to the recipe:
sp_signed <-
  rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors())
sp_signed
## Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          5
##
## Operations:
##
## Centering and scaling for all_numeric_predictors()
## Spatial sign on all_numeric_predictors()
We can create a parsnip model, and then build a workflow with the
model and recipe:
linear_mod <- linear_reg()
linear_sp_sign_wflow <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(sp_signed)
linear_sp_sign_wflow
## == Workflow ==========================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ------------------------------------------------------
## 2 Recipe Steps
##
## * step_normalize()
## * step_spatialsign()
##
## -- Model -------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
To estimate the preprocessing steps and then fit the linear model, a
single call to fit() is used:
linear_sp_sign_fit <- fit(linear_sp_sign_wflow, data = biomass_tr)
When predicting, there is no need to do anything other than call
predict(). This preprocesses the new data in the same manner as the
training set, then gives the data to the linear model prediction code:
predict(linear_sp_sign_fit, new_data = head(biomass_te))
## # A tibble: 6 x 1
##   .pred
##   <dbl>
## 1  18.1
## 2  17.9
## 3  17.2
## 4  18.8
## 5  19.6
## 6  14.6
Stand-alone use of recipes
When using a recipe to generate data for a visualization or to
troubleshoot any problems with the recipe, there are functions that can
be used to estimate the recipe and apply it to new data manually.
Once a recipe has been defined, the prep() function can be
used to estimate quantities required for the operations using a data set
(a.k.a. the training data). prep() returns a recipe.
As an example of using PCA (perhaps to produce a plot):
# Define the recipe
pca_rec <-
  rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors())
Now to estimate the normalization statistics and the PCA loadings:
pca_rec <- prep(pca_rec, training = biomass_tr)
pca_rec
## Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          5
##
## Training data contained 456 data points and no missing data.
##
## Operations:
##
## Centering and scaling for carbon, hydrogen, oxygen, nitrogen, s... [trained]
## PCA extraction with carbon, hydrogen, oxygen, nitrogen, su... [trained]
Note that the estimated recipe shows the actual column names captured by
the selectors.
You can call tidy() on a recipe, whether it is prepped or unprepped,
to learn more about its components.
tidy(pca_rec)
## # A tibble: 2 x 6
##   number operation type      trained skip  id
##    <int> <chr>     <chr>     <lgl>   <lgl> <chr>
## 1      1 step      normalize TRUE    FALSE normalize_AeYA4
## 2      2 step      pca       TRUE    FALSE pca_Zn1yz
You can also tidy() recipe steps with a number or id argument.
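
A brief sketch of both forms follows. It assumes the built-in mtcars data as a stand-in data set, and the step id "pca" is set explicitly here so that it can be looked up by name.

```r
library(recipes)

# mtcars is a stand-in data set; the id "pca" is chosen for this sketch.
tidy_rec <- recipe(mpg ~ disp + hp + wt, data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), id = "pca")

tidy_prep <- prep(tidy_rec, training = mtcars)

# Select a step by position: the estimated normalization statistics.
norm_stats <- tidy(tidy_prep, number = 1)

# Select a step by id: the estimated PCA loadings.
pca_loadings <- tidy(tidy_prep, id = "pca")
```

Setting the id yourself avoids depending on the randomly generated suffixes, such as pca_Zn1yz, that appear in the output above.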
To apply the prepped recipe to a data set, the bake() function is
used in the same manner that predict() would be for
models. This applies the estimated steps to any data set.
bake(pca_rec, head(biomass_te))
## # A tibble: 6 x 6
##     HHV    PC1    PC2     PC3     PC4     PC5
##   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1  18.3 0.730   0.412  0.495   0.333   0.253
## 2  17.6 0.617  -1.41  -0.118  -0.466   0.815
## 3  17.2 0.761  -1.10   0.0550 -0.397   0.747
## 4  18.9 0.0400 -0.950 -0.158   0.405  -0.143
## 5  20.5 0.792   0.732 -0.204   0.465  -0.148
## 6  18.5 0.433   0.127  0.354  -0.0168 -0.0888
In general, the workflow interface to recipes is recommended for most
applications.