Down-sampling is intended to be performed on the training set
alone. For this reason, the default is skip = TRUE. It is
advisable to use prep(recipe, retain = TRUE) when preparing
the recipe; in this way juice() can be used to obtain the
down-sampled version of the data.
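
The workflow above can be sketched as follows. This is a minimal example under stated assumptions, not a definitive reference: the data frame `train` and its columns are hypothetical, and `library(themis)` is loaded on the assumption that a recent installation provides step_downsample() there (older recipes releases exported it directly).

```r
library(recipes)
library(themis)  # assumption: step_downsample() lives here in recent releases

set.seed(1)

# Hypothetical imbalanced training data: 90 rows of "a", 10 rows of "b"
train <- data.frame(
  class = factor(c(rep("a", 90), rep("b", 10))),
  x = rnorm(100)
)

rec <- recipe(class ~ x, data = train) %>%
  step_downsample(class)  # skip = TRUE is the default

# retain = TRUE keeps the processed training set inside the prepared recipe
trained <- prep(rec, training = train, retain = TRUE)

# juice() returns the retained, down-sampled training data
down <- juice(trained)
table(down$class)

# bake() skips the step because skip = TRUE, so no rows are removed
new <- bake(trained, new_data = train)
nrow(new)
```

Because the step is skipped in bake(), data passed there keep all of their rows; only the training data returned by juice() are down-sampled.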

If there are missing values in the factor variable that is used
to define the sampling, missing data are selected at random in
the same way that the other factor levels are sampled. Missing
values are not used to determine the amount of data in the
minority level.

For any data with factor levels occurring with the same
frequency as the minority level, all data will be retained.
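
The rules for missing values and tied levels can be illustrated with a small base R sketch. The helper `downsample_sketch()` and the example data are hypothetical and written only to mirror the description above; this is not the package's implementation.

```r
# Hypothetical helper illustrating the sampling rules described above
downsample_sketch <- function(data, column) {
  x <- data[[column]]

  # table() ignores NA by default, so missing values do not
  # influence the size of the minority level
  counts <- table(x)
  n_min <- min(counts[counts > 0])

  # Treat NA as its own group so that it is sampled like any other level
  groups <- split(seq_len(nrow(data)), addNA(x, ifany = TRUE))

  keep <- unlist(lapply(groups, function(idx) {
    if (length(idx) <= n_min) {
      idx  # levels as frequent as the minority level are retained in full
    } else {
      idx[sample.int(length(idx), n_min)]
    }
  }))

  data[sort(unname(keep)), , drop = FALSE]
}

set.seed(1)
dat <- data.frame(
  class = factor(c(rep("a", 6), rep("b", 2), rep("c", 2), rep(NA, 4))),
  x = seq_len(14)
)

out <- downsample_sketch(dat, "class")
table(out$class, useNA = "ifany")
```

In this example the minority size is 2 (levels "b" and "c"), so both of those levels are kept whole, while "a" and the missing-value rows are each reduced to 2 randomly chosen rows.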

All columns in the data are sampled and returned by juice()
and bake().

Keep in mind that the position of the down-sampling step within
the recipe may affect the results. For example, if the data are
also centered and scaled, it is not clear whether those
operations should be conducted before or after rows are removed.
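
The effect of step order can be seen by comparing the centering value estimated under each ordering. This is a sketch under assumptions: the data are hypothetical, `library(themis)` is assumed to provide step_downsample(), and tidy() is used to read back the mean that step_center() estimated.

```r
library(recipes)
library(themis)  # assumption: step_downsample() lives here in recent releases

set.seed(2)

# Hypothetical data in which the predictor differs strongly between classes
train <- data.frame(
  class = factor(c(rep("a", 90), rep("b", 10))),
  x = c(rnorm(90, mean = 0), rnorm(10, mean = 5))
)

# Centering first: the mean is estimated from all 100 rows
center_first <- recipe(class ~ x, data = train) %>%
  step_center(x) %>%
  step_downsample(class)

# Down-sampling first: the mean is estimated from the rows that remain
downsample_first <- recipe(class ~ x, data = train) %>%
  step_downsample(class) %>%
  step_center(x)

mean_all <- unname(
  tidy(prep(center_first, training = train), number = 1)$value
)
mean_down <- unname(
  tidy(prep(downsample_first, training = train), number = 2)$value
)

c(all_rows = mean_all, down_sampled = mean_down)
```

The two means differ because the down-sampled data give the minority class far more weight. Neither ordering is correct in general; the choice depends on which population the centering statistics are meant to describe.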