Learn R Programming

hardhat (version 0.1.1)

default_xy_blueprint: Default XY blueprint

Description

This pages holds the details for the XY preprocessing blueprint. This is the blueprint used by default from mold() if x and y are provided separately (i.e. the XY interface is used).

Usage

default_xy_blueprint(intercept = FALSE, allow_novel_levels = FALSE)

# S3 method for data.frame mold(x, y, ..., blueprint = NULL)

# S3 method for matrix mold(x, y, ..., blueprint = NULL)

Arguments

intercept

A logical. Should an intercept be included in the processed data? This information is used by the process function in the mold and forge function list.

allow_novel_levels

A logical. Should novel factor levels be allowed at prediction time? This information is used by the clean function in the forge function list, and is passed on to scream().

x

A data frame or matrix containing the predictors.

y

A data frame, matrix, or vector containing the outcomes.

...

Not used.

blueprint

A preprocessing blueprint. If left as NULL, then a default_xy_blueprint() is used.

Value

For default_xy_blueprint(), an XY blueprint.

Mold

When mold() is used with the default xy blueprint:

  • It converts x to a tibble.

  • It adds an intercept column to x if intercept = TRUE.

  • It runs standardize() on y.

Forge

When forge() is used with the default xy blueprint:

  • It calls shrink() to trim new_data to only the required columns and coerce new_data to a tibble.

  • It calls scream() to perform validation on the structure of the columns of new_data.

  • It adds an intercept column onto new_data if intercept = TRUE.

Details

As documented in standardize(), if y is a vector, then the returned outcomes tibble has 1 column with a standardized name of ".outcome".

The one special thing about the XY method's forge function is the behavior of outcomes = TRUE when a vector y value was provided to the original call to mold(). In that case, mold() converts y into a tibble, with a default name of .outcome. This is the column that forge() will look for in new_data to preprocess. See the examples section for a demonstration of this.

Examples

Run this code
# NOT RUN {
# ---------------------------------------------------------------------------
# Setup

train <- iris[1:100,]
test <- iris[101:150,]

train_x <- train[, "Sepal.Length", drop = FALSE]
train_y <- train[, "Species", drop = FALSE]

test_x <- test[, "Sepal.Length", drop = FALSE]
test_y <- test[, "Species", drop = FALSE]

# ---------------------------------------------------------------------------
# XY Example

# First, call mold() with the training data
processed <- mold(train_x, train_y)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(test_x, processed$blueprint)

# ---------------------------------------------------------------------------
# Intercept

processed <- mold(train_x, train_y, blueprint = default_xy_blueprint(intercept = TRUE))

forge(test_x, processed$blueprint)

# ---------------------------------------------------------------------------
# XY Method and forge(outcomes = TRUE)

# You can request that the new outcome columns are preprocessed as well, but
# they have to be present in `new_data`!

processed <- mold(train_x, train_y)

# Can't do this!
try(forge(test_x, processed$blueprint, outcomes = TRUE))

# Need to use the full test set, including `y`
forge(test, processed$blueprint, outcomes = TRUE)

# With the XY method, if the Y value used in `mold()` is a vector,
# then a column name of `.outcome` is automatically generated.
# This name is what forge() looks for in `new_data`.

# Y is a vector!
y_vec <- train_y$Species

processed_vec <- mold(train_x, y_vec)

# This throws an informative error that tell you
# to include an `".outcome"` column in `new_data`.
try(forge(iris, processed_vec$blueprint, outcomes = TRUE))

test2 <- test
test2$.outcome <- test2$Species
test2$Species <- NULL

# This works, and returns a tibble in the $outcomes slot
forge(test2, processed_vec$blueprint, outcomes = TRUE)

# }

Run the code above in your browser using DataLab