partition: Manually Partition into Training and Test Set

Description

Creates a split of the row ids of a Task into a training set and a test set while optionally stratifying on the target column.

For more complex partitions, see the example.

Usage

partition(task, ratio = 0.67, stratify = TRUE, ...)
# S3 method for TaskRegr
partition(task, ratio = 0.67, stratify = TRUE, bins = 3L, ...)
# S3 method for TaskClassif
partition(task, ratio = 0.67, stratify = TRUE, ...)

Arguments

task: (Task)
Task to operate on.
ratio: (numeric(1))
Ratio of observations to put into the training set.
stratify: (logical(1))
If TRUE, stratify on the target variable. For regression tasks, the target variable is first cut into bins bins. See Task$add_strata().
...: (any)
Additional arguments, currently not used.
bins: (integer(1))
Number of bins to cut the target variable into for stratification.

Examples

Run this code

# regression task
task = tsk("boston_housing")

# roughly equal size split while stratifying on the binned response
split = partition(task, ratio = 0.5)
data = data.frame(
  y = c(task$truth(split$train), task$truth(split$test)),
  split = rep(c("train", "predict"), lengths(split))
)
boxplot(y ~ split, data = data)


# classification task
task = tsk("pima")
split = partition(task)

# roughly same distribution of the target label
prop.table(table(task$truth()))
prop.table(table(task$truth(split$train)))
prop.table(table(task$truth(split$test)))


# splitting into 3 disjunct sets, using ResamplingCV and stratification
task = tsk("iris")
task$set_col_roles(task$target_names, add_to = "stratum")
r = rsmp("cv", folds = 3)$instantiate(task)

sets = lapply(1:3, r$train_set)
lengths(sets)
prop.table(table(task$truth(sets[[1]])))

Run the code above in your browser using DataLab