- data
`data.frame`. Can be grouped, in which case
the function is applied group-wise.
- k
Depends on `method`.
Number of folds (default), fold size, or step size (see `method`).
When `num_fold_cols` > 1, `k` can also be a vector with one `k` per fold column.
This allows trying multiple `k` settings at a time. Note that the generated
fold columns are not guaranteed to be in the order of `k`.
Given as whole number or percentage (0 < `k` < 1).
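The two ways of passing `k` can be sketched as follows. This is a minimal illustration that assumes the `groupdata2` package is installed; the data frame `df` and its column `x` are hypothetical.

```r
library(groupdata2)

# Hypothetical toy data with 20 rows.
df <- data.frame(x = 1:20)

set.seed(1)

# A single whole number: with the default method ("n_dist"),
# k is the number of folds.
single <- fold(df, k = 4)
nlevels(single$.folds)

# A vector with one k per fold column (here 2 fold columns).
multi <- fold(df, k = c(2, 4), num_fold_cols = 2)
grep("^\\.folds", colnames(multi), value = TRUE)
```

Because the fold columns are not guaranteed to follow the order of `k`, it is safer to count the levels of each column than to assume that the first column used the first `k`.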
- cat_col
Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually want
both classes represented in every fold.
N.B. If also passing an `id_col`, `cat_col` should be constant within each ID.
- num_col
Name of numerical variable to balance between folds.
N.B. When used with `id_col`, values for each ID are aggregated using
`id_aggregation_fn` before being balanced.
N.B. When passing `num_col`, the `method` parameter is ignored.
- id_col
Name of factor with IDs.
This will be used to keep all rows that share an ID in the same fold
(if possible).
E.g. If we have measured a participant multiple times and want to see the
effect of time, we want to have all observations of this participant in
the same fold.
N.B. When `data` is a grouped `data.frame` (see `dplyr::group_by()`),
IDs that appear in multiple groupings might end up in different folds
in those groupings.
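As a rough illustration of combining `cat_col` and `id_col`, the sketch below folds a small, hypothetical dataset (the column names `participant`, `diagnosis` and `session` are made up for this example) and then checks how many folds each participant ended up in. It assumes the `groupdata2` package is installed.

```r
library(groupdata2)

# Hypothetical data: 6 participants measured 3 times each.
# The diagnosis is constant within each participant.
df <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = factor(rep(c("a", "b"), each = 9)),
  session = rep(1:3, times = 6)
)

set.seed(1)
folded <- fold(df, k = 3, cat_col = "diagnosis", id_col = "participant")

# Number of distinct folds each participant appears in
# (1 means all rows of that ID were kept together).
folds_per_id <- tapply(
  as.character(folded$.folds), folded$participant,
  function(f) length(unique(f))
)
folds_per_id

# Class counts per fold.
table(fold = folded$.folds, diagnosis = folded$diagnosis)
```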
- method
`"n_dist"`, `"n_fill"`, `"n_last"`, `"n_rand"`, `"greedy"`, or `"staircase"`.
Notice: examples are sizes of the generated groups
based on a vector with 57 elements.
n_dist (default)
Divides the data into a specified number of groups and
distributes excess data points across groups
(e.g. 11, 11, 12, 11, 12).
`k` is the number of groups.
n_fill
Divides the data into a specified number of groups and
fills up groups with excess data points from the beginning
(e.g. 12, 12, 11, 11, 11).
`k` is the number of groups.
n_last
Divides the data into a specified number of groups.
It finds the most equal group sizes possible,
using all data points. Only the last group is able to differ in size
(e.g. 11, 11, 11, 11, 13).
`k` is the number of groups.
n_rand
Divides the data into a specified number of groups.
Excess data points are placed randomly in groups (only 1 per group)
(e.g. 12, 11, 11, 11, 12).
`k` is the number of groups.
greedy
Divides up the data greedily given a specified group size
(e.g. 10, 10, 10, 10, 10, 7).
`k` is the group size.
staircase
Uses step size to divide up the data.
Group size increases by one step for every group,
until there is no more data
(e.g. 5, 10, 15, 20, 7).
`k` is the step size.
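The example sizes listed for the deterministic methods can be reproduced with a few lines of base R. The helper functions below are hypothetical and written only for illustration; they are not the package's implementation, and the rule used for `n_dist` is an assumption that happens to match the example sizes.

```r
n <- 57

# n_dist (assumed rule): element i goes to group ceiling(i * k / n),
# computed with integer arithmetic.
n_dist_sizes <- function(n, k) {
  tabulate((seq_len(n) * k + n - 1) %/% n, nbins = k)
}

# n_fill: the first (n %% k) groups receive one extra element.
n_fill_sizes <- function(n, k) {
  rest <- n %% k
  rep(n %/% k, k) + c(rep(1, rest), rep(0, k - rest))
}

# n_last: all groups get n %/% k elements; the last takes the remainder too.
n_last_sizes <- function(n, k) {
  base <- n %/% k
  c(rep(base, k - 1), n - base * (k - 1))
}

# greedy: fixed group size k; the last group holds what is left.
greedy_sizes <- function(n, k) {
  rest <- n %% k
  c(rep(k, n %/% k), if (rest > 0) rest)
}

# staircase: group sizes k, 2k, 3k, ... until the data runs out.
staircase_sizes <- function(n, k) {
  sizes <- numeric(0)
  step <- 1
  while (n > 0) {
    size <- min(step * k, n)
    sizes <- c(sizes, size)
    n <- n - size
    step <- step + 1
  }
  sizes
}

n_dist_sizes(n, 5)     # 11 11 12 11 12
n_fill_sizes(n, 5)     # 12 12 11 11 11
n_last_sizes(n, 5)     # 11 11 11 11 13
greedy_sizes(n, 10)    # 10 10 10 10 10 7
staircase_sizes(n, 5)  # 5 10 15 20 7
```

`n_rand` is left out because its group sizes depend on the random placement of the excess data points.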
- id_aggregation_fn
Function for aggregating values in `num_col` for each ID,
before balancing `num_col`.
N.B. Only used when `num_col` and `id_col` are both specified.
- extreme_pairing_levels
How many levels of extreme pairing to do
when balancing folds by a numerical column (i.e. `num_col`
is specified).
Extreme pairing: Rows/pairs are ordered as smallest, largest,
second smallest, second largest, etc. If `extreme_pairing_levels` > 1,
this is done "recursively" on the extreme pairs. See `Details/num_col`
for more.
N.B. Larger values work best with large datasets. If set too high,
the result might not be stochastic. Always check if an increase
actually makes the folds more balanced. See example.
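The ordering behind a single level of extreme pairing can be sketched in base R. This is only an illustration of the idea on a hypothetical vector `x` with an even number of values; it is not the package's implementation and it omits the later step of distributing the pairs into folds.

```r
# Hypothetical values to balance.
x <- c(7, 1, 9, 3, 5, 2, 8, 4)
n <- length(x)

# Row indices ordered from smallest to largest value.
ord <- order(x)

# Pair the i-th smallest value with the i-th largest value.
pair_id <- integer(n)
pair_id[ord] <- pmin(seq_len(n), rev(seq_len(n)))

# The pair sums are close to each other, which is what makes
# the pairs easy to balance across folds.
pair_sums <- tapply(x, pair_id, sum)
pair_sums  # pair 1: 10, pair 2: 10, pair 3: 10, pair 4: 9
```

With more than one level, the same pairing would be repeated on the pairs themselves (e.g. on their sums), which is the "recursive" step described above.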
- num_fold_cols
Number of fold columns to create.
Useful for repeated cross-validation.
If `num_fold_cols` > 1, columns will be named
`".folds_1"`, `".folds_2"`, etc.
Otherwise simply `".folds"`.
N.B. If `unique_fold_cols_only` is `TRUE`,
we can end up with fewer columns than specified, see `max_iters`.
N.B. If `data` has existing fold columns, see `handle_existing_fold_cols`.
- unique_fold_cols_only
Check if fold columns are identical and
keep only unique columns.
As the column comparisons can be time-consuming,
we can run this part in parallel. See `parallel`.
N.B. We can end up with fewer columns than specified in
`num_fold_cols`, see `max_iters`.
N.B. Only used when `num_fold_cols` > 1 or `data` has existing fold columns.
- max_iters
Maximum number of attempts at reaching `num_fold_cols` unique fold columns.
When only keeping unique fold columns, we risk having fewer columns than expected.
Hence, we repeatedly create the missing columns and remove those that are not unique.
This is done until we have `num_fold_cols` unique fold columns
or we have attempted `max_iters` times.
In some cases, it is not possible to create `num_fold_cols`
unique combinations of the dataset, e.g.
when specifying `cat_col`, `id_col` and `num_col`.
`max_iters` specifies when to stop trying.
Note that we can end up with fewer columns than specified in `num_fold_cols`.
N.B. Only used when `num_fold_cols` > 1.
- use_of_triplets
`"fill"`, `"instead"` or `"never"`.
When to use extreme triplet grouping in numerical balancing (when `num_col`
is specified).
fill (default)
When extreme pairing cannot create enough unique fold columns, use extreme triplet grouping
to create additional unique fold columns.
instead
Use extreme triplet grouping instead of extreme pairing. For some datasets, grouping in triplets
gives better balancing than grouping in pairs. This can be worth exploring when
numerical balancing is important.
Tip: Compare the balances with `summarize_balances()` and `ranked_balances()`.
never
Never use extreme triplet grouping.
Extreme triplet grouping
Similar to extreme pairing (see Details >> num_col), extreme triplet grouping
orders the rows as smallest, closest to the median, largest, second smallest, second
closest to the median, second largest, etc. Each triplet gets a group identifier
and we either perform recursive extreme triplet grouping on the identifiers or fold
the identifiers and transfer the fold IDs to the original rows.
For some datasets, this can give more balanced groups than extreme pairing, but
on average, extreme pairing works better. Because the rows are grouped into triplets instead of pairs,
the two approaches tend to create different groupings, so when creating many fold columns
and extreme pairing cannot create enough unique fold columns, we can create the remaining
(or at least some additional number) with extreme triplet grouping.
Extreme triplet grouping is implemented in `rearrr::triplet_extremes()`.
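One simple way to compare pairing and triplet grouping is to fold the same data with both settings and look at the per-fold sums of the balanced column. The sketch below assumes the `groupdata2` package is installed; the data frame and its `score` column are hypothetical, and base R is used for the comparison instead of the balance helpers mentioned in the tip.

```r
library(groupdata2)

# Hypothetical numeric column to balance.
set.seed(1)
df <- data.frame(score = runif(60))

# Extreme pairing only.
paired <- fold(df, k = 3, num_col = "score", use_of_triplets = "never")

# Extreme triplet grouping instead of pairing.
tripled <- fold(df, k = 3, num_col = "score", use_of_triplets = "instead")

# Per-fold sums of the balanced column; a smaller spread means better balance.
sums_paired <- tapply(paired$score, paired$.folds, sum)
sums_tripled <- tapply(tripled$score, tripled$.folds, sum)

diff(range(sums_paired))
diff(range(sums_tripled))
```

Which setting gives the smaller spread depends on the data, so the comparison is worth repeating on the actual dataset.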
- handle_existing_fold_cols
How to handle existing fold columns.
Either `"keep_warn"`, `"keep"`, or `"remove"`.
To add extra fold columns, use `"keep"` or `"keep_warn"`.
Note that existing fold columns might be renamed.
To replace the existing fold columns, use `"remove"`.
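The difference between keeping and removing existing fold columns can be sketched like this. The example assumes the `groupdata2` and `dplyr` packages are installed, and the data frame `df` is hypothetical.

```r
library(groupdata2)

# Hypothetical toy data with 30 rows.
df <- data.frame(x = 1:30)

set.seed(1)
first <- fold(df, k = 5, num_fold_cols = 2)
# Drop any grouping so the next calls are not applied group-wise.
first <- dplyr::ungroup(first)

# "keep": new fold columns are added alongside the existing ones.
kept <- fold(first, k = 5, num_fold_cols = 2,
             handle_existing_fold_cols = "keep")
grep("^\\.folds", colnames(kept), value = TRUE)

# "remove": the existing fold columns are replaced by the new one.
replaced <- fold(first, k = 5, handle_existing_fold_cols = "remove")
grep("^\\.folds", colnames(replaced), value = TRUE)
```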
- parallel
Whether to parallelize the fold column comparisons
when `unique_fold_cols_only` is `TRUE`.
Requires a registered parallel backend,
e.g. one registered with `doParallel::registerDoParallel`.
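A minimal sketch of registering a backend before folding, assuming the `groupdata2` and `doParallel` packages are installed. The number of cores and the data frame `df` are hypothetical choices for illustration.

```r
library(groupdata2)
library(doParallel)

# Register a parallel backend with 2 workers (hypothetical core count).
registerDoParallel(cores = 2)

# Hypothetical toy data with 30 rows.
df <- data.frame(x = 1:30)

set.seed(1)
folded <- fold(
  df, k = 5, num_fold_cols = 10,
  unique_fold_cols_only = TRUE,
  parallel = TRUE
)

# Release the implicitly created workers again.
stopImplicitCluster()

sum(grepl("^\\.folds", colnames(folded)))
```

For a small number of fold columns the overhead of the backend may outweigh the gain, so this is mainly of interest when many columns have to be compared.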