fold: Create balanced folds for cross-validation.

Description

Divides data into groups by a range of methods. Balances a given categorical variable between folds and keeps (if possible) all data points with a shared ID (e.g. participant_id) in the same fold.

Usage

fold(data, k = 5, cat_col = NULL, id_col = NULL, starts_col = NULL,
  method = "n_dist", remove_missing_starts = FALSE)

Arguments

data

Dataframe or Vector.

k

Dependent on method: the number of folds (default), the fold size, or another quantity (see method, where this value is referred to as n).

Given as whole number(s) and/or percentage(s) (0 < n < 1).

cat_col

Categorical variable to balance between folds.

E.g. when predicting a binary variable (a or b), it is necessary to have both represented in every fold.

N.B. If also passing an id_col, cat_col should be constant within each ID.

id_col

Factor with IDs. This will be used to keep all rows that share an ID in the same fold (if possible).

E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same fold.

starts_col

Name of column with values to match in method l_starts when data is a dataframe. Pass 'index' to use row names. (Character)

method

greedy, n_dist, n_fill, n_last, n_rand, l_sizes, l_starts, staircase, or primes.

Note: the examples show the sizes of the groups generated from a vector with 57 elements.

greedy

Divides up the data greedily given a specified group size (e.g. 10, 10, 10, 10, 10, 7).

n is group size
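The greedy sizes above can be reproduced with a minimal base R sketch. The helper greedy_sizes() is hypothetical and not part of groupdata2; it only illustrates the arithmetic, assuming each element i lands in group ceiling(i / n).

```r
# Hypothetical helper (not part of groupdata2): sizes of greedily filled groups.
greedy_sizes <- function(n_items, group_size) {
  # Element i goes into group ceiling(i / group_size)
  groups <- ceiling(seq_len(n_items) / group_size)
  # Count how many elements landed in each group
  tabulate(groups)
}

greedy_sizes(57, 10)
```

Every group is filled to the requested size before the next one starts, so only the last group can be smaller.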

n_dist (default)

Divides the data into a specified number of groups and distributes excess data points across groups (e.g. 11, 11, 12, 11, 12).

n is number of groups
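One way to arrive at the n_dist sizes in the example is to place element i in group ceiling(i * n / length). This is a sketch under that assumption; n_dist_sizes() is a hypothetical helper, and the package may compute the grouping differently.

```r
# Hypothetical helper (not part of groupdata2): spread excess points across groups.
n_dist_sizes <- function(n_items, n_groups) {
  # Scale each index onto the range (0, n_groups] and round up to a group number
  groups <- ceiling(seq_len(n_items) * n_groups / n_items)
  # Count elements per group, keeping one bin per requested group
  tabulate(groups, nbins = n_groups)
}

n_dist_sizes(57, 5)
```

With 57 elements and 5 groups, the two excess data points end up in the third and fifth groups rather than at either end.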

n_fill

Divides the data into a specified number of groups and fills up groups with excess data points from the beginning (e.g. 12, 12, 11, 11, 11).

n is number of groups

n_last

Divides the data into a specified number of groups. It finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size (e.g. 11, 11, 11, 11, 13).

n is number of groups
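The difference between n_fill and n_last is only where the excess data points go. The sketch below uses two hypothetical helpers (not part of groupdata2) to compute the sizes with integer division, assuming the base size is the length divided by the number of groups, rounded down.

```r
# Hypothetical helpers (not part of groupdata2).

# n_fill: the first groups each receive one excess data point
n_fill_sizes <- function(n_items, n_groups) {
  base <- n_items %/% n_groups
  excess <- n_items %% n_groups
  base + as.integer(seq_len(n_groups) <= excess)
}

# n_last: all excess data points go into the last group
n_last_sizes <- function(n_items, n_groups) {
  base <- n_items %/% n_groups
  c(rep(base, n_groups - 1), n_items - base * (n_groups - 1))
}

n_fill_sizes(57, 5)
n_last_sizes(57, 5)
```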

n_rand

Divides the data into a specified number of groups. Excess data points are placed randomly in groups (only 1 per group) (e.g. 12, 11, 11, 11, 12).

n is number of groups

l_sizes

Divides up the data by a list of group sizes. Excess data points are placed in an extra group at the end (e.g. n = list(0.2, 0.3) outputs groups with sizes (11, 17, 29)).

n is a list of group sizes
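The l_sizes example can be checked by hand: 0.2 * 57 = 11.4 and 0.3 * 57 = 17.1, and the remaining 29 data points form the extra group. The sketch below assumes percentages are rounded down, which is consistent with the example but not confirmed by this page; l_sizes_sizes() is a hypothetical helper.

```r
# Hypothetical helper (not part of groupdata2).
l_sizes_sizes <- function(n_items, sizes) {
  sizes <- unlist(sizes)
  # Values strictly between 0 and 1 are treated as percentages of the data
  # (assumption: rounded down to a whole number of data points)
  is_percentage <- sizes > 0 & sizes < 1
  sizes[is_percentage] <- floor(sizes[is_percentage] * n_items)
  # Whatever is left over forms an extra group at the end
  remaining <- n_items - sum(sizes)
  if (remaining > 0) sizes <- c(sizes, remaining)
  sizes
}

l_sizes_sizes(57, list(0.2, 0.3))
```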

l_starts

Starts new groups at specified values in the vector.

n is a list of starting positions. Skip values by c(value, skip_to_number) where skip_to_number is the nth appearance of the value in the vector. Groups automatically start from the first data point.

E.g. n = c(1,3,7,25,50) outputs groups with sizes (2,4,18,25,8).

To skip: given vector c("a", "e", "o", "a", "e", "o"), n = list("a", "e", c("o", 2)) outputs groups with sizes (1,4,1).

If passing n = 'auto' the starting positions are automatically found with find_starts().
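Both l_starts examples can be traced with a small base R sketch. The helper l_starts_sizes() is hypothetical and makes a simplifying assumption: the nth appearance of a value is counted over the whole vector, as the description above words it. It does not cover n = 'auto'.

```r
# Hypothetical helper (not part of groupdata2).
l_starts_sizes <- function(v, starts) {
  positions <- vapply(as.list(starts), function(s) {
    # c(value, skip_to_number): use the nth appearance; default is the first
    nth <- if (length(s) > 1) as.integer(s[2]) else 1L
    which(v == s[1])[nth]
  }, integer(1))
  # Fail early if a start value is not found in the vector
  stopifnot(!anyNA(positions))
  # Groups always start from the first data point
  positions <- sort(unique(c(1L, positions)))
  # A group runs from its start up to the next start (or the end of the vector)
  diff(c(positions, length(v) + 1L))
}

l_starts_sizes(1:57, c(1, 3, 7, 25, 50))
l_starts_sizes(c("a", "e", "o", "a", "e", "o"), list("a", "e", c("o", 2)))
```

In the second call, the group that starts at "e" runs until the second "o", which is why it holds four elements.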

staircase

Uses step size to divide up the data. Group size increases by one step for every group, until there is no more data (e.g. 5, 10, 15, 20, 7).

n is step size
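The staircase sizes can be sketched as a loop that grows the group size by the step until the data runs out. staircase_sizes() is a hypothetical helper, not part of groupdata2.

```r
# Hypothetical helper (not part of groupdata2).
staircase_sizes <- function(n_items, step) {
  sizes <- c()
  remaining <- n_items
  size <- step
  while (remaining > 0) {
    # The last group takes whatever is left if a full step does not fit
    take <- min(size, remaining)
    sizes <- c(sizes, take)
    remaining <- remaining - take
    size <- size + step
  }
  sizes
}

staircase_sizes(57, 5)
```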

primes

Uses prime numbers as group sizes. Group size increases to the next prime number until there is no more data (e.g. 5, 7, 11, 13, 17, 4).

n is the prime number to start at
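The primes method follows the same pattern as staircase, with the next prime number as the next group size. The helpers below are hypothetical and use simple trial division, which is adequate for illustration only.

```r
# Hypothetical helpers (not part of groupdata2).

# Trial division: x is prime if no integer from 2 to floor(sqrt(x)) divides it
is_prime <- function(x) {
  x > 1 && all(x %% seq_len(floor(sqrt(x)))[-1] != 0)
}

# Smallest prime strictly greater than x
next_prime <- function(x) {
  repeat {
    x <- x + 1
    if (is_prime(x)) return(x)
  }
}

primes_sizes <- function(n_items, start) {
  sizes <- c()
  remaining <- n_items
  size <- start
  while (remaining > 0) {
    # The last group takes whatever is left if the next prime does not fit
    take <- min(size, remaining)
    sizes <- c(sizes, take)
    remaining <- remaining - take
    size <- next_prime(size)
  }
  sizes
}

primes_sizes(57, 5)
```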

remove_missing_starts

Recursively remove elements from the list of starts that are not found. For method l_starts only. (Logical)

Value

Dataframe with grouping factor for subsetting in cross-validation.
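As a sketch of how such a grouping factor might be used, the loop below holds out one fold at a time and fits a model on the rest. The data frame, the lm() model, and the RMSE metric are illustrative assumptions; only the .folds column name is taken from the examples on this page.

```r
set.seed(1)

# Illustrative data with a .folds factor like the one fold() adds
folded <- data.frame(
  score = rnorm(12),
  age = rep(c(20, 30, 40, 50), 3),
  .folds = factor(rep(1:3, each = 4))
)

# One RMSE per fold: train on the other folds, test on the held-out fold
rmse <- sapply(levels(folded$.folds), function(f) {
  train <- folded[folded$.folds != f, ]
  test <- folded[folded$.folds == f, ]
  model <- lm(score ~ age, data = train)
  sqrt(mean((test$score - predict(model, newdata = test))^2))
})

rmse
```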

Details

cat_col: data is first subset by cat_col. Subsets are folded/grouped and merged.

id_col: folds are created from unique IDs.

cat_col AND id_col: data is subset by cat_col and folds are created from unique IDs in each subset. Subsets are merged.
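The combined cat_col and id_col case can be sketched in base R: subset by category, assign each unique ID to a fold, then write the fold back to every row of that ID. The helper fold_ids_within_cat() is hypothetical, and the shuffled round-robin assignment is an assumption made for brevity, not the method fold() uses.

```r
set.seed(1)

df <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = rep(c("a", "b", "a", "a", "b", "b"), each = 3)
)

# Hypothetical helper (not part of groupdata2).
fold_ids_within_cat <- function(data, k, cat_col, id_col) {
  data$.folds <- NA_integer_
  for (level in unique(data[[cat_col]])) {
    in_level <- data[[cat_col]] == level
    row_ids <- as.character(data[[id_col]][in_level])
    # Shuffle the unique IDs of this subset, then deal them out to folds 1..k
    ids <- sample(unique(row_ids))
    id_folds <- setNames(rep_len(seq_len(k), length(ids)), ids)
    # Every row inherits the fold of its ID
    data$.folds[in_level] <- id_folds[row_ids]
  }
  data$.folds <- factor(data$.folds)
  data
}

folded <- fold_ids_within_cat(df, 3, "diagnosis", "participant")
table(folded$diagnosis, folded$.folds)
```

Because folds are assigned per ID within each category, no participant is split across folds, and each fold receives participants from both diagnoses.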

Examples


# NOT RUN {
# Attach packages
library(groupdata2)
library(dplyr)

# Create dataframe
df <- data.frame(
 "participant" = factor(rep(c('1','2', '3', '4', '5', '6'), 3)),
 "age" = rep(sample(c(1:100), 6), 3),
 "diagnosis" = rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3),
 "score" = sample(c(1:100), 3*6))
df <- df[order(df$participant),]
df$session <- rep(c('1','2', '3'), 6)

# Using fold()
# Without cat_col and id_col
df_folded <- fold(df, 3, method = 'n_dist')

# With cat_col
df_folded <- fold(df, 3, cat_col = 'diagnosis',
 method = 'n_dist')

# With id_col
df_folded <- fold(df, 3, id_col = 'participant',
 method = 'n_dist')

# With cat_col and id_col
df_folded <- fold(df, 3, cat_col = 'diagnosis',
 id_col = 'participant', method = 'n_dist')

# Order by folds
df_folded <- df_folded[order(df_folded$.folds),]

# }
