partition: Create balanced partitions.

Description

Splits data into partitions. Balances a given categorical variable between partitions and keeps (if possible) all data points with a shared ID (e.g. participant_id) in the same partition.

Usage

partition(data, p = 0.2, cat_col = NULL, id_col = NULL,
  force_equal = FALSE, list_out = TRUE)

Arguments

data

Dataframe or Vector.

p

List / vector of partition sizes. Given as whole number(s) and/or percentage(s) (0 < p < 1). E.g. c(0.2, 3, 0.1).
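As a rough illustration of mixing whole numbers and percentages in the sizes vector, the sketch below builds a small toy dataframe (the column names and seed are made up for this example) and inspects how many rows each returned partition holds. The exact sizes depend on how the package rounds percentages, so no expected output is shown.

```r
library(groupdata2)

# Toy data, invented for illustration: 6 participants, 3 rows each.
set.seed(1)
df <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = rep(c("a", "b", "a", "a", "b", "b"), each = 3),
  score = sample(1:100, 18)
)

# Mix of percentages (0.2, 0.1) and a whole number (3).
parts <- partition(df, p = c(0.2, 3, 0.1))

# Number of rows in each returned partition.
sizes <- sapply(parts, nrow)
print(sizes)
```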

cat_col

Categorical variable to balance between partitions.

E.g. when training/testing a model for predicting a binary variable (a or b), it is necessary to have both represented in both the training set and the test set.

N.B. If also passing an id_col, cat_col should be constant within each ID.

id_col

Factor with IDs. Used to keep all rows that share an ID in the same partition (if possible).

E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same partition.
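One way to check that IDs were kept together is to collect the unique IDs of every partition and look for duplicates across partitions. This is only a sketch on toy data (column names and seed are assumptions made for the example); since IDs are kept together "if possible", the check may report an overlap for other data or sizes.

```r
library(groupdata2)

# Toy data, invented for illustration: 6 participants, 3 rows each.
set.seed(1)
df <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = rep(c("a", "b", "a", "a", "b", "b"), each = 3),
  score = sample(1:100, 18)
)

parts <- partition(df, p = 0.5, id_col = "participant")

# Unique participant IDs found in each partition.
ids_by_part <- lapply(parts, function(x) unique(as.character(x$participant)))

# TRUE if any participant shows up in more than one partition.
overlap <- any(duplicated(unlist(ids_by_part)))
print(overlap)
```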

force_equal

Whether to discard excess data. (Logical)

list_out

Whether to return partitions in a list. (Logical)

Value

If list_out is TRUE:

A list of partitions where partitions are dataframes.

If list_out is FALSE:

A dataframe with grouping factor for subsetting.
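The two return forms can be handled roughly as follows. This is a sketch on toy data (column names and seed are made up for the example). Rather than assuming the name of the grouping factor column, it is located by comparing column names before and after the call.

```r
library(groupdata2)

# Toy data, invented for illustration: 6 participants, 3 rows each.
set.seed(1)
df <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = rep(c("a", "b", "a", "a", "b", "b"), each = 3),
  score = sample(1:100, 18)
)

# list_out = TRUE: index the list to get one partition as a dataframe.
as_list <- partition(df, p = 0.5, list_out = TRUE)
first_part <- as_list[[1]]

# list_out = FALSE: one dataframe with an added grouping factor.
as_df <- partition(df, p = 0.5, list_out = FALSE)

# Locate the added column instead of hard-coding its name.
group_col <- setdiff(names(as_df), names(df))[1]

# Subset by the grouping factor, e.g. by splitting into a list.
by_group <- split(as.data.frame(as_df), as_df[[group_col]])
```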

Details

cat_col: data is first subset by cat_col. Each subset is grouped, and the subsets are then merged.

id_col: groups are created from unique IDs.

cat_col AND id_col: data is subset by cat_col and groups are created from unique IDs in each subset. Subsets are merged.
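The combined cat_col and id_col strategy can be sketched in base R as below. This illustrates the idea only and is not the package's implementation: sketch_partition() is a hypothetical helper, it handles a single percentage, and its rounding rule (round down, keep at least one ID) is an assumption.

```r
# Hypothetical helper: illustrates the strategy only, not groupdata2's code.
sketch_partition <- function(data, p, cat_col, id_col) {
  # 1. Subset the data by the categorical variable.
  subsets <- split(data, data[[cat_col]], drop = TRUE)

  assigned <- lapply(subsets, function(sub) {
    # 2. Within each subset, shuffle the unique IDs (not the rows).
    ids <- unique(as.character(sub[[id_col]]))
    ids <- ids[sample.int(length(ids))]
    # Assumption: round the share of IDs down, but keep at least one.
    n_first <- max(1, floor(p * length(ids)))
    first_ids <- ids[seq_len(n_first)]
    # 3. Every row inherits the partition of its ID.
    sub$part <- ifelse(as.character(sub[[id_col]]) %in% first_ids, 1L, 2L)
    sub
  })

  # 4. Merge the subsets back together.
  merged <- do.call(rbind, assigned)
  rownames(merged) <- NULL
  merged
}

# Toy data, invented for illustration: 6 participants, 3 rows each.
set.seed(1)
toy <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = rep(c("a", "b", "a", "a", "b", "b"), each = 3),
  score = sample(1:100, 18)
)

out <- sketch_partition(toy, p = 0.5,
                        cat_col = "diagnosis", id_col = "participant")

# Rows per participant and per diagnosis in each partition.
ids_per_part <- table(out$participant, out$part)
diag_per_part <- table(out$diagnosis, out$part)
```

Because IDs are assigned within each diagnosis subset, every participant lands in exactly one partition and both diagnoses appear in both partitions. This also shows why cat_col should be constant within each ID: an ID spanning two subsets would be assigned twice.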

Examples


# NOT RUN {
# Attach packages
library(groupdata2)
library(dplyr)

# Create dataframe
df <- data.frame(
 "participant" = factor(rep(c('1','2', '3', '4', '5', '6'), 3)),
 "age" = rep(sample(c(1:100), 6), 3),
 "diagnosis" = rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3),
 "score" = sample(c(1:100), 3*6))
df <- df[order(df$participant),]
df$session <- rep(c('1','2', '3'), 6)

# Using partition()
# Without cat_col and id_col
partitions <- partition(df, c(0.2,0.3))

# With cat_col
partitions <- partition(df, c(0.5), cat_col = 'diagnosis')

# With id_col
partitions <- partition(df, c(0.5), id_col = 'participant')

# With cat_col and id_col
partitions <- partition(df, c(0.5), cat_col = 'diagnosis',
                        id_col = 'participant')

# Return dataframe with grouping factor
# with list_out = FALSE
partitions <- partition(df, c(0.5), list_out = FALSE)

# }
