groupdata2

R package: Subsetting methods for balanced cross-validation, time series windowing, and general grouping and splitting of data.

By Ludvig R. Olsen, Cognitive Science, Aarhus University. Started in Oct. 2016

Contact at: r-pkgs@ludvigolsen.dk

Main functions:

  • group_factor
  • group
  • splt
  • partition
  • fold

Other tools:

  • find_starts
  • %staircase%
  • %primes%

Installation

CRAN version:

install.packages("groupdata2")

Development version:

install.packages("devtools")
devtools::install_github("LudvigOlsen/groupdata2")

Vignettes

groupdata2 contains a number of vignettes with relevant use cases and descriptions.

vignette(package='groupdata2') # for an overview
vignette("introduction_to_groupdata2") # begin here

Functions

group_factor()

Returns a factor with group numbers, e.g. (1,1,1,2,2,2,3,3,3).

This can be used to subset, aggregate, group_by, etc.

Create equally sized groups by setting force_equal = TRUE

Randomize grouping factor by setting randomize = TRUE
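
To illustrate what such a grouping factor is used for, here is a minimal base-R sketch. The factor is written out by hand rather than produced by group_factor(), and the values vector is made up for the example.

```r
# Made-up data and a hand-written grouping factor (1,1,1,2,2,2,3,3,3)
values <- c(4, 8, 6, 1, 3, 2, 9, 7, 5)
groups <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3))

# Aggregate: mean of the values within each group
group_means <- tapply(values, groups, mean)

# Subset: keep only the values belonging to group 2
second_group <- values[groups == 2]

print(group_means)
print(second_group)
```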

group()

Returns the given data as a dataframe with an added grouping factor made with group_factor(). The dataframe is grouped by the grouping factor for easy use with dplyr pipelines.

splt()

Creates the specified groups with group_factor() and splits the given data by the grouping factor with base::split. Returns the splits in a list.
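
The splitting step can be sketched in base R as follows. The grouping factor is written by hand here as a stand-in for the output of group_factor(), and the dataframe is made up for the example.

```r
# Made-up dataframe and a hand-written grouping factor
df <- data.frame(x = 1:6, y = letters[1:6])
groups <- factor(c(1, 1, 2, 2, 3, 3))

# base::split returns a list with one dataframe per group
splits <- split(df, groups)

print(names(splits))
print(splits[["2"]])
```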

partition()

Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on one categorical variable and/or make sure that all datapoints sharing an ID are in the same partition.
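
The idea of keeping all datapoints with the same ID in one partition can be sketched in base R by sampling IDs instead of rows. This is only an illustration with made-up data; it does not balance on a categorical variable and is not the algorithm partition() uses.

```r
set.seed(1)

# Made-up data: 6 participants with 3 rows each
df <- data.frame(participant = rep(1:6, each = 3), score = 1:18)

# Sample whole participants (not rows) for the test set
ids <- unique(df$participant)
test_ids <- sample(ids, size = 2)

test_set <- df[df$participant %in% test_ids, ]
train_set <- df[!df$participant %in% test_ids, ]

# No participant appears in both partitions
print(intersect(test_set$participant, train_set$participant))
```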

fold()

Creates (optionally) balanced folds for use in cross-validation. Balance folds on one categorical variable and/or make sure that all datapoints sharing an ID are in the same fold.
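
A rough base-R sketch of balancing folds on a category while assigning folds per ID: within each category level, the IDs are given fold numbers in shuffled round-robin order. The data is made up, and this is an assumption about one possible approach, not the implementation of fold().

```r
set.seed(1)

# Made-up data: one row per participant with a diagnosis
ids <- data.frame(
  participant = 1:6,
  diagnosis = c("a", "b", "a", "b", "b", "a")
)
k <- 3

# Within each diagnosis, hand out fold numbers 1..k in random order
ids$.folds <- NA_integer_
for (level in unique(ids$diagnosis)) {
  rows <- which(ids$diagnosis == level)
  ids$.folds[rows] <- sample(rep_len(seq_len(k), length(rows)))
}

# Each fold holds one participant per diagnosis
print(table(ids$.folds, ids$diagnosis))
```

Every row of a full dataset would then inherit the fold of its participant, so a participant never appears in more than one fold.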

Methods

There are currently 9 methods available. They can be divided into 5 categories.

Examples of group sizes are based on a vector with 57 elements.

Specify group size

Method: greedy

Divides up the data greedily given a specified group size.

E.g. group sizes: 10, 10, 10, 10, 10, 7
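
The sizes above follow from integer division. A small base-R sketch, where greedy_sizes is a hypothetical helper and not part of groupdata2:

```r
# Hypothetical helper: group sizes when chunking n_items greedily
greedy_sizes <- function(n_items, group_size) {
  full <- n_items %/% group_size   # number of full groups
  rest <- n_items %% group_size    # leftover elements, if any
  c(rep(group_size, full), if (rest > 0) rest)
}

sizes <- greedy_sizes(57, 10)
print(sizes)
```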

Specify number of groups

Method: n_dist (Default)

Divides the data into a specified number of groups and distributes excess data points across groups.

E.g. group sizes: 11, 11, 12, 11, 12
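
One way to reproduce these sizes is to assign element i to group ceiling(i * n / N), where N is the number of elements and n the number of groups. The base-R sketch below uses integer arithmetic for that ceiling; it is an assumption that matches the example, not necessarily how groupdata2 computes it.

```r
n_items <- 57L
n_groups <- 5L

# Integer form of ceiling(i * n_groups / n_items)
i <- seq_len(n_items)
group_id <- (i * n_groups + n_items - 1L) %/% n_items

# Count the elements per group
sizes <- tabulate(group_id, nbins = n_groups)
print(sizes)
```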

Method: n_fill

Divides the data into a specified number of groups and fills up groups with excess data points from the beginning.

E.g. group sizes: 12, 12, 11, 11, 11
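
The arithmetic behind these sizes, as a base-R sketch (variable names are illustrative): every group gets the floor of N / n, and the first groups each get one of the excess elements.

```r
n_items <- 57L
n_groups <- 5L

base_size <- n_items %/% n_groups   # 11
excess <- n_items %% n_groups       # 2

# The first `excess` groups receive one extra element each
sizes <- rep(base_size, n_groups) + (seq_len(n_groups) <= excess)
print(sizes)
```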

Method: n_last

Divides the data into a specified number of groups. The algorithm finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size.

E.g. group sizes: 11, 11, 11, 11, 13
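
A base-R sketch that reproduces the example, assuming every group except the last gets the floor of N / n and the last group takes whatever remains. That assumption is consistent with the sizes shown but is not taken from the package source.

```r
n_items <- 57L
n_groups <- 5L

base_size <- n_items %/% n_groups                 # 11
last_size <- n_items - base_size * (n_groups - 1L) # 13

sizes <- c(rep(base_size, n_groups - 1L), last_size)
print(sizes)
```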

Method: n_rand

Divides the data into a specified number of groups. Excess data points are placed randomly in groups (only 1 per group).

E.g. group sizes: 12, 11, 11, 11, 12
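
The random placement can be sketched in base R by sampling which groups receive an excess element, without replacement so that no group gets more than one. This illustrates the idea only; the package may place the excess differently.

```r
set.seed(1)
n_items <- 57L
n_groups <- 5L

base_size <- n_items %/% n_groups
excess <- n_items %% n_groups

# Pick `excess` distinct groups at random and give each one extra element
lucky <- sample(n_groups, excess)
sizes <- rep(base_size, n_groups)
sizes[lucky] <- sizes[lucky] + 1L
print(sizes)
```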

Specify list

Method: l_sizes

Uses a list / vector of group sizes to divide up the data. Excess data points are placed in an extra group.

E.g. n = c(11, 11) returns group sizes: 11, 11, 35
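
In base R, the same bookkeeping might look like the sketch below: the leftover elements form one extra group, and rep() expands the sizes into a grouping factor. Variable names are illustrative.

```r
n_items <- 57
requested <- c(11, 11)

# Elements not covered by the requested sizes go into an extra group
leftover <- n_items - sum(requested)
sizes <- c(requested, if (leftover > 0) leftover)

# Expand the sizes into a grouping factor with one entry per element
groups <- factor(rep(seq_along(sizes), times = sizes))
print(sizes)
print(table(groups))
```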

Method: l_starts

Uses a list of starting positions to divide up the data. Starting positions are values in a vector (e.g. column in dataframe). Skip to a specific nth appearance of a value by using c(value, skip_to).

E.g. n = c(11, 15, 27, 43) returns group sizes: 10, 4, 12, 16, 15

Identical to n = list(11, 15, c(27, 1), 43) where 1 specifies that we want the first appearance of 27 after the previous value 15.

If n = 'auto' is passed, starting positions are found automatically with find_starts().
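
The example sizes can be checked with a simplified base-R sketch. It assumes the data is the vector 1:57, that each start value occurs exactly once, and that the first start value is not the first element; skipping to the nth appearance is not handled here.

```r
v <- 1:57
starts <- c(11, 15, 27, 43)

# Position of each start value in the data
pos <- match(starts, v)

# Group sizes are the gaps between consecutive group boundaries;
# the first group runs from element 1 up to the first start value
sizes <- diff(c(1, pos, length(v) + 1))
print(sizes)
```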

Specify step size

Method: staircase

Uses step_size to divide up the data. Group size increases by step_size for every group, until there is no more data.

E.g. group sizes: 5, 10, 15, 20, 7
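
A base-R sketch of the staircase sizes for a step size of 5. staircase_sizes is a hypothetical helper written for this illustration, not a groupdata2 function.

```r
# Hypothetical helper: group sizes grow by step_size per group;
# the last group takes whatever data remains
staircase_sizes <- function(n_items, step_size) {
  sizes <- numeric(0)
  remaining <- n_items
  step <- 1
  while (remaining > 0) {
    size <- min(step * step_size, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    step <- step + 1
  }
  sizes
}

sizes <- staircase_sizes(57, 5)
print(sizes)
```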

Specify start at

Method: primes

Creates groups with sizes corresponding to prime numbers. Starts at n (a prime number) and increases to the next prime number until there is no more data.

E.g. group sizes: 5, 7, 11, 13, 17, 4
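
The sizes for a start at 5 can be reproduced with the base-R sketch below. is_prime and primes_sizes are hypothetical helpers for illustration and are not part of groupdata2.

```r
# Hypothetical helper: trial division primality check
is_prime <- function(k) {
  k > 1 && all(k %% seq_len(floor(sqrt(k)))[-1] != 0)
}

# Hypothetical helper: group sizes follow consecutive primes from start_at;
# the last group takes whatever data remains
primes_sizes <- function(n_items, start_at) {
  sizes <- numeric(0)
  remaining <- n_items
  p <- start_at
  while (remaining > 0) {
    size <- min(p, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    p <- p + 1
    while (!is_prime(p)) p <- p + 1   # advance to the next prime
  }
  sizes
}

sizes <- primes_sizes(57, 5)
print(sizes)
```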

Examples

# Attach packages
library(groupdata2)
library(dplyr)
library(knitr)
# Create dataframe
df <- data.frame("x"=c(1:12),
  "species" = rep(c('cat','pig', 'human'), 4),
  "age" = sample(c(1:100), 12))

group()

# Using group()
group(df, n = 5, method = 'n_dist') %>%
  kable()
x   species  age  .groups
1   cat      81   1
2   pig      64   1
3   human    48   2
4   cat      24   2
5   pig      60   3
6   human    1    3
7   cat      37   3
8   pig      74   4
9   human    76   4
10  cat      47   5
11  pig      83   5
12  human    68   5

# Using group() with dplyr pipeline to get mean age
df %>%
  group(n = 5, method = 'n_dist') %>%
  dplyr::summarise(mean_age = mean(age)) %>%
  kable()
.groups  mean_age
1        72.50000
2        36.00000
3        32.66667
4        75.00000
5        66.00000

# Using group() with 'l_starts' method
# Starts group at the first 'cat', 
# then skips to the second appearance of "pig" after "cat",
# then starts at the following "cat".
df %>%
  group(n = list("cat", c("pig",2), "cat"), 
        method = 'l_starts',
        starts_col = "species") %>%
  kable()
x   species  age  .groups
1   cat      81   1
2   pig      64   1
3   human    48   1
4   cat      24   1
5   pig      60   2
6   human    1    2
7   cat      37   3
8   pig      74   3
9   human    76   3
10  cat      47   3
11  pig      83   3
12  human    68   3

fold()

# Create dataframe
df <- data.frame(
  "participant" = factor(rep(c('1','2', '3', '4', '5', '6'), 3)),
  "age" = rep(c(20,23,27,21,32,31), 3),
  "diagnosis" = rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3),
  "score" = c(10,24,15,35,24,14,24,40,30,50,54,25,45,67,40,78,62,30))
df <- df[order(df$participant),]
df$session <- rep(c('1','2', '3'), 6)
# Using fold() 

# First set seed to ensure reproducibility
set.seed(1)

# Use fold() with cat_col and id_col
df_folded <- fold(df, k = 3, cat_col = 'diagnosis',
                  id_col = 'participant', method = 'n_dist')

# Show df_folded ordered by folds
df_folded[order(df_folded$.folds),] %>%
  kable()
participant  age  diagnosis  score  session  .folds
1            20   a          10     1        1
1            20   a          24     2        1
1            20   a          45     3        1
4            21   b          35     1        1
4            21   b          50     2        1
4            21   b          78     3        1
6            31   a          14     1        2
6            31   a          25     2        2
6            31   a          30     3        2
5            32   b          24     1        2
5            32   b          54     2        2
5            32   b          62     3        2
3            27   a          15     1        3
3            27   a          30     2        3
3            27   a          40     3        3
2            23   b          24     1        3
2            23   b          40     2        3
2            23   b          67     3        3

# Show distribution of diagnoses and participants
df_folded %>% 
  group_by(.folds) %>% 
  count(diagnosis, participant) %>% 
  kable()
.folds  diagnosis  participant  n
1       a          1            3
1       b          4            3
2       a          6            3
2       b          5            3
3       a          3            3
3       b          2            3

Notice that we now have the opportunity to include the session variable and/or use participant as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.

We also have a balance in the representation of each diagnosis, which could give us better, more consistent results.

Monthly Downloads: 2,198
Version: 1.0.0
License: MIT + file LICENSE
Maintainer: Ludvig Renbo Olsen
Last Published: October 22nd, 2017

Functions in groupdata2 (1.0.0)

  • find_missing_starts: Find start positions that cannot be found in data.
  • find_starts: Find start positions of groups in data.
  • group_factor: Create grouping factor for subsetting your data.
  • groupdata2: A package for creating groups from data.
  • partition: Create balanced partitions.
  • splt: Split data by a range of methods.
  • fold: Create balanced folds for cross-validation.
  • %primes%: Find remainder from primes method.
  • %staircase%: Find remainder from staircase method.
  • group: Create groups from your data.