collapse_groups: Collapse groups with categorical, numerical, ID, and size balancing

Description

[Experimental]

Collapses a set of groups into a smaller set of groups.

Attempts to balance the new groups by specified numerical columns, categorical columns, level counts in ID columns, and/or the number of rows (size).

Note: The more of these you balance at a time, the less balanced each of them may become. While the balancing works better on average than no balancing, this is not guaranteed on every run. Enabling `auto_tune` can yield a much better overall balance in most contexts. It generates a larger set of group columns using all combinations of the balancing columns and selects the most balanced group column(s). This is slower, so we recommend enabling parallelization (see `parallel`).

While this balancing algorithm will not be optimal in all cases, it allows balancing a large number of columns at once. Especially with auto-tuning enabled, this can be very powerful.

Tip: Check the balances of the new groups with summarize_balances() and ranked_balances().

Note: The categorical and ID balancing algorithms are different to those in fold() and partition().

Usage

collapse_groups(
  data,
  n,
  group_cols,
  cat_cols = NULL,
  cat_levels = NULL,
  num_cols = NULL,
  id_cols = NULL,
  balance_size = TRUE,
  auto_tune = FALSE,
  weights = NULL,
  method = "balance",
  group_aggregation_fn = mean,
  num_new_group_cols = 1,
  unique_new_group_cols_only = TRUE,
  max_iters = 5,
  extreme_pairing_levels = 1,
  combine_method = "avg_standardized",
  col_name = ".coll_groups",
  parallel = FALSE,
  verbose = TRUE
)

Value

data.frame with one or more new grouping factors.

Arguments

data

data.frame. Can be grouped, in which case the function is applied group-wise.

n

Number of new groups.

When `num_new_group_cols` > 1, `n` can also be a vector with one `n` per new group column. This allows trying multiple `n` settings at a time. Note that the generated group columns are not guaranteed to be in the order of `n`.

group_cols

Names of factors in `data` for identifying the existing groups that should be collapsed.

Multiple names are treated as in dplyr::group_by() (i.e., a hierarchy of groups), where each leaf group within each parent group is considered a unique group to be collapsed. Parent groups are not considered during collapsing, so leaf groups from different parent groups can be collapsed together.

Note: Do not confuse these group columns with potential columns that `data` is grouped by. `group_cols` identifies the groups to be collapsed. When `data` is grouped with dplyr::group_by(), the function is applied separately to each of those subsets.

cat_cols

Names of categorical columns in which to balance the average frequency of one or more levels.

cat_levels

Names of the levels in the `cat_cols` columns to balance the average frequencies of. When `NULL` (default), all levels are balanced. Can be weights indicating the balancing importance of each level (within each column).

The weights are automatically scaled to sum to 1.

Can be ".minority" or ".majority", in which case the minority/majority level is found and used.

When `cat_cols` has single column name:

Either a vector with level names or a named numeric vector with weights:

E.g. c("dog", "pidgeon", "mouse") or c("dog" = 5, "pidgeon" = 1, "mouse" = 3)

When `cat_cols` has multiple column names:

A named list with vectors for each column name in `cat_cols`. When not providing a vector for a `cat_cols` column, all levels are balanced in that column.

E.g. list("col1" = c("dog" = 5, "pidgeon" = 1, "mouse" = 3), "col2" = c("hydrated", "dehydrated")).

num_cols

Names of numerical columns to balance between groups.

id_cols

Names of factor columns with IDs to balance the counts of between groups.

E.g. useful to get a similar number of participants in each group.

balance_size

Whether to balance the size of the collapsed groups. (logical)

auto_tune

Whether to create a larger set of collapsed group columns from all combinations of the balancing dimensions and select the overall most balanced group column(s).

This tends to create much more balanced collapsed group columns.

Can be slow, so we recommend enabling parallelization (see `parallel`).

weights

Named vector with balancing importance weights for each of the balancing columns. Besides the columns in `cat_cols`, `num_cols`, and `id_cols`, the size balancing weight can be given as "size".

The weights are automatically scaled to sum to 1.

Dimensions that are not given a weight are automatically given the weight 1.

E.g. c("size" = 1, "cat" = 1, "num1" = 4, "num2" = 7, "id" = 2).
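
As a rough illustration of the two rules above (missing dimensions default to 1, then everything is scaled to sum to 1), here is a minimal base R sketch. It is not the package's internal code: `scale_weights()` and `dims` are hypothetical names, and applying the default before the scaling is an assumption.

```r
# Hypothetical helper (not part of groupdata2):
# fill in missing weights with 1, then scale all weights to sum to 1
scale_weights <- function(weights, dims) {
  full <- setNames(rep(1, length(dims)), dims)  # default weight is 1
  full[names(weights)] <- weights               # overwrite the given ones
  full / sum(full)                              # scale to sum to 1
}

# "size" and "cat" are not given a weight and thus get weight 1
dims <- c("size", "cat", "num1", "num2", "id")
w <- scale_weights(c("num1" = 4, "num2" = 7, "id" = 2), dims)
print(round(w, 3))
```

Only the relative sizes of the weights matter, so c("num1" = 4, "num2" = 7) and c("num1" = 8, "num2" = 14) would express the same preference between those two columns.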

method

"balance", "ascending", or "descending":

After calculating a combined balancing column from each of the balancing columns (see Details > Balancing columns):

"balance" balances the combined balancing column between the groups.
"ascending" orders the combined balancing column and groups from the lowest to highest value.
"descending" orders the combined balancing column and groups from the highest to lowest value.

group_aggregation_fn

Function for aggregating values in the `num_cols` columns for each group in `group_cols`.

Default is mean(), where the average value(s) are balanced across the new groups.

When using sum(), the groups will have similar sums across the new groups.

N.B. Only used when `num_cols` is specified.

num_new_group_cols

Number of group columns to create.

When `num_new_group_cols` > 1, columns are named with a combination of `col_name` and "_1", "_2", etc. E.g. ".coll_groups_1", ".coll_groups_2", ...

N.B. When `unique_new_group_cols_only` is `TRUE`, we may end up with fewer columns than specified, see `max_iters`.

unique_new_group_cols_only

Whether to only return unique new group columns.

As the number of column comparisons can be quite time consuming, we recommend enabling parallelization. See `parallel`.

N.B. We can end up with fewer columns than specified in `num_new_group_cols`, see `max_iters`.

N.B. Only used when `num_new_group_cols` > 1.

max_iters

Maximum number of attempts at reaching `num_new_group_cols` unique new group columns.

When only keeping unique new group columns, we risk having fewer columns than expected. Hence, we repeatedly create the missing columns and remove those that are not unique. This is done until we have `num_new_group_cols` unique group columns or we have attempted `max_iters` times.
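
The retry loop described above could look roughly like the following base R sketch. Everything here is hypothetical: `make_group_col()` is a random stand-in for one collapsing run, and uniqueness is checked with exact equality, which is a simplification (it does not detect columns that only differ by a relabelling of the groups).

```r
# Hypothetical stand-in for one collapsing run:
# randomly assign `n_groups` existing groups to `n` new groups
make_group_col <- function(n_groups, n) {
  sample(rep_len(seq_len(n), n_groups))
}

# Keep creating the missing columns until we have `num_cols`
# unique ones or have tried `max_iters` times
collect_unique_cols <- function(n_groups, n, num_cols, max_iters = 5) {
  cols <- list()
  for (iter in seq_len(max_iters)) {
    missing <- num_cols - length(cols)
    if (missing <= 0) break
    new_cols <- replicate(
      missing, make_group_col(n_groups, n), simplify = FALSE
    )
    # Remove exact duplicates (simplified uniqueness check)
    cols <- unique(c(cols, new_cols))
  }
  cols
}

set.seed(1)
cols <- collect_unique_cols(n_groups = 8, n = 3, num_cols = 4)
length(cols)

# With 2 existing groups and n = 2 there are only two possible
# columns in this sketch, so we end up with fewer than requested
few <- collect_unique_cols(n_groups = 2, n = 2, num_cols = 5)
length(few)
```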

In some cases, it is not possible to create `num_new_group_cols` unique combinations of the dataset. `max_iters` specifies when to stop trying. Note that we can end up with fewer columns than specified in `num_new_group_cols`.

N.B. Only used when `num_new_group_cols` > 1.

extreme_pairing_levels

How many levels of extreme pairing to do when balancing the groups by the combined balancing column (see Details).

Extreme pairing: Rows/pairs are ordered as smallest, largest, second smallest, second largest, etc. If extreme_pairing_levels > 1, this is done "recursively" on the extreme pairs.

N.B. Larger values work best with large datasets. If set too high, the result might not be stochastic. Always check if an increase actually makes the groups more balanced.

combine_method

Method to combine the balancing columns by. One of "avg_standardized" or "avg_min_max_scaled".

For each balancing column (all columns in num_cols, cat_cols, and id_cols, plus size), we calculate a normalized, numeric group summary column, which indicates the "size" of each group in that dimension. These are then combined to a single combined balancing column.

The three steps are:

1. Calculate a numeric representation of the balance for each column. E.g. the number of unique levels within each group of an ID column (see Details > Balancing columns for more on this).
2. Normalize each column separately with standardization ("avg_standardized"; Default) or MinMax scaling to the [0, 1] range ("avg_min_max_scaled").
3. Average the columns rowwise to get a single column with one value per group. The averaging is weighted by `weights`, which is useful when one of the dimensions is more important to get a good balance of.

`combine_method` chooses whether to use standardization or MinMax scaling in step 2.
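
To make the difference between the two normalizations concrete, here is a small base R sketch. The per-group values are made up for illustration, and `standardize()` and `min_max_scale()` are hypothetical helpers, not the package's internal functions.

```r
# Hypothetical per-group summaries for two balancing dimensions
size <- c(34, 23, 56, 41)
num  <- c(1.3, 4.6, 7.2, 1.4)

# Standardization: mean 0 and standard deviation 1 per column
standardize <- function(x) (x - mean(x)) / sd(x)

# MinMax scaling: smallest value becomes 0, largest becomes 1
min_max_scale <- function(x) (x - min(x)) / (max(x) - min(x))

std <- cbind(size = standardize(size), num = standardize(num))
mm  <- cbind(size = min_max_scale(size), num = min_max_scale(num))

print(round(std, 2))
print(round(mm, 2))
```

Both variants put the columns on a comparable scale before averaging. Standardization keeps the values unbounded, whereas MinMax scaling pins each column's extremes to 0 and 1.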

col_name

Name of the new group column. When creating multiple new group columns (`num_new_group_cols`>1), this is the prefix for the names, which will be suffixed with an underscore and a number (_1, _2, _3, etc.).

parallel

Whether to parallelize the group column comparisons when `unique_new_group_cols_only` is `TRUE`.

Especially highly recommended when `auto_tune` is enabled.

Requires a registered parallel backend, e.g. via doParallel::registerDoParallel().

verbose

Whether to print information about the process. May make the function slightly slower.

N.B. Currently only used during auto-tuning.

Author

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Details

The goal of collapse_groups() is to combine existing groups to a lower number of groups while (optionally) balancing one or more numeric, categorical and/or ID columns, along with the group size.

For each of these columns (and size), we calculate a normalized, numeric "balancing column" that, when balanced between the groups, leads to its original column being balanced as well.

To balance multiple columns at once, we combine their balancing columns with weighted averaging (see `combine_method` and `weights`) to a single combined balancing column.

Finally, we create groups where this combined balancing column is balanced between the groups, using the numerical balancing in fold().

Auto-tuning

This strategy is not guaranteed to produce balanced groups in all contexts, e.g. when the balancing columns cancel out. To increase the probability of balanced groups, we can produce multiple group columns with all combinations of the balancing columns and select the overall most balanced group column(s). We refer to this as auto-tuning (see `auto_tune`).

We find the overall most balanced group column by ranking the across-group standard deviations for each of the balancing columns, as found with summarize_balances().

Example of finding the overall most balanced group column(s):

Given a group column with the following average age per group: `c(16, 18, 25, 21)`, the standard deviation hereof (3.92) is a measure of how balanced the age column is. Another group column can thus have a lower/higher standard deviation and be considered more/less balanced.

We find the rankings of these standard deviations for all the balancing columns and average them (again weighted by `weights`). We select the group column(s) with the, on average, highest rank (i.e. lowest standard deviations).
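
The ranking idea can be sketched in base R as follows. The standard deviation of the age example is computed directly. The across-group SDs for the three candidate group columns are made-up numbers, and using rank 1 for the lowest SD (most balanced) is a convention chosen for this sketch.

```r
# Standard deviation of the average ages from the example above
sd_age <- sd(c(16, 18, 25, 21))
print(round(sd_age, 2))

# Hypothetical across-group SDs for three candidate group columns
sds <- data.frame(
  age   = c(3.92, 1.50, 2.80),
  score = c(4.10, 5.00, 2.00),
  row.names = c(".coll_groups_1", ".coll_groups_2", ".coll_groups_3")
)

# Hypothetical balancing weights, scaled to sum to 1
weights <- c(age = 1, score = 1)
weights <- weights / sum(weights)

# Rank within each dimension (here: 1 = lowest SD = most balanced)
ranks <- sapply(sds, rank)
rownames(ranks) <- rownames(sds)

# Weighted average rank per candidate group column
avg_rank <- drop(ranks %*% weights[colnames(ranks)])
print(avg_rank)

# The candidate with the best (here: smallest) average rank
best <- names(which.min(avg_rank))
print(best)
```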

Checking balances

We highly recommend using summarize_balances() and ranked_balances() to check how balanced the created groups are on the various dimensions. When applying ranked_balances() to the output of summarize_balances(), we get a data.frame with the standard deviations for each balancing dimension (lower means more balanced), ordered by the average rank (see Examples).

Balancing columns

The following describes the creation of the balancing columns for each of the supported column types:

cat_cols

For each column in `cat_cols`:

Count each level within each group. This creates a data.frame with one count column per level, with one row per group.
Standardize the count columns.
Average the standardized counts rowwise to create one combined column representing the balance of the levels for each group. When cat_levels contains weights for each of the levels, we apply weighted averaging.

Example: Consider a factor column with the levels c("A", "B", "C"). We count each level per group, normalize the counts and combine them with weighted averaging:

Group	A	B	C	->	nA	nB	nC	->	Combined
1	5	57	1	|	0.24	0.55	-0.77	|	0.007
2	7	69	2	|	0.93	0.64	-0.77	|	0.267
3	2	34	14	|	-1.42	0.29	1.34	|	0.07
4	5	0	4	|	0.24	-1.48	0.19	|	-0.35
...	...	...	...	|	...	...	...	|	...
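
The three steps for a categorical column can be sketched in base R as below. The data are made up, and the code only mirrors the description above rather than the package's internal implementation. The level weights reuse the style of the `cat_levels` examples.

```r
# Hypothetical data: existing group and a categorical column
group <- factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4))
level <- factor(c("A", "B", "B", "A", "A", "B", "C", "B", "C", "C", "A", "C"))

# Step 1: count each level within each group (one row per group)
counts <- unclass(table(group, level))

# Step 2: standardize each count column
std_counts <- scale(counts)

# Step 3: average rowwise to get one value per group
combined <- rowMeans(std_counts)

# Optional: weighted averaging with (scaled) level weights
level_weights <- c(A = 5, B = 1, C = 3)
level_weights <- level_weights / sum(level_weights)
combined_weighted <- drop(std_counts %*% level_weights[colnames(std_counts)])

print(round(combined, 3))
print(round(combined_weighted, 3))

# The first "Combined" value in the table is the plain average
# of that row's normalized counts
print(round(mean(c(0.24, 0.55, -0.77)), 3))
```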

id_cols

For each column in `id_cols`:

Count the unique IDs (levels) within each group. (Note: The same ID can be counted in multiple groups.)

num_cols

For each column in `num_cols`:

Aggregate the numeric columns by group using the `group_aggregation_fn`.

size

Count the number of rows per group.
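
The ID, numeric, and size summaries are simple per-group aggregations. A minimal base R sketch with made-up data (the column names are hypothetical):

```r
# Hypothetical data with an existing group column
df <- data.frame(
  group = factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3)),
  id    = factor(c("p1", "p1", "p2", "p2", "p3", "p1", "p3", "p4", "p4")),
  score = c(10, 20, 30, 40, 60, 5, 15, 25, 35)
)

# id_cols: number of unique IDs per group
# (e.g. "p1" is counted in both group 1 and group 3)
n_ids <- tapply(df$id, df$group, function(x) length(unique(x)))

# num_cols: aggregate per group with the chosen function
num_mean <- tapply(df$score, df$group, mean)
num_sum  <- tapply(df$score, df$group, sum)

# size: number of rows per group
size <- table(df$group)

print(n_ids)
print(num_mean)
print(num_sum)
print(size)
```

Swapping mean() for sum() changes what gets balanced: with sum(), larger groups contribute larger values, so the numeric column becomes entangled with group size.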

Combining balancing columns

Apply standardization or MinMax scaling to each of the balancing columns (see `combine_method`).
Perform weighted averaging to get a single balancing column (see `weights`).

Example: We apply standardization and perform weighted averaging:

Group	Size	Num	Cat	ID	->	nSize	nNum	nCat	nID	->	Combined
1	34	1.3	0.007	3	|	-0.33	-0.82	0.03	-0.46	|	-0.395
2	23	4.6	0.267	4	|	-1.12	0.34	1.04	0.0	|	0.065
3	56	7.2	0.07	7	|	1.27	1.26	0.28	1.39	|	1.05
4	41	1.4	-0.35	2	|	0.18	-0.79	-1.35	-0.93	|	-0.723
...	...	...	...	...	|	...	...	...	...	|	...
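
The "Combined" column in the table above is the equally weighted rowwise average of the normalized columns. The base R sketch below reproduces it from the four listed rows and then shows the effect of a made-up weight vector that emphasizes the numeric column.

```r
# Normalized columns copied from the four listed rows of the table
normalized <- cbind(
  nSize = c(-0.33, -1.12, 1.27, 0.18),
  nNum  = c(-0.82, 0.34, 1.26, -0.79),
  nCat  = c(0.03, 1.04, 0.28, -1.35),
  nID   = c(-0.46, 0.00, 1.39, -0.93)
)

# Equal weights reproduce the "Combined" column (up to rounding)
equal_w <- rep(1 / 4, 4)
combined <- drop(normalized %*% equal_w)
print(combined)

# Hypothetical weights that emphasize the numeric column
w <- c(nSize = 1, nNum = 4, nCat = 1, nID = 1)
w <- w / sum(w)
combined_w <- drop(normalized %*% w[colnames(normalized)])
print(round(combined_w, 3))
```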

Creating the groups

Finally, we get to the group creation. There are three methods for creating groups based on the combined balancing column: "balance" (default), "ascending", and "descending".

`method` is "balance"

To create groups that are balanced by the combined balancing column, we use the numerical balancing in fold().

The following describes the numerical balancing in broad terms:

1. Rows are shuffled. Note that this will only affect rows with the same value in the combined balancing column.
2. Extreme pairing 1: Rows are ordered as smallest, largest, second smallest, second largest, etc. Each small+large pair gets an extreme-group identifier. (See rearrr::pair_extremes())
3. If `extreme_pairing_levels` > 1: These extreme-group identifiers are reordered as smallest, largest, second smallest, second largest, etc., by the sum of the combined balancing column in the represented rows. These pairs (of pairs) get a new set of extreme-group identifiers, and the process is repeated `extreme_pairing_levels` - 2 times. Note that the extreme-group identifiers at the last level will represent 2^`extreme_pairing_levels` rows, so be careful when choosing a larger setting.
4. The extreme-group identifiers from the last pairing are randomly divided into the final groups and these final identifiers are transferred to the original rows.

N.B. When doing extreme pairing of an odd number of rows, the row with the smallest value is placed in a group by itself, and the order is instead: (smallest), (second smallest, largest), (third smallest, second largest), etc.

A similar approach with extreme triplets (i.e. smallest, closest to median, largest, second smallest, second closest to median, second largest, etc.) may also be utilized in some scenarios. (See rearrr::triplet_extremes())

Example: We order the data.frame by smallest "Num" value, largest "Num" value, second smallest, and so on. We could further (when `extreme_pairing_levels` > 1) find the sum of "Num" for each pair and perform extreme pairing on the pairs. Finally, we group the data.frame:

Group	Num	->	Group	Num	Pair	->	New group
1	-0.395	|	5	-1.23	1	|	3
2	0.065	|	3	1.05	1	|	3
3	1.05	|	4	-0.723	2	|	1
4	-0.723	|	2	0.065	2	|	1
5	-1.23	|	1	-0.395	3	|	2
6	-0.15	|	6	-0.15	3	|	2
...	...	|	...	...	...	|	...
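
The pairing in the example above can be reproduced with a small base R stand-in for the extreme pairing step. `pair_extremes_sketch()` is a hypothetical helper written for illustration. It is not rearrr::pair_extremes() and skips the shuffling and the final random division into groups.

```r
# Hypothetical helper: one level of extreme pairing.
# Returns a pair identifier for each element of `x`.
pair_extremes_sketch <- function(x) {
  n <- length(x)
  ranks <- seq_len(n)
  # Odd count: the smallest value forms pair 1 by itself
  lone <- if (n %% 2 == 1) 1L else 0L
  paired <- ranks[ranks > lone]
  half <- length(paired) / 2
  pair_of_rank <- integer(n)
  if (lone == 1L) pair_of_rank[1] <- 1L
  # i-th smallest and i-th largest of the paired ranks share an identifier
  pair_of_rank[paired] <- c(seq_len(half), rev(seq_len(half))) + lone
  # Map the identifiers from sorted order back to the original order
  pairs <- integer(n)
  pairs[order(x)] <- pair_of_rank
  pairs
}

# The "Num" values of groups 1-6 from the example
num <- c(-0.395, 0.065, 1.05, -0.723, -1.23, -0.15)
pairs <- pair_extremes_sketch(num)
print(pairs)

# Second level: pair the pairs by the sum of their values
pair_sums <- tapply(num, pairs, sum)
level2 <- pair_extremes_sketch(as.vector(pair_sums))
print(level2)
```

With three pairs at the second level, the count is odd, so the pair with the smallest sum ends up alone, as described in the N.B. above.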

`method` is "ascending" or "descending"

These methods order the data by the combined balancing column and create groups such that the sums get increasingly larger ("ascending") or smaller ("descending"). This will in turn lead to a pattern of increasing/decreasing sums in the balancing columns (e.g. increasing/decreasing counts of the categorical levels, counts of IDs, number of rows and sums of numeric columns).

Examples

# Attach packages
library(groupdata2)
library(dplyr)

# Set seed
if (requireNamespace("xpectr", quietly = TRUE)){
  xpectr::set_test_seed(42)
}

# Create data frame
df <- data.frame(
  "participant" = factor(rep(1:20, 3)),
  "age" = rep(sample(c(1:100), 20), 3),
  "answer" = factor(sample(c("a", "b", "c", "d"), 60, replace = TRUE)),
  "score" = sample(c(1:100), 20 * 3)
)
df <- df %>% dplyr::arrange(participant)
df$session <- rep(c("1", "2", "3"), 20)

# Sample rows to get unequal sizes per participant
df <- dplyr::sample_n(df, size = 53)

# Create the initial groups (to be collapsed)
df <- fold(
  data = df,
  k = 8,
  method = "n_dist",
  id_col = "participant"
)

# Ungroup the data frame
# Otherwise `collapse_groups()` would be
# applied to each fold separately!
df <- dplyr::ungroup(df)

# NOTE: Make sure to check the examples with `auto_tune`
# at the end, as this is where the magic lies

# Collapse to 3 groups with size balancing
# Creates new `.coll_groups` column
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  balance_size = TRUE # enabled by default
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = ".coll_groups",
  cat_cols = 'answer',
  num_cols = c('score', 'age'),
  id_cols = 'participant'
))

# Get ranked balances
# NOTE: When we only have a single new group column
# we don't get ranks - but this is good to use
# when comparing multiple group columns!
# The scores are standard deviations across groups
ranked_balances(coll_summary)

# Collapse to 3 groups with size + *categorical* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  balance_size = TRUE,
  num_new_group_cols = 2
)

# Check balances
# To simplify the output, we only find the
# balance of the `answer` column
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  cat_cols = 'answer'
))

# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
# Rows are ranked by most to least balanced
# (i.e. lowest average SD rank)
ranked_balances(coll_summary)

# Collapse to 3 groups with size + categorical + *numerical* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  num_cols = "score",
  balance_size = TRUE,
  num_new_group_cols = 2
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  cat_cols = 'answer',
  num_cols = 'score'
))

# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)

# Collapse to 3 groups with size and *ID* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  id_cols = "participant",
  balance_size = TRUE,
  num_new_group_cols = 2
)

# Check balances
# To simplify the output, we only find the
# balance of the `participant` column
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  id_cols = 'participant'
))

# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)

###################
#### Auto-tune ####

# As you might have seen, the balancing does not always
# perform as well as we might want or need
# To get a better balance, we can enable `auto_tune`
# which will create a larger set of collapsings
# and select the most balanced new group columns
# While it is not required, we recommend
# enabling parallelization

if (FALSE) {
# Uncomment for parallelization
# library(doParallel)
# doParallel::registerDoParallel(7) # use 7 cores

# Collapse to 3 groups with lots of balancing
# We enable `auto_tune` to get a more balanced set of columns
# We create 10 new `.coll_groups_1/2/...` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  num_cols = "score",
  id_cols = "participant",
  balance_size = TRUE,
  num_new_group_cols = 10,
  auto_tune = TRUE,
  parallel = FALSE # Set to TRUE for parallelization!
)

# Check balances
# We check the balance of all three balanced
# columns (`answer`, `score`, and `participant`)
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:10),
  cat_cols = "answer",
  num_cols = "score",
  id_cols = 'participant'
))

# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)

# Now we can choose the .coll_groups_* column(s)
# that we favor the balance of
# and move on with our lives!
}
