group: Create groups from your data.

Description

Divides data into groups using one of several methods. Creates a grouping factor with 1s for group 1, 2s for group 2, etc. Returns a dataframe grouped by the grouping factor for easy use in dplyr pipelines.

Usage

group(data, n, method = "n_dist", starts_col = NULL, force_equal = FALSE,
  allow_zero = FALSE, return_factor = FALSE, descending = FALSE,
  randomize = FALSE, col_name = ".groups", remove_missing_starts = FALSE)

Arguments

data

Dataframe or Vector.

n

Dependent on method.

Number of groups (default), group size, list of group sizes, list of group starts, step size or prime number to start at. See method.

Passed as whole number(s) and/or percentage(s) (0 < n < 1) and/or character.

Method l_starts allows 'auto'.

method

greedy, n_dist, n_fill, n_last, n_rand, l_sizes, l_starts, staircase, or primes.

Note: The examples show the sizes of the groups generated from a vector with 57 elements.

greedy

Divides up the data greedily given a specified group size (e.g. 10, 10, 10, 10, 10, 7).

n is group size
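The sizes in the greedy example follow from integer division. A minimal base-R sketch, assuming 57 data points; greedy_sizes() is a hypothetical helper written for illustration and is not part of groupdata2:

```r
# Hypothetical helper: group sizes when splitting greedily by a fixed size.
greedy_sizes <- function(n_obs, size) {
  full <- n_obs %/% size   # number of complete groups
  rest <- n_obs %% size    # data points left over for the last group
  c(rep(size, full), if (rest > 0) rest)
}

greedy_sizes(57, 10)  # 10 10 10 10 10 7
```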

n_dist (default)

Divides the data into a specified number of groups and distributes excess data points across groups (e.g. 11, 11, 12, 11, 12).

n is number of groups
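One way to reproduce the n_dist example sizes is to scale each index into the range of group numbers and round up. This is a sketch of the arithmetic under that assumption, not necessarily how groupdata2 computes the factor; the variable names are illustrative:

```r
n_obs <- 57
n_groups <- 5

# Map position i to group ceiling(i * n_groups / n_obs).
groups_dist <- ceiling(seq_len(n_obs) * n_groups / n_obs)
sizes_dist <- as.vector(table(groups_dist))

sizes_dist  # 11 11 12 11 12
```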

n_fill

Divides the data into a specified number of groups and fills up groups with excess data points from the beginning (e.g. 12, 12, 11, 11, 11).

n is number of groups
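The n_fill sizes can be sketched as a base size plus one extra data point for each of the first few groups. The names below are illustrative and the code is not taken from groupdata2:

```r
n_obs <- 57
n_groups <- 5

base_size <- n_obs %/% n_groups   # 11
extra <- n_obs %% n_groups        # 2 excess data points

# The first `extra` groups each receive one additional data point.
sizes_fill <- base_size + (seq_len(n_groups) <= extra)

sizes_fill  # 12 12 11 11 11
```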

n_last

Divides the data into a specified number of groups. It finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size (e.g. 11, 11, 11, 11, 13).

n is number of groups
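For n_last, a minimal sketch consistent with the example gives every group the same base size and adds all excess data points to the last group. The variable names are illustrative, not groupdata2 internals:

```r
n_obs <- 57
n_groups <- 5

base_size <- n_obs %/% n_groups   # 11
extra <- n_obs %% n_groups        # 2 excess data points

# All groups get the base size; the last group also absorbs the excess.
sizes_last <- c(rep(base_size, n_groups - 1), base_size + extra)

sizes_last  # 11 11 11 11 13
```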

n_rand

Divides the data into a specified number of groups. Excess data points are placed randomly in groups, at most 1 per group (e.g. 12, 11, 11, 11, 12).

n is number of groups
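The random placement in n_rand can be sketched by sampling, without replacement, which groups receive one extra data point. The seed and variable names are assumptions made for illustration; which groups end up larger depends on the random draw:

```r
set.seed(1)  # only to make the sketch reproducible

n_obs <- 57
n_groups <- 5

base_size <- n_obs %/% n_groups   # 11
extra <- n_obs %% n_groups        # 2 excess data points

# Pick `extra` distinct groups; each gets exactly one additional data point.
lucky <- sample(n_groups, extra)
sizes_rand <- rep(base_size, n_groups)
sizes_rand[lucky] <- sizes_rand[lucky] + 1

sizes_rand
```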

l_sizes

Divides up the data by a list of group sizes. Excess data points are placed in an extra group at the end (e.g. n = list(0.2, 0.3) outputs groups with sizes (11, 17, 29)).

n is a list of group sizes
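The l_sizes example can be checked by converting percentages to counts and putting the remainder in a final group. Rounding down is an assumption here (it matches the example, where 0.2 * 57 = 11.4 and 0.3 * 57 = 17.1); the variable names are illustrative:

```r
n_obs <- 57
n <- c(0.2, 0.3)

# Values below 1 are treated as percentages of the data; others as counts.
sizes <- ifelse(n < 1, floor(n * n_obs), n)

# Whatever is left over forms an extra group at the end.
rest <- n_obs - sum(sizes)
sizes_l <- c(sizes, if (rest > 0) rest)

sizes_l  # 11 17 29
```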

l_starts

Starts new groups at specified values of vector.

n is a list of starting positions. To skip to a later appearance of a value, pass c(value, skip_to_number), where skip_to_number is the nth appearance of the value in the vector. Groups automatically start from the first data point.

E.g. n = c(1, 3, 7, 25, 50) outputs groups with sizes (2, 4, 18, 25, 8).

To skip: given vector c("a", "e", "o", "a", "e", "o"), n = list("a", "e", c("o", 2)) outputs groups with sizes (1, 4, 1).

If passing n = 'auto', the starting positions are automatically found with find_starts().
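Both l_starts examples can be reproduced with a small base-R sketch. find_start_positions() is a hypothetical helper, not a groupdata2 function, and it assumes that appearances are counted from just after the previous start onward. It does not handle start values that are missing from the vector:

```r
# Hypothetical helper: locate each start value, searching forward
# from the position after the previous start.
find_start_positions <- function(v, starts) {
  pos <- integer(0)
  from <- 1
  for (s in starts) {
    value <- s[1]
    nth <- if (length(s) > 1) as.integer(s[2]) else 1L  # c(value, nth) skips
    hits <- which(v == value)
    hits <- hits[hits >= from]
    pos <- c(pos, hits[nth])
    from <- hits[nth] + 1
  }
  pos
}

# Turn start positions into group sizes.
sizes_from_starts <- function(v, starts) {
  pos <- find_start_positions(v, starts)
  as.vector(table(findInterval(seq_along(v), pos)))
}

sizes_numeric <- sizes_from_starts(1:57, c(1, 3, 7, 25, 50))
sizes_numeric  # 2 4 18 25 8

sizes_skip <- sizes_from_starts(c("a", "e", "o", "a", "e", "o"),
                                list("a", "e", c("o", 2)))
sizes_skip  # 1 4 1
```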

staircase

Uses step size to divide up the data. Group size increases by one step for each new group, until there is no more data (e.g. 5, 10, 15, 20, 7).

n is step size
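The staircase sizes can be sketched with a loop that grows the group size by one step per group and truncates the final group to the data that remains. staircase_sizes() is a hypothetical helper for illustration only:

```r
# Hypothetical helper: group sizes of step, 2 * step, 3 * step, ...
# with the last group cut off when the data runs out.
staircase_sizes <- function(n_obs, step) {
  sizes <- numeric(0)
  remaining <- n_obs
  k <- 1
  while (remaining > 0) {
    size <- min(k * step, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    k <- k + 1
  }
  sizes
}

staircase_sizes(57, 5)  # 5 10 15 20 7
```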

primes

Uses prime numbers as group sizes. Group size increases to the next prime number until there is no more data (e.g. 5, 7, 11, 13, 17, 4).

n is the prime number to start at
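The primes example can be reproduced with a simple trial-division check. is_prime() and prime_sizes() are hypothetical helpers written for this sketch, and the sketch assumes the starting value is itself prime:

```r
# Trial division: TRUE if x has no divisor between 2 and floor(sqrt(x)).
is_prime <- function(x) {
  x > 1 && all(x %% seq_len(floor(sqrt(x)))[-1] != 0)
}

# Hypothetical helper: consecutive primes as group sizes,
# with the last group cut off when the data runs out.
prime_sizes <- function(n_obs, start) {
  sizes <- numeric(0)
  remaining <- n_obs
  p <- start
  while (remaining > 0) {
    size <- min(p, remaining)
    sizes <- c(sizes, size)
    remaining <- remaining - size
    p <- p + 1
    while (!is_prime(p)) p <- p + 1   # advance to the next prime
  }
  sizes
}

prime_sizes(57, 5)  # 5 7 11 13 17 4
```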

starts_col

Name of column with values to match in method l_starts when data is a dataframe. Pass 'index' to use row names. (Character)

force_equal

Create equal groups by discarding excess data points. Implementation varies between methods. (Logical)

allow_zero

Whether n can be passed as 0. (Logical)

return_factor

Return only the grouping factor. (Logical)

descending

Change direction of method. (Not fully implemented) (Logical)

randomize

Randomize the grouping factor. (Logical)

col_name

Name of the added grouping factor. (Character)

remove_missing_starts

Recursively remove elements from the list of starts that are not found. For method l_starts only. (Logical)

Value

Dataframe grouped by the new grouping factor.

Examples


# NOT RUN {
# Attach packages
library(groupdata2)
library(dplyr)

# Create dataframe
df <- data.frame("x" = c(1:12),
                 "species" = rep(c('cat', 'pig', 'human'), 4),
                 "age" = sample(c(1:100), 12))

# Using group()
df_grouped <- group(df, 5, method = 'n_dist')

# Using group() with dplyr pipeline to get mean age
df_means <- df %>%
 group(5, method = 'n_dist') %>%
 dplyr::summarise(mean_age = mean(age))

# Using group() with l_starts
# "c('pig',2)" skips to the second appearance of
# "pig" after the first appearance of "cat"
df_grouped <- group(df,
                    list('cat', c('pig',2), 'human'),
                    method = 'l_starts',
                    starts_col = 'species')

# }
