pair_extremes: Pair extreme values and sort by the pairs

Description

[Experimental]

The values are paired/grouped such that the lowest and highest values form the first group, the second lowest and the second highest values form the second group, and so on. The values are then sorted by these groups/pairs.

When `data` has an odd number of rows, `unequal_method` determines which group should have only 1 element.

The *_vec() version takes and returns a vector.

Example:

The column values:

c(1, 2, 3, 4, 5, 6)

Create the sorting factor:

c(1, 2, 3, 3, 2, 1)

And are ordered as:

c(1, 6, 2, 5, 3, 4)
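
The pairing in this example can be reproduced with a few lines of base R. This is a minimal sketch, not the package's implementation: `extreme_pair_factor()` is a hypothetical helper that assumes an even number of elements and breaks ties by position.

```r
# Hypothetical helper (not part of rearrr): pair extreme values
# of an even-length numeric vector.
extreme_pair_factor <- function(x) {
  stopifnot(length(x) %% 2 == 0)
  # Rank the values from lowest (1) to highest (n)
  r <- rank(x, ties.method = "first")
  # Lowest and highest share ID 1, second lowest and second highest ID 2, ...
  pmin(r, length(x) + 1 - r)
}

x <- c(1, 2, 3, 4, 5, 6)
pair_ids <- extreme_pair_factor(x)
pair_ids

# Sort by pair ID, then by value within each pair
sorted_x <- x[order(pair_ids, x)]
sorted_x
```

For these values, `pair_ids` and `sorted_x` match the sorting factor and the order listed in the example.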

Usage

pair_extremes(
  data,
  col = NULL,
  unequal_method = "middle",
  num_pairings = 1,
  balance = "mean",
  order_by_aggregates = FALSE,
  shuffle_members = FALSE,
  shuffle_pairs = FALSE,
  factor_name = ifelse(num_pairings == 1, ".pair", ".pairing"),
  overwrite = FALSE
)

pair_extremes_vec(
  data,
  unequal_method = "middle",
  num_pairings = 1,
  balance = "mean",
  order_by_aggregates = FALSE,
  shuffle_members = FALSE,
  shuffle_pairs = FALSE
)

Value

The sorted data.frame (tibble) / vector. Optionally with the sorting factor added.

When `data` is a vector and `factor_name` is `NULL` (no sorting factor is kept), the output will be a vector. Otherwise, a data.frame.

Arguments

data

data.frame or vector.

col

Column to create sorting factor by. When `NULL` and `data` is a data.frame, the row numbers are used.

unequal_method

Method for dealing with an unequal number of rows/elements in `data`.

One of: "first", "middle" or "last".

first

The first group will have size 1.

Example:

The ordered column values:

c(1, 2, 3, 4, 5)

Create the sorting factor:

c(1, 2, 3, 3, 2)

And are ordered as:

c(1, 2, 5, 3, 4)

middle

The middle group will have size 1.

Example:

The ordered column values:

c(1, 2, 3, 4, 5)

Create the sorting factor:

c(1, 3, 2, 3, 1)

And are ordered as:

c(1, 5, 3, 2, 4)

last

The last group will have size 1.

Example:

The ordered column values:

c(1, 2, 3, 4, 5)

Create the sorting factor:

c(1, 2, 2, 1, 3)

And are ordered as:

c(1, 4, 2, 3, 5)
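
The three `unequal_method` options can be compared side by side with a small base R sketch. `odd_pair_factor()` is a hypothetical helper, not a rearrr function, and it is only checked against the 5-element examples above. How the "middle" identifier is chosen for longer inputs is an assumption.

```r
# Hypothetical helper (not part of rearrr): pair IDs for n sorted values,
# where n is odd.
odd_pair_factor <- function(n, unequal_method = c("middle", "first", "last")) {
  unequal_method <- match.arg(unequal_method)
  stopifnot(n %% 2 == 1)
  idx <- seq_len((n - 1) / 2)
  if (unequal_method == "first") {
    # The lowest value is alone in group 1; the rest are paired from group 2
    c(1, c(idx, rev(idx)) + 1)
  } else if (unequal_method == "last") {
    # The highest value is alone in the last group
    c(idx, rev(idx), length(idx) + 1)
  } else {
    # Assumption: the singleton gets the middle group ID and
    # later pair IDs shift up by one to make room
    mid_id <- ceiling(length(idx) / 2) + 1
    idx[idx >= mid_id] <- idx[idx >= mid_id] + 1
    c(idx, mid_id, rev(idx))
  }
}

x <- c(1, 2, 3, 4, 5)

# Sorting factor for each method
factors <- lapply(
  c(first = "first", middle = "middle", last = "last"),
  function(m) odd_pair_factor(length(x), m)
)

# Values ordered by sorting factor, then by value within each group
orders <- lapply(factors, function(f) x[order(f, x)])
orders
```

For the five values above, `factors` and `orders` reproduce the sorting factors and orderings listed under "first", "middle" and "last".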

num_pairings

Number of pairings to perform (recursively). At least 1.

Based on `balance`, the secondary pairings perform extreme pairing on either the sum, absolute difference, min, or max of the pair elements.

balance

What to balance pairs for in a given secondary pairing. Either "mean", "spread", "min", or "max". Can be a single string used for all secondary pairings or one for each secondary pairing (`num_pairings` - 1).

The first pairing always pairs the actual element values.

mean

Pairs have similar means. The values in the pairs from the previous pairing are aggregated with `sum()` and paired.

spread

Pairs have similar spread (e.g. standard deviations). The values in the pairs from the previous pairing are aggregated with `sum(abs(diff()))` and paired.

min / max

Pairs have similar minimum / maximum values. The values in the pairs from the previous pairing are aggregated with `min()` / `max()` and paired.
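
To make the aggregation concrete, the base R sketch below computes the four aggregates for the pairs from a first pairing and then pairs the `sum()` aggregates by their extremes. It only illustrates the description above and is not the package's implementation; the data and variable names are made up for the example.

```r
# Made-up, already sorted values (8 elements give 4 pairs)
x <- c(1, 2, 4, 7, 11, 16, 22, 29)
r <- seq_along(x)

# First pairing: lowest with highest, second lowest with second highest, ...
pair <- pmin(r, length(x) + 1 - r)

# One aggregate per pair for each `balance` option
aggregates <- list(
  mean   = tapply(x, pair, sum),
  spread = tapply(x, pair, function(v) sum(abs(diff(v)))),
  min    = tapply(x, pair, min),
  max    = tapply(x, pair, max)
)
aggregates

# Secondary pairing for balance = "mean": pair the extreme sums
sum_rank <- rank(aggregates$mean, ties.method = "first")
second_pair <- pmin(sum_rank, length(sum_rank) + 1 - sum_rank)
second_pair
```

Here the pair with the largest sum (pair 1) is grouped with the pair with the smallest sum (pair 4), which is what evens out the means of the resulting groups of four.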

order_by_aggregates

Whether to order the pairs from initial pairings (first `num_pairings` - 1) by their aggregate values instead of their pair identifiers.

N.B. Only used when `num_pairings` > 1.

shuffle_members

Whether to shuffle the order of the group members within the groups. (Logical)

shuffle_pairs

Whether to shuffle the order of the pairs. Pair members remain together. (Logical)
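
The difference between the two shuffling options can be sketched in base R. This is an illustration under simplifying assumptions (a vector that is already sorted by pairs, with known pair identifiers), not the package's implementation; all variable names are made up.

```r
set.seed(1)

# Values already sorted by extreme pairs, with their pair IDs
x <- c(1, 6, 2, 5, 3, 4)
pair <- c(1, 1, 2, 2, 3, 3)

# Like shuffle_members: random order within each pair, pair order unchanged
members_shuffled <- x[order(pair, runif(length(x)))]
members_shuffled

# Like shuffle_pairs: random pair order, members stay adjacent
new_pair_order <- sample(unique(pair))
pairs_shuffled <- x[order(match(pair, new_pair_order))]
pairs_shuffled
```

In both results, every consecutive pair of elements still consists of the same two values; only their position within the pair, or the position of the pair as a whole, changes.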

factor_name

Name of new column with the sorting factor. If `NULL`, no column is added.

overwrite

Whether to allow overwriting of existing columns. (Logical)

Author

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Examples

# Attach packages
library(rearrr)
library(dplyr)

# Set seed
set.seed(1)

# Create a data frame
df <- data.frame(
  "index" = 1:10,
  "A" = sample(1:10),
  "B" = runif(10),
  "C" = LETTERS[1:10],
  "G" = c(
    1, 1, 1, 2, 2,
    2, 3, 3, 3, 3
  ),
  stringsAsFactors = FALSE
)

# Pair extreme indices (row numbers)
pair_extremes(df)

# Pair extremes in each of the columns
pair_extremes(df, col = "A")$A
pair_extremes(df, col = "B")$B
pair_extremes(df, col = "C")$C

# Shuffle the members pair-wise
# The rows within each pair are shuffled
# while the `.pair` column maintains its order
pair_extremes(df, col = "A", shuffle_members = TRUE)

# Shuffle the order of the pairs
# The rows within each pair maintain their order
# and stay together but the `.pair` column is shuffled
pair_extremes(df, col = "A", shuffle_pairs = TRUE)

# Use recursive pairing
# Mostly meaningful with much larger datasets
# Order initial grouping by pair identifiers
pair_extremes(df, col = "A", num_pairings = 2)
# Order initial grouping by aggregate values
pair_extremes(df, col = "A", num_pairings = 2, order_by_aggregates = TRUE)

# Grouped by G
# G groups 1 and 2 have only 3 elements each,
# so each creates 1 pair and a group
# with the single excess element.
# G group 3 has 4 elements and creates 2 pairs
df %>%
  dplyr::select(G, A) %>% # For clarity
  dplyr::group_by(G) %>%
  pair_extremes(col = "A")

# Plot the extreme pairs
plot(
  x = 1:10,
  y = pair_extremes(df, col = "B")$B,
  col = as.character(rep(1:5, each = 2))
)
# With shuffled pair members (run a few times)
plot(
  x = 1:10,
  y = pair_extremes(df, col = "B", shuffle_members = TRUE)$B,
  col = as.character(rep(1:5, each = 2))
)
# With shuffled pairs (run a few times)
plot(
  x = rep(1:5, each = 2),
  y = pair_extremes(df, col = "B", shuffle_pairs = TRUE)$B,
  col = as.character(rep(1:5, each = 2))
)
