Learn R Programming

cheapr (version 1.1.0)

as_discrete: Turn continuous data into discrete bins

Description

This is a cheapr version of cut.numeric() which is more efficient and prioritises pretty-looking breaks by default through the use of get_breaks(). Out-of-bounds values can be included naturally through the include_oob argument. Left-closed (right-open) intervals are returned by default in contrast to cut's default right-closed intervals. Furthermore there is flexibility in formatting the interval bins, allowing the user to specify formatting functions and symbols for the interval close and open symbols.

Usage

as_discrete(x, ...)

# S3 method for numeric as_discrete( x, breaks = if (left_closed) get_breaks(x) else cheapr_rev(-get_breaks(-x)), left_closed = TRUE, include_endpoint = FALSE, include_oob = FALSE, ordered = FALSE, intv_start_fun = prettyNum, intv_end_fun = prettyNum, intv_closers = c("[", "]"), intv_openers = c("(", ")"), intv_sep = ",", inf_label = NULL, ... )

# S3 method for integer64 as_discrete(x, ...)

Value

A factor of discrete bins (intervals of start/end pairs).

Arguments

x

A numeric vector.

...

Extra arguments passed onto methods.

breaks

Break-points. The default option creates pretty looking breaks. Unlike cut(), the breaks arg cannot be a number denoting the number of breaks you want. To generate breakpoints this way use get_breaks().

left_closed

Left-closed intervals or right-closed intervals?

include_endpoint

Include endpoint? Default is FALSE.

include_oob

Include out-of-bounds values? Default is FALSE. This is equivalent to breaks = c(breaks, Inf) or breaks = c(-Inf, breaks) when left_closed = FALSE. If include_endpoint = TRUE, the endpoint interval is prioritised before the out-of-bounds interval. This behaviour cannot be replicated easily with cut(). For example, these 2 expressions are not equivalent:

cut(10, c(9, 10, Inf), right = F, include.lowest = T) !=
as_discrete(10, c(9, 10), include_endpoint = T, include_oob = T)

ordered

Should result be an ordered factor? Default is FALSE.

intv_start_fun

Function used to format interval start points.

intv_end_fun

Function used to format interval end points.

intv_closers

A length 2 character vector denoting the symbol to use for closing either left or right closed intervals.

intv_openers

A length 2 character vector denoting the symbol to use for opening either left or right closed intervals.

intv_sep

A length 1 character vector used to separate the start and end points.

inf_label

Label to use for intervals that include infinity. If left NULL the Unicode infinity symbol is used.

See Also

bin get_breaks

Examples

Run this code
library(cheapr)

# `as_discrete()` is very similar to `cut()`
# but more flexible as it allows you to supply
# formatting functions and symbols for the discrete bins

# Here is an example of how to use the formatting functions to
# categorise age groups nicely

ages <- 1:100

age_group <- function(x, breaks){
  age_groups <- as_discrete(
    x,
    breaks = breaks,
    intv_sep = "-",
    intv_end_fun = function(x) x - 1,
    intv_openers = c("", ""),
    intv_closers = c("", ""),
    include_oob = TRUE,
    ordered = TRUE
  )

  # Below is just renaming the last age group

  lvls <- levels(age_groups)
  n_lvls <- length(lvls)
  max_ages <- paste0(max(breaks), "+")
  attr(age_groups, "levels") <- c(lvls[-n_lvls], max_ages)
  age_groups
}

age_group(ages, seq(0, 80, 20))
age_group(ages, seq(0, 25, 5))
age_group(ages, 5)

# To closely replicate `cut()` with `as_discrete()` we can use the following

cheapr_cut <- function(x, breaks, right = TRUE,
                       include.lowest = FALSE,
                       ordered.result = FALSE){
  if (length(breaks) == 1){
    breaks <- get_breaks(x, breaks, pretty = FALSE,
                         expand_min = FALSE, expand_max = FALSE)
    adj <- diff(range(breaks)) * 0.001
    breaks[1] <- breaks[1] - adj
    breaks[length(breaks)] <- breaks[length(breaks)] + adj
  }
  as_discrete(x, breaks, left_closed = !right,
              include_endpoint = include.lowest,
              ordered = ordered.result,
              intv_start_fun = function(x) formatC(x, digits = 3, width = 1),
              intv_end_fun = function(x) formatC(x, digits = 3, width = 1))
}

x <- rnorm(100)
cheapr_cut(x, 10)
identical(cut(x, 10), cheapr_cut(x, 10))

Run the code above in your browser using DataLab