curve_interval: Curvewise point and interval summaries for tidy data frames of draws from distributions

Description

Translates draws from distributions in a grouped data frame into a set of point and interval summaries using a curve boxplot-inspired approach.

Usage

curve_interval(
  .data,
  ...,
  .along = NULL,
  .width = 0.5,
  .interval = c("mhd", "mbd", "bd", "bd-mbd"),
  .simple_names = TRUE,
  na.rm = FALSE,
  .exclude = c(".chain", ".iteration", ".draw", ".row")
)

Arguments

.data

Data frame (or grouped data frame as returned by group_by()) that contains draws to summarize.

...

Bare column names or expressions that, when evaluated in the context of .data, represent draws to summarize. If this is empty, then by default all columns that are not group columns and which are not in .exclude (by default ".chain", ".iteration", ".draw", and ".row") will be summarized. These can be list columns.

.along

Which columns are the input values to the function describing the curve (e.g., the "x" values). Supports tidyselect syntax, as in dplyr::select(). Intervals are calculated jointly with respect to these variables, conditional on all other grouping variables in the data frame. The default (NULL) causes curve_interval() to use all grouping variables in the input data frame, which will generate the most conservative intervals. However, if you want to calculate intervals for some function y = f(x) conditional on some other variable(s) (say, conditional on a factor g), you would group by g, then use .along = x to calculate intervals jointly over x conditional on g.
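As a minimal sketch of the conditional case described above: the data frame df_g, its columns, and the result name ci_g are hypothetical, and the example assumes curve_interval() is available from the ggdist package. It groups by the conditioning factor g and passes x through .along, so intervals are joint over x within each level of g.

```r
library(dplyr)
library(tidyr)
library(ggdist)  # assumption: the package providing curve_interval()

set.seed(1234)

# hypothetical data: 20 draws of a curve y = f(x) for each level of a factor g
df_g = expand_grid(.draw = 1:20, g = c("a", "b"), x = seq(0, 1, length.out = 25)) %>%
  group_by(.draw, g) %>%
  mutate(y = sin(2 * pi * x) + rnorm(1)) %>%  # one random vertical shift per curve
  ungroup()

# group by g (the variable we condition on), then compute intervals jointly over x
ci_g = df_g %>%
  group_by(g) %>%
  curve_interval(y, .along = x, .width = 0.5)
```

Grouping by both g and x and leaving .along = NULL would instead compute a single joint interval over every combination of g and x, which is the more conservative default described above.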

.width

Vector of probabilities that determine the widths of the resulting intervals. If multiple probabilities are provided, multiple rows per group are generated, each with a different probability interval (and value of the corresponding .width column).

.interval

The method used to calculate the intervals. Currently, all methods rank the curves using some measure of data depth, then create envelopes containing the 100 * .width% "deepest" curves. Available methods are:

"mhd": mean halfspace depth (Fraiman and Muniz 2001).
"mbd": modified band depth (Sun and Genton 2011): calls fda::fbplot() with method = "MBD".
"bd": band depth (Sun and Genton 2011): calls fda::fbplot() with method = "BD2".
"bd-mbd": band depth, breaking ties with modified band depth (Sun and Genton 2011): calls fda::fbplot() with method = "Both".
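To make the idea of ranking curves by depth concrete, here is a rough base R sketch of a mean halfspace depth. The helper mean_halfspace_depth() is hypothetical and is not necessarily how the package computes "mhd": it takes a matrix with one row per curve and one column per x value, computes a pointwise halfspace depth from ranks at each x, and averages over x.

```r
# Hypothetical helper (an illustration, not the package's implementation):
# mean halfspace depth of each curve.
# `curves` is a matrix with one row per curve and one column per x value.
mean_halfspace_depth = function(curves) {
  n = nrow(curves)
  # at each x (column), rank the curves; a curve's pointwise depth is the
  # smaller of the fractions of curves at-or-below and at-or-above it
  pointwise = apply(curves, 2, function(y) {
    r = rank(y)
    pmin(r, n + 1 - r) / n
  })
  # average the pointwise depths over x to get one score per curve
  rowMeans(pointwise)
}

# same family of curves as in the Examples section: shifted normal densities
x = seq(-15, 15, length.out = 201)
means = seq(-5, 5, length.out = 11)
curves = t(sapply(means, function(m) dnorm(x, m, 3)))

depth = mean_halfspace_depth(curves)
order(depth, decreasing = TRUE)  # curve indices from deepest to shallowest
```

Curves that sit near the middle of the ensemble at most x values get higher scores; the band depth methods rank curves by a different criterion but are used in the same way.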

.simple_names

When TRUE and only a single column / vector is to be summarized, use the name .lower for the lower end of the interval and .upper for the upper end. If .data is a vector and this is TRUE, this will also set the column name of the point summary to .value. When FALSE and .data is a data frame, names the lower and upper intervals for each column x x.lower and x.upper. When FALSE and .data is a vector, uses the naming scheme y, ymin and ymax (for use with ggplot).

na.rm

Logical value indicating whether NA values should be stripped before the computation proceeds. If FALSE (the default), the presence of NA values in the columns to be summarized will generally result in an error. If TRUE, NA values will be removed in the calculation of intervals so long as .interval is "mhd"; other methods do not currently support na.rm. Be cautious in applying this parameter: in general, it is unclear what a joint interval should be when any of the values are missing!

.exclude

A character vector of names of columns to be excluded from summarization if no column names are specified to be summarized. Default ignores several meta-data column names used in tidybayes.

Value

A data frame containing point summaries and intervals, with at least one column corresponding to the point summary, one to the lower end of the interval, one to the upper end of the interval, the width of the interval (.width), the type of point summary (.point), and the type of interval (.interval).

Details

Intervals are calculated by ranking the curves using some measure of data depth, then creating envelopes containing the 100 * .width% "deepest" curves (for each value of .width). Thus, the intervals are guaranteed to contain at least 100 * .width% of the full curves, but may be conservative (i.e., they may contain more than that share of the curves). See Mirzargar et al. (2014) or Juul et al. (2020) for an accessible introduction to the idea.
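The envelope construction can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: curve_envelope() is a hypothetical helper, the depth score is a simple rank-based stand-in, and keeping ceiling(width * n) curves is one plausible selection rule among several.

```r
# Hypothetical helper: envelope of the deepest curves.
# `curves`: matrix with one row per curve and one column per x value.
# `depth`: one depth score per curve (larger = more central).
curve_envelope = function(curves, depth, width) {
  n = nrow(curves)
  n_keep = ceiling(width * n)  # keep at least width * n curves
  keep = order(depth, decreasing = TRUE)[seq_len(n_keep)]
  kept = curves[keep, , drop = FALSE]
  # the envelope is the pointwise range of the kept curves
  lower = apply(kept, 2, min)
  upper = apply(kept, 2, max)
  # fraction of ALL curves lying entirely inside the envelope
  inside = apply(curves, 1, function(y) all(y >= lower & y <= upper))
  list(lower = lower, upper = upper, coverage = mean(inside))
}

set.seed(42)
# hypothetical ensemble: 50 random walks of length 30
curves = t(replicate(50, cumsum(rnorm(30))))
n = nrow(curves)

# stand-in depth score: average over x of a rank-based pointwise depth
depth = rowMeans(apply(curves, 2, function(y) pmin(rank(y), n + 1 - rank(y)) / n))

env = curve_envelope(curves, depth, width = 0.5)
env$coverage  # at least 0.5, possibly more
```

Every kept curve lies inside its own envelope, which is why coverage cannot fall below the requested width; shallower curves that happen to stay within the band push it higher, which is the conservativeness mentioned above.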

References

Fraiman, Ricardo and Graciela Muniz. (2001). "Trimmed means for functional data". Test 10: 419–440. 10.1007/BF02595706

Sun, Ying and Marc G. Genton. (2011). "Functional Boxplots". Journal of Computational and Graphical Statistics, 20(2): 316-334. 10.1198/jcgs.2011.09224

Mirzargar, Mahsa, Ross T Whitaker, and Robert M Kirby. (2014). "Curve Boxplot: Generalization of Boxplot for Ensembles of Curves". IEEE Transactions on Visualization and Computer Graphics. 20(12): 2654-2663. 10.1109/TVCG.2014.2346455

Juul, Jonas, Kaare Græsbøll, Lasse Engbo Christiansen, and Sune Lehmann. (2020). "Fixed-time descriptive statistics underestimate extremes of epidemic curve ensembles". arXiv e-print. arXiv:2007.05035

Examples

# NOT RUN {
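library(dplyr)
library(tidyr)
library(ggplot2)
# assumption: ggdist provides curve_interval(), median_qi(),
# geom_lineribbon(), and theme_ggdist() used below
library(ggdist)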

# generate a set of curves
k = 11  # number of curves
n = 201 # number of x values per curve
df = tibble(
    .draw = 1:k,
    mean = seq(-5,5, length.out = k),
    x = list(seq(-15,15,length.out = n))
  ) %>%
  unnest(x) %>%
  mutate(y = dnorm(x, mean, 3))

# see pointwise intervals...
df %>%
  group_by(x) %>%
  median_qi(y, .width = c(.5)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_lineribbon(aes(ymin = .lower, ymax = .upper)) +
  geom_line(aes(group = .draw), alpha=0.15, data = df) +
  scale_fill_brewer() +
  ggtitle("50% pointwise intervals with point_interval()") +
  theme_ggdist()

# ... compare them to curvewise intervals
df %>%
  group_by(x) %>%
  curve_interval(y, .width = c(.5)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_lineribbon(aes(ymin = .lower, ymax = .upper)) +
  geom_line(aes(group = .draw), alpha=0.15, data = df) +
  scale_fill_brewer() +
  ggtitle("50% curvewise intervals with curve_interval()") +
  theme_ggdist()

# }
