fmedian: Fast (Grouped, Weighted) Median Value for Matrix-Like Objects

Description

fmedian is a generic function that computes the (column-wise) median value of all values in x, (optionally) grouped by g and/or weighted by w. The TRA argument can further be used to transform x using its (grouped, weighted) median value.

Usage

fmedian(x, ...)
# S3 method for default
fmedian(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
        use.g.names = TRUE, ...)
# S3 method for matrix
fmedian(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
        use.g.names = TRUE, drop = TRUE, ...)
# S3 method for data.frame
fmedian(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
        use.g.names = TRUE, drop = TRUE, ...)
# S3 method for grouped_df
fmedian(x, w = NULL, TRA = NULL, na.rm = TRUE,
        use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, ...)

Arguments

x

a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

w

a numeric vector of (non-negative) weights. May contain missing values, but only where the corresponding value of x is also missing.

TRA

an integer or quoted operator indicating the transformation to perform: 1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.

na.rm

logical. Skip missing values in x. Defaults to TRUE and is implemented at very little computational cost. If na.rm = FALSE, NA is returned whenever a missing value is encountered.

use.g.names

logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.

drop

matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation.

keep.w

grouped_df method: Logical. Retain summed weighting variable after computation (if contained in grouped_df).

...

arguments to be passed to or from other methods.

Value

The (w weighted) median value of x, grouped by g, or (if TRA is used) x transformed by its median, grouped by g.
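As a rough illustration of what "x transformed by its median, grouped by g" means, the following base R sketch mimics a few of the TRA operations ("replace", "-" and "/") with ave(). The toy vectors x and g are made up for the example, and the correspondence with the actual fmedian output is an assumption based on the descriptions on this page, not a statement about the implementation.

```r
# Toy data (hypothetical): six values in two groups
x <- c(1, 2, 3, 10, 20, 30)
g <- c("a", "a", "a", "b", "b", "b")

# Group medians expanded to the length of x
# (conceptually what TRA = "replace" would return)
med <- ave(x, g, FUN = median)

# Conceptually TRA = "-": subtract the group median from each value
centered <- x - med

# Conceptually TRA = "/": divide each value by its group median
scaled <- x / med

print(rbind(x = x, med = med, centered = centered, scaled = scaled))
```

With the collapse package loaded, fmedian(x, g, TRA = "-") would be the one-call counterpart of the centered vector above, without the intermediate copy.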

Details

Median value estimation is done using std::nth_element in C++, which is an efficient partial sorting algorithm. A downside of this is that vectors need to be copied first and then partially sorted, thus fmedian currently requires additional memory equal to the size of the vector (x or a column of x).
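The idea of selecting the median by partial sorting can be sketched in base R with the partial argument of sort(), which only guarantees that the requested positions hold the values they would hold in a fully sorted vector. This is not the C++ code used by fmedian; partial_median is a hypothetical helper, and averaging the two middle values for even lengths is assumed here to follow stats::median.

```r
# Hypothetical helper: median via partial sorting (illustration only)
partial_median <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  else if (anyNA(x)) return(NA_real_)
  n <- length(x)
  if (n == 0L) return(NA_real_)
  half <- (n + 1L) %/% 2L
  if (n %% 2L == 1L) {
    # Odd length: only position 'half' needs to be in its sorted place
    sort(x, partial = half)[half]
  } else {
    # Even length: positions 'half' and 'half + 1' are needed
    idx <- half + 0:1
    mean(sort(x, partial = idx)[idx])
  }
}

partial_median(mtcars$mpg)
```

Note that sort() returns a new vector, which mirrors the memory cost described above: the input is copied before it is partially sorted.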

Grouped computations are currently performed by mapping the data to a sparse array and then partially sorting each row (group) of that array. Because of compiler optimizations, this requires less memory than the full deep copy made when no groups are used.

The weighted median is defined as the element k from a set of sorted elements such that the sum of weights of all elements larger than k and the sum of weights of all elements smaller than k are each <= sum(w)/2. If the half-sum of weights (sum(w)/2) is reached exactly at some element k (summing from the lower end), then both k and k+1 qualify as the weighted median, as do any zero-weight elements following k. fmedian resolves such ties by taking the simple arithmetic mean of all elements that qualify as the weighted median.
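The definition above can be written out directly in base R. The sketch below is a deliberately naive, quadratic-time transcription of that definition for small inputs, not the ordering-based algorithm used by fmedian; weighted_median_sketch is a hypothetical name, and its results are not guaranteed to match fmedian in every edge case (for example, with duplicated values in x).

```r
# Hypothetical helper: weighted median by direct application of the definition
weighted_median_sketch <- function(x, w) {
  keep <- !is.na(x)          # skip missing x (and the weights that go with them)
  x <- x[keep]
  w <- w[keep]
  half <- sum(w) / 2
  # Sum of weights of all elements strictly smaller / strictly larger than each element
  below <- vapply(x, function(k) sum(w[x < k]), numeric(1))
  above <- vapply(x, function(k) sum(w[x > k]), numeric(1))
  # An element qualifies if neither side carries more than half the total weight
  qualifies <- below <= half & above <= half
  # Ties: simple arithmetic mean of all qualifying elements
  mean(x[qualifies])
}

weighted_median_sketch(c(1, 2, 3), c(1, 1, 1))        # one element qualifies
weighted_median_sketch(c(1, 2, 3, 4), c(1, 1, 1, 1))  # half-sum reached exactly: two qualify
weighted_median_sketch(c(10, 20, 30), c(1, 1, 5))     # heavy weight on the largest value
```

In the second call the half-sum of weights (2) is reached exactly at the element 2, so both 2 and 3 qualify and their mean is returned.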

The weighted median is computed using radixorder to first obtain an ordering of all elements, so it is considerably more computationally expensive than the unweighted version. With groups, the entire vector is also ordered, and the weighted median is computed in a single ordered pass through the data (after group-summing the weights, skipping weights for which x is missing).

If x is a matrix or data frame, these computations are performed independently for each column. When applied to data frames with groups or drop = FALSE, fmedian preserves all column attributes (such as variable labels) but does not distinguish between classed and unclassed objects. The attributes of the data frame itself are also preserved.

Examples


# NOT RUN {
## default vector method
mpg <- mtcars$mpg
fmedian(mpg)                         # Simple median value
fmedian(mpg, w = mtcars$hp)          # Weighted median: Weighted by hp
fmedian(mpg, TRA = "-")              # Simple transformation: Subtract median value
fmedian(mpg, mtcars$cyl)             # Grouped median value
fmedian(mpg, mtcars[c(2,8:9)])       # More groups..
g <- GRP(mtcars, ~ cyl + vs + am)    # Precomputing groups gives more speed !
fmedian(mpg, g)
fmedian(mpg, g, mtcars$hp)           # Grouped weighted median
fmedian(mpg, g, TRA = "-")           # Groupwise subtract median value
fmedian(mpg, g, mtcars$hp, "-")      # Groupwise subtract weighted median value

## data.frame method
fmedian(mtcars)
head(fmedian(mtcars, TRA = "-"))
fmedian(mtcars, g)
fmedian(fgroup_by(mtcars, cyl, vs, am))   # Another way of doing it..
fmedian(mtcars, g, use.g.names = FALSE)   # No row-names generated

## matrix method
m <- qM(mtcars)
fmedian(m)
head(fmedian(m, TRA = "-"))
fmedian(m, g) # etc..
# }
# NOT RUN {
library(dplyr)
# grouped_df method
mtcars %>% group_by(cyl,vs,am) %>% fmedian()
mtcars %>% group_by(cyl,vs,am) %>% fmedian(hp)             # Weighted
mtcars %>% fgroup_by(cyl,vs,am) %>% fmedian()              # Faster grouping!
mtcars %>% fgroup_by(cyl,vs,am) %>% fmedian(TRA = "-")     # De-median
mtcars %>% fgroup_by(cyl,vs,am) %>% fselect(mpg, hp) %>%    # Faster selecting
      fmedian(hp, "-")  # Weighted de-median mpg, using hp as weights
# }
