BY: Split-Apply-Combine Computing

Description

BY is an S3 generic that efficiently applies functions over vectors, or over the columns of matrices and data frames, by groups. Like dapply, it seeks to retain the structure and attributes of the data, but it can also output to various standard formats. Simple parallelism is also available.

Usage

BY(x, …)
# S3 method for default
BY(x, g, FUN, …, use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "vector", "list"))
# S3 method for matrix
BY(x, g, FUN, …, use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))
# S3 method for data.frame
BY(x, g, FUN, …, use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))
# S3 method for grouped_df
BY(x, FUN, …, keep.group_vars = TRUE, use.g.names = FALSE)

Arguments

x

an atomic vector, matrix, data frame or similar object.

g

a GRP object, or a factor / atomic vector / list of atomic vectors (internally converted to a GRP object) used to group x.

FUN

a function, which can be scalar- or vector-valued. For vector-valued functions see expand.wide and the Note.

…

further arguments to FUN, or to BY.data.frame for the 'grouped_df' method.

use.g.names

logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.

sort

logical. Sort the groups? Internally passed to GRP, and only effective if g is not already a factor or GRP object.

expand.wide

logical. If FUN is a vector-valued function returning a vector of fixed length > 1 (such as the quantile function), expand.wide can be used to return the result in a wider format (instead of stacking the resulting vectors of fixed length above each other in each output column).

parallel

logical. TRUE implements simple parallel execution by internally calling mclapply instead of lapply.

mc.cores

integer. Argument to mclapply indicating the number of cores to use for parallel execution. Can use detectCores() to select all available cores.

return

an integer or string indicating the type of object to return. The default 1 - "same" returns the same object type (i.e. class and other attributes are retained if the underlying data type is the same, just the names for the dimensions are adjusted). 2 - "matrix" always returns the output as matrix, 3 - "data.frame" always returns a data frame and 4 - "list" returns the raw (uncombined) output. Note: 4 - "list" works together with expand.wide to return a list of matrices.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation. See also the Note.
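
The difference between the stacked and the wide layout described under expand.wide can be illustrated with a minimal base R sketch. It does not call BY and is not its internal code; it only mimics the two output shapes for a vector input, using split, lapply, unlist and rbind as stand-ins.

```r
# Base R sketch of the two output shapes (an illustration, not collapse's code)
v <- iris$Sepal.Length
g <- iris$Species

# Apply a vector-valued function (5 quantiles) to each group
res <- lapply(split(v, g), quantile)

# Stacked layout: the per-group vectors are concatenated end to end
stacked <- unlist(res)

# Wide layout: one row per group, one column per value returned by FUN
wide <- do.call(rbind, res)

length(stacked)
dim(wide)
```

With three species and five quantiles per species, the stacked result is a single long vector, while the wide result is a matrix with one row per species.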

Value

x, with FUN applied to every column, split by g.

Details

BY is a frugal re-implementation of the Split-Apply-Combine computing paradigm. It is faster than tapply, by, aggregate and plyr, and preserves data attributes just like dapply.

It is principally a wrapper around lapply(gsplit(x, g), FUN, …) that uses gsplit for optimized splitting and whose internal code is also strongly optimized compared to base R functions. For more details see the documentation for dapply, which works very similarly (apart from the splitting performed in BY). The function is intended for simple usage involving data aggregation or flexible computation of summary statistics across groups. For larger tasks requiring split-apply-combine computing on data frames, the Fast Statistical Functions and the data.table package are more appropriate tools.
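
The split, apply and combine steps named above can be sketched in base R as follows. The helper by_sketch is hypothetical and exists only for this illustration: it uses base split instead of gsplit, does not build a GRP object, and does not preserve attributes, so it shows the idea rather than how BY is implemented. The parallel branch assumes that mclapply from the parallel package is an acceptable stand-in; on Windows, mclapply only supports mc.cores = 1.

```r
# Hypothetical helper: split-apply-combine for a vector, in base R only
by_sketch <- function(x, g, FUN, ..., parallel = FALSE, mc.cores = 1L) {
  pieces <- split(x, g)                        # split x into one piece per group
  results <- if (parallel) {
    # apply in parallel (forking; mc.cores > 1 is not supported on Windows)
    parallel::mclapply(pieces, FUN, ..., mc.cores = mc.cores)
  } else {
    lapply(pieces, FUN, ...)                   # apply FUN to each piece
  }
  unlist(results)                              # combine into a named vector
}

sums <- by_sketch(iris$Sepal.Length, iris$Species, sum)
sums
```

For a scalar-valued function such as sum, the result is a named vector with one element per group, which matches what tapply returns for the same input.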

Examples

# NOT RUN {
v <- iris$Sepal.Length   # A numeric vector
f <- GRP(iris$Species)   # A grouping

## default vector method
BY(v, f, sum)                                # Sum by species
head(BY(v, f, scale))                        # Scale by species (please use fscale instead)
head(BY(v, f, scale, use.g.names = FALSE))   # Omitting auto-generated names
BY(v, f, quantile)                           # Species quantiles: by default stacked
BY(v, f, quantile, expand.wide = TRUE)       # Wide format

## matrix method
m <- qM(num_vars(iris))
BY(m, f, sum)                          # Also return as matrix
BY(m, f, sum, return = "data.frame")   # Return as data.frame (also works for the computations below)
head(BY(m, f, scale))
head(BY(m, f, scale, use.g.names = FALSE))
BY(m, f, quantile)
BY(m, f, quantile, expand.wide = TRUE)
BY(m, f, quantile, expand.wide = TRUE, # Return as list of matrices
   return = "list")

## data.frame method
BY(num_vars(iris), f, sum)             # Also returns a data.frame
BY(num_vars(iris), f, sum, return = 2) # Return as matrix (also works for the computations below)
head(BY(num_vars(iris), f, scale))
head(BY(num_vars(iris), f, scale, use.g.names = FALSE))
BY(num_vars(iris), f, quantile)
BY(num_vars(iris), f, quantile, expand.wide = TRUE)
BY(num_vars(iris), f, quantile,        # Return as list of matrices
   expand.wide = TRUE, return = "list")
 
# }
# NOT RUN {
## grouped data frame method
library(magrittr) # Note: Used because |> is not available on older R versions
giris <- fgroup_by(iris, Species)
giris %>% BY(sum)                      # Compute sum
giris %>% BY(sum, use.g.names = TRUE,  # Use row.names and
             keep.group_vars = FALSE) # remove 'Species' and groups attribute
giris %>% BY(sum, return = "matrix")   # Return matrix
giris %>% BY(sum, return = "matrix",   # Matrix with row.names
             use.g.names = TRUE)
giris %>% BY(quantile)                 # Compute quantiles (output is stacked)
giris %>% BY(quantile,                 # Much better, also keeps 'Species'
             expand.wide = TRUE)
# }
