BY: Split-Apply-Combine Computing

Description

BY is an S3 generic that efficiently applies functions by groups over vectors or over the columns of matrices and data frames, and returns the result in various output formats. Simple parallel execution is also available.

Usage

BY(X, ...)
# S3 method for default
BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same","list"))
# S3 method for matrix
BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same","matrix","data.frame","list"))
# S3 method for data.frame
BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same","matrix","data.frame","list"))
# S3 method for grouped_df
BY(X, FUN, ..., use.g.names = FALSE, keep.group_vars = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same","matrix","data.frame","list"))

Arguments

X

an atomic vector, matrix or data frame.

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group X.

FUN

a function, can be scalar- or vector-valued.

...

further arguments to FUN.

use.g.names

logical. Make group names and add them to the result as names (vector method) or row names (matrix and data.frame methods). No row names are generated for data.tables and grouped tibbles.

sort

logical. Sort the groups? Internally passed to GRP or qF, and only effective if g is not already a factor or GRP object.

expand.wide

logical. If FUN is a vector-valued function returning a vector of fixed length > 1 (such as the quantile function), expand.wide can be used to return the result in a wider format (instead of stacking the resulting vectors of fixed length above each other in each output column).

parallel

logical. TRUE implements simple parallel execution by internally calling parallel::mclapply instead of base::lapply.

mc.cores

integer. Argument to parallel::mclapply indicating the number of cores to use for parallel execution. parallel::detectCores() can be used to find the number of available cores. See also ?parallel::mclapply.

return

an integer or string indicating the type of object to return. The default 1 - "same" returns the same object type (i.e. passing a matrix returns a matrix and passing a data frame returns a data frame). 2 - "matrix" always returns the output as matrix, 3 - "data.frame" always returns a data frame and 4 - "list" returns the raw (uncombined) output. Note: 4 - "list" works together with expand.wide to return a list of matrices.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation.
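
The difference between the stacked and the wide layout described under expand.wide can be pictured with base R alone. The following is only a sketch under that assumption: it uses split, lapply, unlist and rbind rather than BY itself, and the exact names and classes that BY produces may differ.

```r
# Base R illustration of "stacked" versus "wide" results for a
# vector-valued function; this is not the collapse implementation.
v <- iris$Sepal.Length
f <- iris$Species

# quantile() returns 5 values per group
q <- lapply(split(v, f), quantile)

# Stacked: the per-group vectors are placed one after another
stacked <- unlist(q)

# Wide: one row per group, one column per returned value
wide <- do.call(rbind, q)

length(stacked)
dim(wide)
```

With three species and five quantiles each, the stacked result holds 15 values, while the wide result is a matrix with 3 rows and 5 columns.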

Value

X with FUN applied to every column, split by g.

Details

BY is a frugal reimplementation of the Split-Apply-Combine computing paradigm. It is faster than base::tapply, base::by, base::aggregate and plyr, and preserves data attributes just like dapply.

I note at this point that the philosophy of collapse is to move beyond this rather slow computing paradigm, which is why the Fast Statistical Functions were implemented. However, some tasks involve more complex and customized operations on data, and for these cases BY is a good solution.

BY is built principally as a wrapper around lapply(split(x, g), FUN, ...), but strongly optimizes on attribute checking compared to base R. For more details, examine the code yourself or look at the documentation for dapply, which works very similarly (the only real difference is the splitting performed in BY).
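
The split, apply and combine steps named above can be sketched in base R as follows. This is a minimal illustration and not the code of BY: the variable names (pieces, results, sums, cores) are made up for the example, and the attribute handling that BY adds is left out. The second half swaps lapply for parallel::mclapply, which is the substitution the parallel argument is described as making; two cores are an arbitrary choice here, and a single core is used on Windows because mclapply does not support mc.cores > 1 there.

```r
# Base R sketch of split-apply-combine; not the collapse implementation.
v <- iris$Sepal.Length
f <- iris$Species

pieces  <- split(v, f)          # split: one numeric vector per species
results <- lapply(pieces, sum)  # apply: call FUN on every piece
sums    <- unlist(results)      # combine: named numeric vector

# Same steps with forked parallel execution instead of lapply.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
par_sums <- unlist(parallel::mclapply(pieces, sum, mc.cores = cores))

sums
```

Both variants return one sum per species, named by the factor levels.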

BY is used internally in collap (collapse's main aggregation command) for functions that are not Fast Statistical Functions.

Examples

# NOT RUN {
v <- iris$Sepal.Length   # A numeric vector
f <- iris$Species        # A factor. Vectors/lists will internally be converted to factor

## default vector method
BY(v, f, sum)                          # Sum by species
BY(v, f, scale)                        # Scale by species (please use fscale instead)
BY(v, f, scale, use.g.names = FALSE)   # Omitting auto-generated names
BY(v, f, quantile)                     # Species quantiles: by default stacked
BY(v, f, quantile, expand.wide = TRUE) # Wide format

## matrix method
m <- qM(num_vars(iris))
BY(m, f, sum)                          # Also return as matrix
BY(m, f, sum, return = "data.frame")   # Return as data.frame ... also works for computations below
BY(m, f, scale)
BY(m, f, scale, use.g.names = FALSE)
BY(m, f, quantile)
BY(m, f, quantile, expand.wide = TRUE)
BY(m, f, quantile, expand.wide = TRUE, # Return as list of matrices
   return = "list")

## data.frame method
BY(num_vars(iris), f, sum)             # Also returns a data.frame
BY(num_vars(iris), f, sum, return = 2) # Return as matrix ... also works for computations below
BY(num_vars(iris), f, scale)
BY(num_vars(iris), f, scale, use.g.names = FALSE)
BY(num_vars(iris), f, quantile)
BY(num_vars(iris), f, quantile, expand.wide = TRUE)
BY(num_vars(iris), f, quantile,        # Return as list of matrices
   expand.wide = TRUE, return = "list")

## grouped tibble method
library(dplyr)
giris <- group_by(iris, Species)
giris %>% BY(sum)                     # Compute sum
giris %>% BY(sum, use.g.names = TRUE, # Use row.names and
             keep.group_vars = FALSE) # remove 'Species' and groups attribute
giris %>% BY(sum, return = "matrix")  # Return matrix
giris %>% BY(sum, return = "matrix",  # Matrix with row.names
             use.g.names = TRUE)
giris %>% BY(log)                     # Take logs
giris %>% BY(log, use.g.names = TRUE, # Use row.names and
             keep.group_vars = FALSE) # remove 'Species' and groups attribute
giris %>% BY(quantile)                # Compute quantiles (output is stacked)
giris %>% BY(quantile,                # Much better, also keeps 'Species'
             expand.wide = TRUE)
# }
