With fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs and fndistinct, collapse presents a coherent set of extremely fast and flexible statistical functions (S3 generics) to perform column-wise, grouped and weighted computations on vectors, matrices and data frames, with special support for grouped data frames / tibbles (dplyr) and data.table's.
x suitably aggregated or transformed. Data frame column-attributes and overall attributes are generally preserved if the output is of the same data type.
## All functions (FUN) follow a common syntax in 4 methods:
FUN(x, ...)## Default S3 method:
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, [nthreads = 1L,] ...)
## S3 method for class 'matrix'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)
## S3 method for class 'data.frame'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)
## S3 method for class 'grouped_df'
FUN(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = FALSE, keep.group_vars = TRUE,
    [keep.w = TRUE,] [stub = TRUE,] [nthreads = 1L,] ...)
| x | a vector, matrix, data frame or grouped data frame (class 'grouped_df'). | |
| g | a factor, GRPobject, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to aGRPobject) used to groupx. | |
| w | a numeric vector of (non-negative) weights, may contain missing values. Supported by fsum,fprod,fmean,fmedian,fnth,fvar,fsdandfmode. | |
| TRA | an integer or quoted operator indicating the transformation to perform:
0 - "na"     |     1 - "fill"     |     2 - "replace"     |     3 - "-"     |     4 - "-+"     |     5 - "/"     |     6 - "%"     |     7 - "+"     |     8 - "*"     |     9 - "%%"     |     10 - "-%%". See TRA. | |
| na.rm | logical. Skip missing values in x. Defaults toTRUEin all functions and implemented at very little computational cost. Not available forfnobs. | |
| use.g.names | logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. | |
| nthreads | integer. The number of threads to utilize. Supported by fsum,fmean,fmedian,fnth,fmodeandfndistinct. | |
| drop | matrix and data.frame methods: Logical. TRUEdrops dimensions and returns an atomic vector ifg = NULLandTRA = NULL. | |
| keep.group_vars | grouped_df method: Logical. FALSEremoves grouping variables after computation. By default grouping variables are added, even if not present in the grouped_df. | |
| keep.w | grouped_df method: Logical. TRUE(default) also aggregates weights and saves them in a column,FALSEremoves weighting variable after computation (if contained ingrouped_df). | |
| stub | grouped_df method: Character. If keep.w = TRUEandstub = TRUE(default), the aggregated weights column is prefixed by the name of the aggregation function (mostly"sum."). Users can specify a different prefix through this argument, or set it toFALSEto avoid prefixing. | |
| ... | arguments to be passed to or from other methods. If TRAis used, passingset = TRUEwill transform data by reference and return the result invisibly (except for the grouped_df method which always returns visible output). | 
Functions fquantile and frange are for atomic vectors.
Panel-decomposed (i.e. between and within) statistics as well as grouped and weighted skewness and kurtosis are implemented in qsu.
The vector-valued functions and operators fcumsum, fscale/STD, fbetween/B, fhdbetween/HDB, fwithin/W, fhdwithin/HDW, flag/L/F, fdiff/D/Dlog and fgrowth/G are grouped under Data Transformations and Time Series and Panel Series. These functions also support indexed data (plm).
## default vector method
mpg <- mtcars$mpg
fsum(mpg)                         # Simple sum
fsum(mpg, TRA = "/")              # Simple transformation: divide all values by the sum
fsum(mpg, mtcars$cyl)             # Grouped sum
fmean(mpg, mtcars$cyl)            # Grouped mean
fmean(mpg, w = mtcars$hp)         # Weighted mean, weighted by hp
fmean(mpg, mtcars$cyl, mtcars$hp) # Grouped mean, weighted by hp
fsum(mpg, mtcars$cyl, TRA = "/")  # Proportions / division by group sums
fmean(mpg, mtcars$cyl, mtcars$hp, # Subtract weighted group means, see also ?fwithin
      TRA = "-")## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")                  # This computes percentages
fsum(mtcars, mtcars[c(2,8:9)])           # Grouped column sum
g <- GRP(mtcars, ~ cyl + vs + am)        # Here precomputing the groups!
fsum(mtcars, g)                          # Faster !!
fmean(mtcars, g, mtcars$hp)
fmean(mtcars, g, mtcars$hp, "-")         # Demeaning by weighted group means..
fmean(fgroup_by(mtcars, cyl, vs, am), hp, "-")  # Another way of doing it..
fmode(wlddev, drop = FALSE)              # Compute statistical modes of variables in this data
fmode(wlddev, wlddev$income)             # Grouped statistical modes ..
## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, g) # ..
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars |> group_by(cyl,vs,am) |> select(mpg,carb) |> fsum()
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg,carb) |> fsum() # equivalent and faster !!
mtcars |> fgroup_by(cyl,vs,am) |> fsum(TRA = "%")
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp)         # weighted grouped mean, save sum of weights
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp, keep.group_vars = FALSE)
## This compares fsum with data.table (2 threads) and base::rowsum
# Starting with small data
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)library(microbenchmark)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
               rowsum(mtcDT, f, reorder = FALSE),
               fsum(mtcDT, f, na.rm = FALSE), unit = "relative")
#                              expr        min         lq      mean    median        uq       max neval cld
# mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726   100   c
# rowsum(mtcDT, f, reorder = FALSE)   2.833333   2.798203  2.489064  2.937889  2.425724  2.181173   100  b
#     fsum(mtcDT, f, na.rm = FALSE)   1.000000   1.000000  1.000000  1.000000  1.000000  1.000000   100 a
# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100.000 obs
f <- qF(sample.int(1e4, 1e5, TRUE))                        # A factor with 10.000 groups
microbenchmark(tdata[, lapply(.SD, sum), by = f],
               rowsum(tdata, f, reorder = FALSE),
               fsum(tdata, f, na.rm = FALSE), unit = "relative")
#                              expr      min       lq     mean   median       uq       max neval cld
# tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475   100   c
# rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937   100  b
#     fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100 a
Please see the documentation of individual functions.
Collapse Overview, Data Transformations, Time Series and Panel Series