fsum is a generic function that computes the (column-wise) sum of all values in x, (optionally) grouped by g and/or weighted by w (e.g., to calculate survey totals). The TRA argument can further be used to transform x using its (grouped, weighted) sum.
fsum(x, ...)

# S3 method for default
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
     use.g.names = TRUE, ...)

# S3 method for matrix
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
     use.g.names = TRUE, drop = TRUE, ...)

# S3 method for data.frame
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE,
     use.g.names = TRUE, drop = TRUE, ...)

# S3 method for grouped_df
fsum(x, w = NULL, TRA = NULL, na.rm = TRUE,
     use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, ...)
x: a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').
g: a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.
w: a numeric vector of (non-negative) weights, may contain missing values.
TRA: an integer or quoted operator indicating the transformation to perform: 1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.
na.rm: logical. Skip missing values in x. Defaults to TRUE and is implemented at very little computational cost. If na.rm = FALSE, NA is returned as soon as a missing value is encountered.
use.g.names: logical. Make group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.
drop: matrix and data.frame methods: logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.
keep.group_vars: grouped_df method: logical. FALSE removes grouping variables after computation.
keep.w: grouped_df method: logical. Retain the summed weighting variable after computation (if contained in grouped_df).
...: arguments to be passed to or from other methods.
The (w weighted) sum of x, grouped by g, or (if TRA is used) x transformed by its sum, grouped by g.
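To make "x transformed by its sum, grouped by g" more concrete, here is a minimal base R sketch of the arithmetic. It assumes that TRA = "%" expresses each value as a percentage of its group sum and that TRA = "-" subtracts the group sum from each value. The variable names are illustrative, and this is not how fsum is implemented.

```r
# Base R sketch (not fsum's implementation): a grouped sum and the
# arithmetic assumed here for TRA = "%" and TRA = "-".
x <- mtcars$mpg
g <- mtcars$cyl

grp_sum      <- tapply(x, g, sum)        # one sum per group
sum_expanded <- ave(x, g, FUN = sum)     # group sum repeated for every observation

pct_of_group <- x / sum_expanded * 100   # assumed equivalent of TRA = "%"
minus_sum    <- x - sum_expanded         # assumed equivalent of TRA = "-"

# Sanity check: percentages add up to 100 within each group
pct_totals <- tapply(pct_of_group, g, sum)
print(pct_totals)
```

The transformed result has the same length as x, whereas the plain grouped sum has one element per group.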
library(collapse)                              # fsum, GRP, qM, fgroup_by etc. come from collapse

## default vector method
mpg <- mtcars$mpg
fsum(mpg)                                      # Simple sum
fsum(mpg, w = mtcars$hp)                       # Weighted sum (total): Weighted by hp
fsum(mpg, TRA = "%")                           # Simple transformation: obtain percentages of mpg
fsum(mpg, mtcars$cyl)                          # Grouped sum
fsum(mpg, mtcars$cyl, mtcars$hp)               # Weighted grouped sum (total)
fsum(mpg, mtcars[c(2,8:9)])                    # More groups..
g <- GRP(mtcars, ~ cyl + vs + am)              # Precomputing groups gives more speed !
fsum(mpg, g)
fmean(mpg, g) == fsum(mpg, g) / fnobs(mpg, g)
fsum(mpg, g, TRA = "%")                        # Percentages by group

## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")
fsum(mtcars, g)
fsum(mtcars, g, TRA = "%")

## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, TRA = "%")
fsum(m, g)
fsum(m, g, TRA = "%")

## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars %>% group_by(cyl,vs,am) %>% fsum(hp)    # Weighted grouped sum (total)
mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(hp)   # Equivalent and faster !!
mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(TRA = "%")
mtcars %>% fgroup_by(cyl,vs,am) %>% fselect(mpg) %>% fsum()
## This compares fsum with data.table (2 threads) and base::rowsum
# Starting with small data
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)
library(microbenchmark)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
               rowsum(mtcDT, f, reorder = FALSE),
               fsum(mtcDT, f, na.rm = FALSE), unit = "relative")

                              expr        min         lq      mean    median        uq       max neval cld
 mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726   100   c
 rowsum(mtcDT, f, reorder = FALSE)   2.833333   2.798203  2.489064  2.937889  2.425724  2.181173   100  b
     fsum(mtcDT, f, na.rm = FALSE)   1.000000   1.000000  1.000000  1.000000  1.000000  1.000000   100 a

# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100,000 obs
f <- qF(sample.int(1e4, 1e5, TRUE))                        # A factor with 10,000 groups
microbenchmark(tdata[, lapply(.SD, sum), by = f],
               rowsum(tdata, f, reorder = FALSE),
               fsum(tdata, f, na.rm = FALSE), unit = "relative")

                              expr      min       lq     mean   median       uq       max neval cld
 tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475   100   c
 rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937   100  b
     fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100 a
Missing-value removal, as controlled by the na.rm argument, is done very efficiently by simply skipping missing values in the computation (thus setting na.rm = FALSE on data with no missing values doesn't give extra speed). Large performance gains can nevertheless be achieved in the presence of missing values if na.rm = FALSE, since the corresponding computation is then terminated as soon as an NA is encountered and NA is returned (unlike sum, which just runs through without any checks).
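The two behaviours described above can be sketched with plain R loops. The helper names sum_skip_na and sum_stop_at_na are hypothetical and exist only for illustration. fsum performs this work in compiled code, and edge cases such as all-missing input are not modelled here.

```r
# Illustrative R loops only; not fsum's actual implementation.

# Behaviour described for na.rm = TRUE: missing values are skipped.
sum_skip_na <- function(x) {
  total <- 0
  for (v in x) {
    if (!is.na(v)) total <- total + v
  }
  total
}

# Behaviour described for na.rm = FALSE: stop at the first missing value.
sum_stop_at_na <- function(x) {
  total <- 0
  for (v in x) {
    if (is.na(v)) return(NA_real_)   # terminate early and return NA
    total <- total + v
  }
  total
}

x <- c(1, NA, 3)
sum_skip_na(x)      # 4
sum_stop_at_na(x)   # NA
```

The early return is what saves time when a missing value appears near the start of a long vector.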
The weighted sum (e.g., survey total) is computed as sum(x * w), but in one pass and about twice as efficiently. If na.rm = TRUE, missing values will be removed from both x and w, i.e., only x[complete.cases(x,w)] and w[complete.cases(x,w)] are used.
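As a small worked example of the missing-value rule for weights, the following base R snippet computes the reference quantity sum(x * w) over complete cases. The data values are made up for illustration. The final call to fsum is guarded so that the snippet also runs when collapse is not installed; it should agree with the base R result according to the description above.

```r
# Reference computation in base R, following the description above.
x <- c(2, 4, NA, 8)
w <- c(1, NA, 3, 0.5)

ok <- complete.cases(x, w)               # TRUE only where both x and w are observed
weighted_total <- sum(x[ok] * w[ok])     # 2 * 1 + 8 * 0.5
weighted_total                           # 6

# Optional comparison with fsum, if the collapse package is available
if (requireNamespace("collapse", quietly = TRUE)) {
  print(collapse::fsum(x, w = w))
}
```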
All of this generalizes seamlessly to grouped computations, which are performed in a single pass (without splitting the data) and are therefore extremely fast. See the Benchmark and Examples.
When applied to data frames with groups or drop = FALSE, fsum preserves all column attributes (such as variable labels), unless columns have a class (checked using is.object). The attributes of the data frame itself are also preserved.
Since v1.6.0, fsum explicitly supports integers. Integers are summed using the long long type in C, which is bounded at ±9,223,372,036,854,775,807 (about 4.3 billion times the range of an R integer, which is bounded at ±2,147,483,647). If the value of the sum is outside ±2,147,483,647, a double containing the result is returned; otherwise an integer is returned. With groups, an integer overflow error is raised if the sum in any group is outside ±2,147,483,647.
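The integer bounds mentioned above can be checked in base R. This sketch shows only the bounds and how base sum() behaves at the limit. The guarded fsum call at the end is there to try the documented behaviour if collapse is installed, and no particular output is asserted for it.

```r
# R's integer range and the ratio to the C long long range
int_max <- .Machine$integer.max          # 2147483647
range_ratio <- 2^63 / 2^31               # 4294967296, i.e. about 4.3 billion

# A sum that just exceeds the R integer range
x <- c(int_max, 1L)
total_base <- suppressWarnings(sum(x))   # base sum() gives NA with an overflow warning
total_dbl  <- sum(as.numeric(x))         # 2147483648, representable as a double

# Optional: try fsum on the same input, if collapse is available
if (requireNamespace("collapse", quietly = TRUE)) {
  print(collapse::fsum(x))
}
```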