roll: Rolling functions

Description

Fast rolling functions to calculate aggregates on sliding windows. Function name and arguments are experimental.

Usage

frollmean(x, n, fill=NA, algo=c("fast", "exact"),
          align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE)
frollsum(x, n, fill=NA, algo=c("fast","exact"),
         align=c("right", "left", "center"), na.rm=FALSE, hasNA=NA, adaptive=FALSE)
frollapply(x, n, FUN, ..., fill=NA, align=c("right", "left", "center"))

Value

A list except when the input is a vector and length(n)==1, in which case a vector is returned.

Arguments

x: Vector, data.frame or data.table of integer, numeric or logical columns over which to calculate the windowed aggregations. May also be a list, in which case the rolling function is applied to each of its elements.
n: Integer vector giving rolling window size(s). This is the total number of included values. Adaptive rolling functions also accept a list of integer vectors.
fill: Numeric; value to pad by. Defaults to NA.
algo: Character, default "fast". When set to "exact", a slower (but more accurate) algorithm is used. It suffers less from floating point rounding errors by performing an extra pass, and carefully handles all non-finite values. It will use multiple cores where available. See Details for more information.
align: Character, specifying the "alignment" of the rolling window, defaulting to "right". "right" covers preceding rows (the window ends on the current value); "left" covers following rows (the window starts on the current value); "center" is halfway in between (the window is centered on the current value, biased towards "left" when n is even).
na.rm: Logical, default FALSE. Should missing values be removed when calculating window? For details on handling other non-finite values, see Details.
hasNA: Logical. If it is known that x contains NA then setting this to TRUE will speed up calculation. Defaults to NA.
adaptive: Logical, default FALSE. Should the rolling function be calculated adaptively? See Details below.
FUN: The function to be applied to the rolling window; see Details for restrictions.
...: Extra arguments passed to FUN in frollapply.
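
As a small illustration of how align and fill interact, the sketch below applies a window of width 3 to a five-element vector. The input values are made up for illustration, and the results in the comments follow from the argument descriptions above.

```r
library(data.table)

x = c(1, 2, 3, 4, 5)   # illustrative input, not from the package examples

# "right": the window ends on the current value, so the first n-1 results are padded
right = frollmean(x, 3)                    # NA NA  2  3  4
# "left": the window starts on the current value, so the padding moves to the end
left = frollmean(x, 3, align="left")       #  2  3  4 NA NA
# "center": the window is centered on the current value
center = frollmean(x, 3, align="center")   # NA  2  3  4 NA
# fill replaces the default NA padding
padded = frollmean(x, 3, fill=0)           #  0  0  2  3  4
```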

Details

froll* functions accept vectors, lists, data.frames or data.tables. They always return a list except when the input is a vector and length(n)==1, in which case a vector is returned, for convenience. Thus, rolling functions can be used conveniently within data.table syntax.
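
The return-shape rule above can be checked with a short sketch. The column name a_ma3 and the input values are hypothetical, chosen only for illustration.

```r
library(data.table)

x = c(1, 2, 3, 4, 5)

# vector input and a single window: a plain vector comes back
v = frollmean(x, 3)
is.list(v)                      # FALSE

# several windows, or list-like input: a list comes back
l1 = frollmean(x, c(2, 3))      # one element per window size
l2 = frollmean(list(x, x*2), 3) # one element per input vector
is.list(l1)                     # TRUE
is.list(l2)                     # TRUE

# the vector form drops straight into data.table's := assignment
DT = data.table(a = x)
DT[, a_ma3 := frollmean(a, 3)]
```

Because the single-vector, single-window case returns a vector, no unlisting is needed before assigning the result as a column.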

Argument n accepts multiple values, so that rolling functions can be applied over several window sizes at once. If adaptive=TRUE, n instead supplies one window size per observation: either a single integer vector with one entry for every observation in each column, or a list of such vectors to compute several adaptive windows at once; see Examples.

When algo="fast", an "on-line" algorithm is used, and all of NaN, +Inf, -Inf are treated as NA. Setting algo="exact" makes the rolling functions use a more computationally intensive algorithm that suffers less from floating point rounding error (the same consideration applies to mean). algo="exact" also handles NaN, +Inf, -Inf consistently with base R. For some functions (like mean), it additionally makes an extra pass to perform floating point error correction. Error corrections might not be truly exact on some platforms (like Windows) when using multiple threads.
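
A minimal sketch of comparing the two algorithms against a plain base R loop follows. The data are small, well-behaved illustrative values, so all three results should agree within all.equal's tolerance; visible differences would only be expected on inputs with a wide range of magnitudes, as in the "performance vs exactness" example further down.

```r
library(data.table)

x = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)   # illustrative input
n = 3

fast  = frollmean(x, n)                 # default on-line algorithm
exact = frollmean(x, n, algo="exact")   # slower, with error correction

# reference: recompute every window from scratch with base R's mean
ref = rep(NA_real_, length(x))
for (i in n:length(x)) ref[i] = mean(x[(i-n+1):i])

all.equal(fast, exact)
all.equal(exact, ref)
```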

Adaptive rolling functions are a special case where each observation has its own corresponding rolling window width. Due to the logic of adaptive rolling functions, the following restrictions apply:

align must be "right".
If a list of vectors is passed to x, then all vectors within it must have equal length.
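
A short sketch of an adaptive call under these restrictions: the window widths in n grow from 1 to 3 and then stay at 3, so every observation has a complete window. The values of x, n and n2 are illustrative only.

```r
library(data.table)

x = c(1, 2, 3, 4, 5)
n = c(1L, 2L, 3L, 3L, 3L)        # one window width per observation

# each result averages the last n[i] values ending at position i
grow = frollmean(x, n, adaptive=TRUE)    # 1.0 1.5 2.0 3.0 4.0

# a list of width vectors computes several adaptive windows at once
n2 = c(1L, 1L, 2L, 2L, 2L)
both = frollmean(x, list(n, n2), adaptive=TRUE)
both[[2]]                                # 1.0 2.0 2.5 3.5 4.5
```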

When multiple columns or multiple window widths are provided, they are processed in parallel. The exception is algo="exact", which already runs in parallel internally.

frollapply computes rolling aggregates using arbitrary R functions. The input x (first argument) to the function FUN is coerced to numeric beforehand, and FUN has to return a scalar numeric value. Checks for that are made only during the first iteration when FUN is evaluated. Edge cases can be found in the examples below. Any R function is supported, but it is not optimized using our own C implementation; hence, for example, using frollapply to compute a rolling average is inefficient. It is also always single-threaded because there is no thread-safe API to R's C eval. Nevertheless, we've seen the computation speed up vis-a-vis versions implemented in base R.
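
As a sketch of typical frollapply usage, the calls below pass functions that return a single double for every window. The helper spread is hypothetical and the input values are illustrative.

```r
library(data.table)

x = c(1, 5, 2, 8, 3)

# rolling maximum and median over windows of width 3
rmax = frollapply(x, 3, max)       # NA NA 5 8 8
rmed = frollapply(x, 3, median)    # NA NA 2 5 3

# a user-defined scalar function (hypothetical helper): width of the window's range
spread = function(w) max(w) - min(w)
rspr = frollapply(x, 3, spread)    # NA NA 4 6 6

# same values as frollmean, but without the optimized C implementation
all.equal(frollapply(x, 3, mean), frollmean(x, 3))
```

For aggregates that have a dedicated froll* function, the dedicated function is the better choice; frollapply is the fallback for everything else.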

References

Round-off error

Examples


d = as.data.table(list(1:6/2, 3:8/4))
# rollmean of single vector and single window
frollmean(d[, V1], 3)
# multiple columns at once
frollmean(d, 3)
# multiple windows at once
frollmean(d[, .(V1)], c(3, 4))
# multiple columns and multiple windows at once
frollmean(d, c(3, 4))
## three calls above will use multiple cores when available

# partial window using adaptive rolling function
an = function(n, len) c(seq.int(n), rep(n, len-n))
n = an(3, nrow(d))
frollmean(d, n, adaptive=TRUE)

# frollsum
frollsum(d, 3:4)

# frollapply
frollapply(d, 3:4, sum)
f = function(x, ...) if (sum(x, ...)>5) min(x, ...) else max(x, ...)
frollapply(d, 3:4, f, na.rm=TRUE)

# performance vs exactness
set.seed(108)
x = sample(c(rnorm(1e3, 1e6, 5e5), 5e9, 5e-9))
n = 15
ma = function(x, n, na.rm=FALSE) {
  ans = rep(NA_real_, nx<-length(x))
  for (i in n:nx) ans[i] = mean(x[(i-n+1):i], na.rm=na.rm)
  ans
}
fastma = function(x, n, na.rm) {
  if (!missing(na.rm)) stop("NAs are unsupported, wrongly propagated by cumsum")
  cs = cumsum(x)
  scs = shift(cs, n)
  scs[n] = 0
  as.double((cs-scs)/n)
}
system.time(ans1<-ma(x, n))
system.time(ans2<-fastma(x, n))
system.time(ans3<-frollmean(x, n))
system.time(ans4<-frollmean(x, n, algo="exact"))
system.time(ans5<-frollapply(x, n, mean))
anserr = list(
  fastma = ans2-ans1,
  froll_fast = ans3-ans1,
  froll_exact = ans4-ans1,
  frollapply = ans5-ans1
)
errs = sapply(lapply(anserr, abs), sum, na.rm=TRUE)
sapply(errs, format, scientific=FALSE) # roundoff

# frollapply corner cases
f = function(x) head(x, 2)     ## FUN returns non length 1
try(frollapply(1:5, 3, f))
f = function(x) {              ## FUN sometimes returns non length 1
  n = length(x)
  # length 1 will be returned only for first iteration where we check length
  if (n==x[n]) x[1L] else range(x) # range(x)[2L] is silently ignored!
}
frollapply(1:5, 3, f)
options(datatable.verbose=TRUE)
x = c(1,2,1,1,1,2,3,2)
frollapply(x, 3, uniqueN)     ## FUN returns integer
numUniqueN = function(x) as.numeric(uniqueN(x))
frollapply(x, 3, numUniqueN)
x = c(1,2,1,1,NA,2,NA,2)
frollapply(x, 3, anyNA)       ## FUN returns logical
as.logical(frollapply(x, 3, anyNA))
options(datatable.verbose=FALSE)
f = function(x) {             ## FUN returns character
  if (sum(x)>5) "big" else "small"
}
try(frollapply(1:5, 3, f))
f = function(x) {             ## FUN is not type-stable
  n = length(x)
  # double type will be returned only for first iteration where we check type
  if (n==x[n]) 1 else NA # NA logical turns into garbage without coercion to double
}
try(frollapply(1:5, 3, f))
