ffapply: Apply for ff objects

Description

The ffapply functions support convenient batched processing of ff objects such that each single batch or chunk will not exhaust RAM and such that batchs have sizes as similar as possible, see bbatch. Differing from R's standard apply which applies a FUNction, the ffapply functions do apply an EXPRession and provide two indices FROM="i1" and TO="i2", which mark beginning and end of the batch and can be used in the applied expression. Note that the ffapply functions change the two indices in their parent frame, to avoid conflicts you can use different names through FROM="i1" and TO="i2". For support of creating return values see details.

Usage

ffvecapply(EXPR, X = NULL, N = NULL, VMODE = NULL, VBYTES = NULL, RETURN = FALSE
, CFUN = NULL, USE.NAMES = TRUE, FF_RETURN = TRUE, BREAK = ".break"
, FROM = "i1", TO = "i2"
, BATCHSIZE = .Machine$integer.max, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE)
ffrowapply(EXPR, X = NULL, N = NULL, NCOL = NULL, VMODE = NULL, VBYTES = NULL
, RETURN = FALSE, RETCOL = NCOL, CFUN = NULL, USE.NAMES = TRUE, FF_RETURN = TRUE
, FROM = "i1", TO = "i2"
, BATCHSIZE = .Machine$integer.max, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE)
ffcolapply(EXPR, X = NULL, N = NULL, NROW = NULL, VMODE = NULL, VBYTES = NULL
, RETURN = FALSE, RETROW = NROW, CFUN = NULL, USE.NAMES = TRUE, FF_RETURN = TRUE
, FROM = "i1", TO = "i2"
, BATCHSIZE = .Machine$integer.max, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE)
ffapply(EXPR = NULL, AFUN = NULL, MARGIN = NULL, X = NULL, N = NULL, DIM = NULL
, VMODE = NULL, VBYTES = NULL, RETURN = FALSE, CFUN = NULL, USE.NAMES = TRUE
, FF_RETURN = TRUE, IDIM = "idim"
, FROM = "i1", TO = "i2", BREAK = ".break"
, BATCHSIZE = .Machine$integer.max, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE)

Arguments

EXPR

the expression to be applied

AFUN

ffapply only: alternatively to EXPR the name of a function to be applied, automatically converted to EXPR

MARGIN

ffapply only: the margins along which to loop in ffapply

an ff object from which several parameters can be derived, if they are not given directly: N, NCOL, NROW, DIM, VMODE, VBYTES, FF_RETURN

the total number of elements in the loop, e.g. number of elements in ffvecapply or number of rows in ffrowapply

NCOL

ffrowapply only: the number of columns needed to calculate batch sizes

NROW

ffcolapply only: the number of rows needed to calculate batch sizes

DIM

ffapply only: the dimension of the array needed to calculate batch sizes

VMODE

the vmode needed to prepare the RETURN object and to derive VBYTES if they are not given directly

VBYTES

the bytes per cell -- see .rambytes -- to calculate the RAM requirements per cell

BATCHBYTES

the max number of bytes per batch, default getOption("ffbatchbytes")

BATCHSIZE

an additional restriction on the number of loop elements, default=.Machine$integer.max

FROM

the name of the index that marks the beginning of the batch, default 'i1', change if needed to avoid naming-conflicts in the calling frame

the name of the index that marks the end of the batch, default 'i2', change if needed to avoid naming-conflicts in the calling frame

IDIM

ffapply only: the name of an R variable used for loop-switching, change if needed to avoid naming-conflicts in the calling frame

BREAK

ffapply only: the name of an R object in the calling frame that triggers break out of the batch loop, if 1) it exists 2) is.logical and 3) is TRUE

RETURN

TRUE to prepare a return value (default FALSE)

CFUN

name of a collapsing function, see CFUN

RETCOL

NULL gives return vector[1:N], RETCOL gives return matrix[1:N, 1:RETCOL]

RETROW

NULL gives return vector[1:N], RETROW gives return matrix[1:RETROW, 1:N]

FF_RETURN

FALSE to return a ram object, TRUE to return an ff object, or an ff object that is ffsuitable to absorb the return data

USE.NAMES

FALSE to suppress attaching names or dimnames to the result

VERBOSE

TRUE to verbose the batches

Value

see details

Details

ffvecapply is the simplest ffapply method for ff_vectors. ffrowapply and ffcolapply is for ff_matrix, and ffapply is the most general method for ff_arrays and ff_vectors.

There are many ways to change the return value of the ffapply functions. In its simplest usage -- batched looping over an expression -- they don't return anything, see invisible. If you switch RETURN=TRUE in ffvecapply then it is assumed that all looped expressions together return one vector of length N, and via parameter FF_RETURN, you can decide whether this vector is in ram or is an ff object (or even which ff object to use). ffrowapply and ffcolapply additionally have parameter RETCOL resp. RETROW which defaults to returning a matrix of the original size; in order to just return a vector of length N set this to NULL, or specify a number of columns/rows for the return matrix. It is assumed that the expression will return appropriate pieces for this return structure (see examples). If you specify RETURN=TRUE and a collapsing function name CFUN, then it is assumed that the batched expressions return aggregated information, which is first collected in a list, and finally the collapsing function is called on this list: do.call(CFUN, list). If you want to return the unmodified list, you have to specify CFUN="list" for obvious reasons.

ffapply allows usages not completly unlike apply: you can specify the name of a function AFUN to be applied over MARGIN. However note that you must specify RETURN=TRUE in order to get a return value. Also note that currently ffapply assumes that your expression returns exactly one value per cell in DIM[MARGINS]. If you want to return something more complicated, you MUST specify a CFUN="list" and your return value will be a list with dim attribute DIM[MARGINS]. This means that for a function AFUN returning a scalar, ffapply behaves very similar to apply, see examples. Note also that ffapply might create a object named '.ffapply.dimexhausted' in its parent frame, and it uses a variable in the parent frame for loop-switching between dimensions, the default name 'idim' can be changed using the IDIM parameter. Finally you can break out of the implied loops by assigning TRUE to a variable with the name in BREAK.

Examples

Run this code

# NOT RUN {
   message("ffvecapply examples")
   x <- ff(vmode="integer", length=1000)
   message("loop evaluate expression without returning anything")
   ffvecapply(x[i1:i2] <- i1:i2, X=x, VERBOSE=TRUE)
   ffvecapply(x[i1:i2] <- i1:i2, X=x, BATCHSIZE=200, VERBOSE=TRUE)
   ffvecapply(x[i1:i2] <- i1:i2, X=x, BATCHSIZE=199, VERBOSE=TRUE)
   message("lets return the combined expressions as a new ff object")
   ffvecapply(i1:i2, N=length(x), VMODE="integer", RETURN=TRUE, BATCHSIZE=200)
   message("lets return the combined expressions as a new ram object")
   ffvecapply(i1:i2, N=length(x), VMODE="integer", RETURN=TRUE, FF_RETURN=FALSE, BATCHSIZE=200)
   message("lets return the combined expressions in existing ff object x")
   x[] <- 0L
   ffvecapply(i1:i2, N=length(x), VMODE="integer", RETURN=TRUE, FF_RETURN=x, BATCHSIZE=200)
   x
   message("aggregate and collapse")
   ffvecapply(summary(x[i1:i2]), X=x, RETURN=TRUE, CFUN="list", BATCHSIZE=200)
   ffvecapply(summary(x[i1:i2]), X=x, RETURN=TRUE, CFUN="crbind", BATCHSIZE=200)
   ffvecapply(summary(x[i1:i2]), X=x, RETURN=TRUE, CFUN="cmean", BATCHSIZE=200)

   message("how to do colSums with ffrowapply")
   x <- ff(1:10000, vmode="integer", dim=c(1000, 10))
   ffrowapply(colSums(x[i1:i2,,drop=FALSE]), X=x, RETURN=TRUE, CFUN="list", BATCHSIZE=200)
   ffrowapply(colSums(x[i1:i2,,drop=FALSE]), X=x, RETURN=TRUE, CFUN="crbind", BATCHSIZE=200)
   ffrowapply(colSums(x[i1:i2,,drop=FALSE]), X=x, RETURN=TRUE, CFUN="csum", BATCHSIZE=200)

   message("further ffrowapply examples")
   x <- ff(1:10000, vmode="integer", dim=c(1000, 10))
   message("loop evaluate expression without returning anything")
   ffrowapply(x[i1:i2, ] <- i1:i2, X=x, BATCHSIZE=200)
   message("lets return the combined expressions as a new ff object (x unchanged)")
   ffrowapply(2*x[i1:i2, ], X=x, RETURN=TRUE, BATCHSIZE=200)
   message("lets return a single row aggregate")
   ffrowapply(t(apply(x[i1:i2,,drop=FALSE], 1, mean)), X=x, RETURN=TRUE, RETCOL=NULL, BATCHSIZE=200)
   message("lets return a 6 column aggregates")
   y <- ffrowapply( t(apply(x[i1:i2,,drop=FALSE], 1, summary)), X=x
   , RETURN=TRUE, RETCOL=length(summary(0)), BATCHSIZE=200)
   colnames(y) <- names(summary(0))
   y
   message("determine column minima if a complete column does not fit into RAM")
   ffrowapply(apply(x[i1:i2,], 2, min), X=x, RETURN=TRUE, CFUN="pmin", BATCHSIZE=200)

   message("ffapply examples")
   x <- ff(1:720, dim=c(8,9,10))
   dimnames(x) <- dummy.dimnames(x)
   message("apply function with scalar return value")
   apply(X=x[], MARGIN=3:2, FUN=sum)
   apply(X=x[], MARGIN=2:3, FUN=sum)
   ffapply(X=x, MARGIN=3:2, AFUN="sum", RETURN=TRUE, BATCHSIZE=8)
   message("this is what CFUN is based on")
   ffapply(X=x, MARGIN=2:3, AFUN="sum", RETURN=TRUE, CFUN="list", BATCHSIZE=8)

   message("apply functions with vector or array return value currently have limited support")
   apply(X=x[], MARGIN=3:2, FUN=summary)
   message("you must use CFUN, the rest is up to you")
   y <- ffapply(X=x, MARGIN=3:2, AFUN="summary", RETURN=TRUE, CFUN="list", BATCHSIZE=8)
   y
   y[[1]]

   rm(x); gc()
# }

Run the code above in your browser using DataLab