addTransform: Add a Transformation Function to a Distributed Data Object

Description

Add a transformation function to be applied to each subset of a distributed data object

Usage

addTransform(obj, fn, name = NULL, params = NULL, packages = NULL)

Arguments

obj

a distributed data object

a function to be applied to each subset of obj - see details

name

optional name of the transformation

params

a named list of objects external to obj that are needed in the transformation function (most should be taken care of automatically such that this is rarely necessary to specify)

packages

a vector of R package names that contain functions used in fn (most should be taken care of automatically such that this is rarely necessary to specify)

Value

The distributed data object provided by obj, with the tranformation included as one of the attributes of the returned object.

Details

When you add a transformation to a distributed data object, the transformation is not applied immediately, but is deferred until a function that kicks off a computation is done. These include divide, recombine, drJoin, drLapply, drFilter, drSample, drSubset. When any of these are invoked on an object with a transformation attached to it, the transformation will be applied in the map phase of the MapReduce computation prior to any other computation. The transformation will also be applied any time a subset of the data is requested. Although the data has not been physically transformed after a call of addTransform, we can think of it conceptually as already being transformed.

To force the transformation to be immediately calculated on all subsets use: drPersist(dat, output = ...).

The function provided by fn can either accept one or two parameters. If it accepts one parameter, the value of a key-value pair is passed in. It if accepts two parameters, it is passed the key as the first parameter and the value as the second parameter. The return value of fn is treated as a value of a key-value pair unless the return type comes from kvPair.

When addTransform is called, it is tested on a subset of the data to make sure we have all of the necessary global variables and packages loaded necessary to portably perform the transformation.

It is possible to add multiple transformations to a distributed data object, in which case they are applied in the order supplied, but only one transform should be necessary.

The transformation function must not return NULL on any data subset, although it can return an empty object of the correct shape to match othersubsets (e.g. a data.frame with the correct columns but zero rows).

Examples

Run this code

# NOT RUN {
# Create a distributed data frame using the iris data set, backed by the
# kvMemory (in memory) connection
bySpecies <- divide(iris, by = "Species")
bySpecies
# Note a tranformation is not present in the attributes
names(attributes(bySpecies))
## A transform that operates only on values of the key-value pairs
##----------------------------------------------------------------
# Create a function that will calculate the mean of each variable in
# in a subset. The calls to 'as.data.frame()' and 't()' convert the
# vector output of 'apply()' into a data.frame with a single row
colMean <- function(x) as.data.frame(t(apply(x, 2, mean)))
# Test on a subset
colMean(bySpecies[[1]][[2]])
# Add a tranformation that will calculate the mean of each variable
bySpeciesTransformed <- addTransform(bySpecies, colMean)
# Note how 'before transformation' appears to describe the values of
# several of the attributes
bySpeciesTransformed
# Note the addition of the transformation to the attributes
names(attributes(bySpeciesTransformed))
# We can see the result of the transformation by looking at one of
# the subsets:
bySpeciesTransformed[[1]]
# The transformation is automatically applied when calling any data
# operation.  For example, if can call 'recombine()' with 'combRbind'
# we will get a data frame of the column means for each subset:
varMeans <- recombine(bySpeciesTransformed, combine = combRbind)
varMeans
## A transform that operates on both keys and values
##---------------------------------------------------------
# We can also create a transformation that uses both the keys and values
# It will select the first row of the value, and append '-firstRow' to
# the key
aTransform <- function(key, val) {
  newKey <- paste(key, "firstRow", sep = "-")
  newVal <- val[1,]
  kvPair(newKey, newVal)
}
# Apply the transformation
recombine(addTransform(bySpecies, aTransform))
# }

Run the code above in your browser using DataLab