# NOT RUN {
## in-memory example
##---------------------------------------------------------
# begin with an in-memory ddf (backed by kvMemory)
bySpecies <- divide(iris, by = "Species")
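# (optional check, not in the original example) inspect the division;
# getKeys() and kvExample() are assumed here to be the usual datadr accessors
bySpecies
getKeys(bySpecies)
kvExample(bySpecies)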
# create a function to calculate the mean for each variable
colMean <- function(x) data.frame(lapply(x, mean))
# apply the transformation
bySpeciesTransformed <- addTransform(bySpecies, colMean)
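# note that addTransform() is lazy -- the transform is attached to the object
# and applied when a computation such as recombine() runs; printing the object
# (an optional check, not in the original example) reflects this
bySpeciesTransformed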
# recombination with no 'combine' argument and no 'output' argument
# returns the list of key-value pairs produced by 'combCollect()'
recombine(bySpeciesTransformed)
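# (illustrative, not in the original example) the combCollect() result is a
# plain list of key-value pairs, so we can capture and index it directly
collected <- recombine(bySpeciesTransformed)
collected[[1]]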
# but we can also preserve the distributed data frame, like this:
recombine(bySpeciesTransformed, combine = combDdf)
# or we can recombine using 'combRbind()' and produce a data frame:
recombine(bySpeciesTransformed, combine = combRbind)
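# (illustrative, not in the original example) the combRbind() result is an
# ordinary data frame, so standard tools such as str() apply
speciesMeans <- recombine(bySpeciesTransformed, combine = combRbind)
str(speciesMeans)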
## local disk connection example with parallelization
##---------------------------------------------------------
# create a 2-node cluster that can be used to process in parallel
cl <- parallel::makeCluster(2)
# create the control object we'll pass into local disk datadr operations
control <- localDiskControl(cluster = cl)
# note that setting options(defaultLocalDiskControl = control)
# will cause this to be used by default in all local disk operations
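# (optional, as noted above) uncomment to make this the default control:
# options(defaultLocalDiskControl = control)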
# create local disk connection to hold bySpecies data
ldPath <- file.path(tempdir(), "by_species")
ldConn <- localDiskConn(ldPath, autoYes = TRUE)
# convert in-memory bySpecies to local-disk ddf
bySpeciesLD <- convert(bySpecies, ldConn)
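# (optional check, not in the original example) the data now lives on disk;
# base R's list.files() shows the files datadr wrote under ldPath
bySpeciesLD
list.files(ldPath, recursive = TRUE)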
# apply the transformation
bySpeciesTransformed <- addTransform(bySpeciesLD, colMean)
# recombine the data using the transformation
bySpeciesMean <- recombine(bySpeciesTransformed,
  combine = combRbind, control = control)
bySpeciesMean
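# (illustrative, not in the original example) the result is a data frame
# with one row of per-variable means for each species
head(bySpeciesMean)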
# remove the temporary directory
unlink(ldPath, recursive = TRUE)
# shut down the cluster
parallel::stopCluster(cl)
# }