title: "ddR README"
author: "Edward Ma, Indrajit Roy, Michael Lawrence"
date: "2015-10-22"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}

The 'ddR' package aims to provide a unified R interface for writing parallel and distributed applications. Our goal is to ensure that R programs written using the 'ddR' API work across different distributed backends, thereby reducing the effort required by users to understand and program on different backends. Currently, 'ddR' programs can be executed on R's default 'parallel' package as well as the open source HP Distributed R. We plan to add support for SparkR. This package is an outcome of feedback and collaboration across different companies and R-core members!

'ddR' is an API for expressing and executing distributed applications, and it includes a default execution engine. Users can declare distributed objects (i.e., dlist, dframe, darray) and execute parallel operations on these data structures using R-style apply functions. It also allows different backends (those that support ddR and have ddR "drivers" written for them) to be dynamically activated in the R user's environment to execute applications.

Please refer to the user guide under vignettes/ for a detailed description of how to use the package.

Some quick examples

library(ddR)

By default, the parallel backend is used with all the cores present on the machine. You can switch backends or specify the number of cores to use with the useBackend function. For example, you can specify that the parallel backend should be used with only 4 cores by executing useBackend(parallel, executors=4).

Initializing a distributed list (dlist):

a <- dmapply(function(x) { x }, rep(3,5))
collect(a)
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 3
## 
## [[5]]
## [1] 3

Printing a:

a
## 
## ddR Distributed Object
## Type: dlist
## # of partitions: 5
## Partitions per dimension: 5x1
## Partition sizes: [1], [1], [1], [1], [1]
## Length: 5
## Backend: parallel

a is a distributed object in ddR. Note that we did not specify the number of partitions of the output, but by default it is equal to the length of the inputs (5). Use the parameter nparts to specify how the output should be partitioned.

Below is the code to add 1 to the first element of a, 2 to the second, and so on, while placing the result in a single partition. The syntax of dmapply is similar to R's standard mapply function:

b <- dmapply(function(x,y) { x + y }, a, 1:5,nparts=1)
b
## 
## ddR Distributed Object
## Type: dlist
## # of partitions: 1
## Partitions per dimension: 1x1
## Partition sizes: [5]
## Length: 5
## Backend: parallel

Since we specified nparts=1 in dmapply, b has only one partition of 5 elements. Note that the argument nparts is optional and can always be omitted.

collect(b)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 7
## 
## [[5]]
## [1] 8
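
Because dmapply mirrors mapply, the result above can be cross-checked locally with base R. The sketch below is only a comparison aid: it uses plain lists (the names a_local and b_local are made up for this illustration), runs in a single process, and involves no partitioning. SIMPLIFY = FALSE is set so that mapply returns a list, which matches the shape of collect(b).

```r
# Local stand-in for the distributed list `a`: five elements, each equal to 3.
a_local <- as.list(rep(3, 5))

# Element-wise addition, pairing the i-th element of a_local with i.
# SIMPLIFY = FALSE keeps the result as a list instead of collapsing it to a vector.
b_local <- mapply(function(x, y) { x + y }, a_local, 1:5, SIMPLIFY = FALSE)

# Flatten only for compact printing.
print(unlist(b_local))
```

The printed values should agree, element by element, with the collect(b) output shown above.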

Some other operations:

Adding a to b, and then subtracting a constant value:

addThenSubtract <- function(x,y,z) {
  x + y - z
}
c <- dmapply(addThenSubtract,a,b,MoreArgs=list(z=5))
collect(c)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 5
## 
## [[5]]
## [1] 6
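
In this call, a and b are iterated over element by element, while z is passed unchanged to every invocation through MoreArgs. Base R's mapply has an argument of the same name, so the behaviour can be sketched locally. The lists a_local, b_local, and c_local below are illustrative stand-ins, not ddR objects.

```r
addThenSubtract <- function(x, y, z) {
  x + y - z
}

# Local stand-ins holding the same values as collect(a) and collect(b).
a_local <- as.list(rep(3, 5))
b_local <- as.list(c(4, 5, 6, 7, 8))

# x and y vary per element; z = 5 is held constant for every call via MoreArgs.
c_local <- mapply(addThenSubtract, a_local, b_local,
                  MoreArgs = list(z = 5), SIMPLIFY = FALSE)

# Flatten only for compact printing.
print(unlist(c_local))
```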

We can also process distributed objects partitionwise. Below is an example where we calculate the length of each partition:

d <- dmapply(function(x) length(x),parts(a))
collect(d)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 1
## 
## [[5]]
## [1] 1

We partitioned a with 5 parts and it had 5 elements, so the length of each partition is 1.

However, b only had one partition, so that one partition should be of length 5:

e <- dmapply(function(x) length(x),parts(b))
collect(e)
## [[1]]
## [1] 5

Note that parts() and non-parts arguments can be passed to dmapply in any combination. parts(dobj) returns a list of the partitions of that distributed object, which can be passed into dmapply like any other list. parts(dobj,index), where index is a list, vector, or scalar, returns a specific partition or range of partitions of dobj.
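
As a minimal sketch of the indexed form, the snippet below measures only the first two partitions of a. It assumes ddR is installed and the default parallel backend is active; the guard makes it a no-op otherwise. The name first_two is made up for this illustration.

```r
# Guarded so the snippet does nothing harmful when ddR is not installed.
if (requireNamespace("ddR", quietly = TRUE)) {
  library(ddR)

  # Same distributed list as in the earlier example: 5 elements, 5 partitions.
  a <- dmapply(function(x) { x }, rep(3, 5))

  # parts(a, 1:2) selects partitions 1 and 2; dmapply then runs once per partition.
  first_two <- dmapply(function(p) length(p), parts(a, 1:2))

  print(collect(first_two))
} else {
  message("ddR is not installed; skipping the parts() example.")
}
```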

We also support darrays and dframes. See the user guide under vignettes/ for how to use them.

For more interesting parallel machine learning algorithms, you may view (and run) the example scripts under /examples.

Using the Distributed R backend

To use the Distributed R library for ddR, first install distributedR.ddR and then load it:

library(distributedR.ddR)
## Loading required package: distributedR
## Loading required package: Rcpp
## Loading required package: RInside
## Loading required package: XML
## Loading required package: ddR
## 
## Attaching package: 'ddR'
## 
## The following objects are masked from 'package:distributedR':
## 
##     darray, dframe, dlist, is.dlist
useBackend(distributedR)
## Master address:port - 127.0.0.1:50000

Now you can try the different list examples which were used with the 'parallel' backend.

How to Contribute

You can help us in different ways:

  1. Reporting issues.
  2. Contributing code and sending a pull request.

In order to contribute to the code base of this project, you must agree to the Developer Certificate of Origin (DCO) 1.1 for this project under GPLv2+:

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I have the 
    right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my 
    knowledge, is covered under an appropriate open source license and I 
    have the right under that license to submit that work with modifications, 
    whether created in whole or in part by me, under the same open source 
    license (unless I am permitted to submit under a different license), 
    as indicated in the file; or
(c) The contribution was provided directly to me by some other person who 
    certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and
    that a record of the contribution (including all personal information I submit 
    with it, including my sign-off) is maintained indefinitely and may be 
    redistributed consistent with this project or the open source license(s) involved.

To indicate acceptance of the DCO you need to add a Signed-off-by line to every commit. E.g.:

Signed-off-by: John Doe <john.doe@hisdomain.com>

To automatically add that line use the -s switch when running git commit:

$ git commit -s

Install: install.packages('ddR')

Monthly Downloads: 28

Version: 0.1.2

License: GPL (>= 2) | file LICENSE

Maintainer: Edward Ma

Last Published: November 25th, 2015

Functions in ddR (0.1.2)

dlist
Creates a distributed list with the specified partitioning and data.

collect
Fetches partition(s) of a 'darray', 'dframe' or 'dlist' from remote workers.

colnames,DObject-method
Gets the colnames for the distributed object.

cbind,DObject-method
Column binds the objects.

as.darray
Converts an input matrix into a distributed array.

dimnames<-,DObject,list-method
Sets the dimnames for the distributed object.

$,DObject-method
Extracts elements of a distributed object matching the name.

rbind,DObject-method
Row binds the arguments.

repartition
Repartitions a distributed object. This function takes two inputs, a distributed object and a skeleton. These inputs must both be distributed objects of the same type and same dimension. If 'dobj' and 'skeleton' have different internal partitioning, this function will return a new distributed object with the same internal data as in 'dobj' but with the partitioning scheme of 'skeleton'.

colMeans,DObject-method
Gets the column means for a distributed array or data.frame.

useBackend
Sets the active backend driver. Functions exported by the 'ddR' package are dispatched to the backend driver. Backend-specific initialization parameters may be passed into the ellipsis (...) part of the function arguments.

is.dobject
Returns whether the input entity is a DObject.

ddR
Distributed Data-structures in R.

init
Called when the backend driver is initialized.

combine
Combines a list of partitions into a single distributed object (this can be implemented by a frontend wrapper without actually combining data in storage).

dmapply
Distributed version of mapply. Similar to R's 'mapply', it allows a multivariate function, FUN, to be applied to several inputs. Unlike standard mapply, it always returns a distributed object.

is.dlist
Returns whether the input is a dlist.

do_collect
Backend-implemented function to move data from storage to the calling context (node).

getPartitionIdsAndOffsets
Gets the internal set of partitions, and offsets within each partition, of a set of 1d or 2d subset indices for a distributed object.

parts
Retrieves, as a list of independent objects, pointers to each individual partition of the input.

getBestOutputPartitioning
This is an overrideable function that determines what the output partitioning scheme of a dlapply or dmapply function should be. It determines the 'ideal' nparts for the output if it is not supplied. For API standard-enforcement, overriding this is not recommended.

names<-,DObject-method
Sets the names of a distributed object.

sum,DObject-method
Gets the sum of the objects.

colSums,DObject-method
Gets the column sums for a distributed array or data.frame.

as.dframe
Converts an input matrix or data.frame into a distributed data.frame.

nparts
Returns a 2d-vector denoting the number of partitions existing along each dimension of the distributed object, where the vector is c(partitions_per_column, partitions_per_row). For a dlist, the value is equivalent to c(totalParts(dobj),1).

is.sparse_darray
Returns whether the input is a sparse_darray.

[[,DObject,numeric-method
Extracts a single element of a distributed object.

do_dmapply
Backend-specific dmapply logic. This is a required override for all backends to implement so dmapply works.

ddRDriver-class
The base S4 class for backend driver classes to extend.

DObject-class
The baseline distributed object class to be extended by each backend driver. Backends may elect to extend once for all distributed object types ('dlist', 'darray', 'dframe', etc.) or once per type, depending on needs.

parallel
The default parallel driver.

mean,DObject-method
Gets the mean value of the elements within the object.

[
Extracts parts of a distributed object.

rowSums,DObject-method
Gets the row sums for a distributed array or data.frame.

darray
Creates a distributed array with the specified partitioning and contents.

is.dframe
Returns whether the input is a dframe.

get_parts
Gets the partitions of a distributed object, given an index.

as.dlist
Creates a distributed list from the input.

rbind
rbindddR

is.darray
Returns whether the input is a darray.

totalParts
Returns the total number of partitions of the distributed object. The result is the same as prod(nparts(dobj)).

rowMeans,DObject-method
Gets the row means for a distributed array or data.frame.

rownames,DObject-method
Gets the rownames for the distributed object.

psize
Returns the sizes of each partition of the input distributed object.

dframe
Creates a distributed data.frame with the specified partitioning and data.

dimnames,DObject-method
Gets the dimnames for the distributed object.

shutdown
Called when the backend driver is shut down.

dlapply
Distributed version of 'lapply'. Similar to dmapply, but permits only one iterable argument, and output.type is always 'dlist'.
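
To tie a few of the entries above together, here is a small hedged sketch. It assumes ddR is installed with the default parallel backend, and that dlapply takes its arguments in the same order as base lapply (input first, function second); that ordering is an assumption based on the description above, not something this index states. The name squares is made up for this illustration.

```r
# Guarded so the snippet does nothing harmful when ddR is not installed.
if (requireNamespace("ddR", quietly = TRUE)) {
  library(ddR)

  # One iterable argument, as described for dlapply; the result is a dlist.
  squares <- dlapply(1:4, function(x) x^2)
  print(unlist(collect(squares)))

  # The index states that totalParts(dobj) equals prod(nparts(dobj)).
  stopifnot(totalParts(squares) == prod(nparts(squares)))
} else {
  message("ddR is not installed; skipping the dlapply example.")
}
```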