ddR (version 0.1.2)

dframe: Creates a distributed data.frame with the specified partitioning and data.

Description

Creates a distributed data.frame with the specified partitioning and data.

Usage

dframe(nparts = NULL, dim = NULL, psize = NULL, data = 0)
DFrame(nparts = NULL, dim = NULL, psize = NULL, data = 0)

Arguments

nparts
vector specifying the number of partitions along rows and columns. If missing, 'psize' and 'dim' must be provided.
dim
the dim attribute for the data.frame to be created. A vector specifying number of rows and columns.
psize
size of each partition as a vector specifying number of rows and columns. This parameter is provided together with dim.
data
initial value of all elements in the data.frame. Default is 0.

Value

Returns a distributed data.frame with the specified dimensions. Data may reside as partitions in remote nodes.

Details

Data frame partitions are internally stored as data.frame objects. The last set of partitions may have fewer rows or columns if the dframe dimensions are not an integer multiple of the partition size. For example, the distributed data.frame 'dframe(dim=c(5,5), psize=c(2,5))' has three partitions. The first two partitions have two rows each, but the last partition has only one row. All three partitions have five columns.
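The partition arithmetic described above can be sketched in base R. This is only an illustration of the stated rule, not ddR's internal code: 'partition_layout' and 'block_sizes' are hypothetical helper names, and the sketch assumes a two-dimensional 'dim' and 'psize'.

```r
# Hypothetical helper (not part of ddR): derive the partition grid and the
# row/column count of every partition from 'dim' and 'psize'.
partition_layout <- function(dim, psize) {
  stopifnot(length(dim) == 2, length(psize) == 2, all(psize > 0))
  # Number of partitions along rows and columns; a trailing partial block counts.
  grid <- ceiling(dim / psize)
  # Extent of each block along one axis; the last block takes the remainder.
  block_sizes <- function(total, size, n) {
    sizes <- rep(size, n)
    sizes[n] <- total - size * (n - 1)
    sizes
  }
  row_sizes <- block_sizes(dim[1], psize[1], grid[1])
  col_sizes <- block_sizes(dim[2], psize[2], grid[2])
  # List partitions in row-major order: the column index varies fastest.
  sizes <- cbind(
    rows = rep(row_sizes, each = grid[2]),
    cols = rep(col_sizes, times = grid[1])
  )
  list(nparts = grid, psize = sizes)
}

# The example from the text: a 5x5 dframe split into 2x5 partitions.
layout <- partition_layout(dim = c(5, 5), psize = c(2, 5))
print(layout$nparts)
print(layout$psize)
```

For the example in the text this yields a 3 by 1 grid whose partitions have 2, 2 and 1 rows, each with five columns.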

Distributed data.frames can also be defined by specifying just the number of partitions, but not their sizes. This flexibility is useful when the size of a dframe is not known a priori. For example, 'dframe(nparts=c(5,1))' is a distributed data.frame with five partitions. Each partition can contain any number of rows, though all partitions should have the same number of columns so that they form a well-formed data.frame.

Distributed data.frames can be fetched at the master using collect. The number of partitions can be obtained with nparts. Partitions are numbered from left to right and then top to bottom, i.e., in row-major order. The dimensions of each partition can be obtained using psize.
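The row-major numbering can be made concrete with a small base R sketch. 'partition_position' is a hypothetical helper, not a ddR function; it assumes partition indices start at 1 and that 'nparts' gives the number of partitions along rows and then columns.

```r
# Hypothetical helper (not part of ddR): grid position of partition 'index'
# when partitions are numbered left to right, then top to bottom.
partition_position <- function(index, nparts) {
  stopifnot(length(nparts) == 2, all(index >= 1), all(index <= prod(nparts)))
  n_col_blocks <- nparts[2]
  cbind(
    block_row = (index - 1) %/% n_col_blocks + 1,
    block_col = (index - 1) %% n_col_blocks + 1
  )
}

# A 2 x 3 grid of partitions, as in 'dframe(nparts=c(2,3))'.
positions <- partition_position(1:6, nparts = c(2, 3))
print(positions)
```

Under these assumptions, partitions 1 to 3 fill the first row of the grid and partitions 4 to 6 fill the second.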

References

Prasad, S., Fard, A., Gupta, V., Martinez, J., LeFevre, J., Xu, V., Hsu, M., and Roy, I. (2015) Large scale predictive analytics in Vertica: Fast data transfer, distributed model creation and in-database prediction. _SIGMOD 2015_, 1657-1668.

Venkataraman, S., Bodzsar, E., Roy, I., AuYoung, A., and Schreiber, R. (2013) Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. _EuroSys 2013_, 197-210.

Homepage: https://github.com/vertica/ddR

See Also

collect, psize, dmapply

Examples

## Not run: 
# ## A 9 partition (each partition 3x3), 9x9 dframe with each element initialized to 5.
# a <- dframe(psize=c(3,3),dim=c(9,9),data=5)
# collect(a)
# b <- dframe(psize=c(3,3),dim=c(9,9)) # Same as 'a', but filled with 0s.
# ## An empty dframe with 6 partitions, 2 per column and 3 per row.
# d <- dframe(nparts=c(2,3)) # Named 'd' rather than 'c' to avoid masking base::c.
# ## End(Not run)
