simFrame (version 0.1.2)

clusterSetup: Set up multiple samples on a snow cluster

Description

Generic function for setting up multiple samples on a snow cluster.

Usage

clusterSetup(cl, x, control, ...)

## S4 method for signature 'ANY,data.frame,SampleControl'
clusterSetup(cl, x, control)

Arguments

cl
a snow cluster.
x
the data.frame to sample from.
control
a control object inheriting from the virtual class "VirtualSampleControl" or a character string specifying such a control class (the default being "SampleControl").
...
if control is a character string or missing, the slots of the control object may be supplied as additional arguments.

Value

  • An object of class "SampleSetup".

Details

The computational performance of setting up multiple samples can be increased by parallel computing. In simFrame, parallel computing is implemented using the package snow. Note that all objects and packages required for the computations (including simFrame) need to be made available on every worker process.

To prevent problems with random numbers and to ensure reproducibility, random number streams should be used. In R, the packages rlecuyer and rsprng are available for creating random number streams, which are supported by snow via the function clusterSetupRNG.

The control class "SampleControl" is highly flexible and allows stratified sampling as well as sampling of whole groups rather than individuals with a specified sampling method. Hence, to extend the existing framework, it is often sufficient to implement the desired sampling method for the simple non-stratified case. Such a function should return a vector containing the indices of the sampled observations. See "SampleControl" for some restrictions on its argument names.

Nevertheless, for very complex sampling procedures, it is possible to define a control class "MySampleControl" extending "VirtualSampleControl", and the corresponding method clusterSetup(cl, x, control) with signature 'ANY, data.frame, MySampleControl'. To optimize computational performance, such a method needs to set up multiple samples efficiently. The slot k of "VirtualSampleControl" must be used to control the number of samples, and the resulting object must be of class "SampleSetup".
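As an illustration of a user-defined sampling method for the simple non-stratified case, the following sketch implements systematic sampling with a random start. The function name sys_sample is hypothetical, and the argument names N (population size) and size (sample size) are an assumption about what "SampleControl" expects. Check its documentation for the actual restrictions. The function uses base R only and returns a vector of indices, as required above.

```r
# Hypothetical sampling method: systematic sampling with a random start.
# Assumption: the argument names N (population size) and size (sample size)
# are compatible with the restrictions documented for "SampleControl".
sys_sample <- function(N, size) {
    stopifnot(length(N) == 1, length(size) == 1, size >= 1, size <= N)
    step <- N / size                         # sampling interval, may be fractional
    start <- runif(1, min = 0, max = step)   # random start within the first interval
    # take one position per interval and map [0, N) onto the indices 1..N
    idx <- floor(start + step * (seq_len(size) - 1)) + 1
    pmin(idx, N)                             # guard against floating-point overshoot
}

# stand-alone check on a population of 100 items
set.seed(1)
idx <- sys_sample(N = 100, size = 10)
idx
```

Assuming "SampleControl" accepts a user-defined sampling function (see its documentation for the relevant slot), a function of this kind could then be combined with the stratification and group sampling facilities that the control class already provides.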

References

L'Ecuyer, P., Simard, R., Chen, E. and Kelton, W. (2002) An object-oriented random-number package with many long streams and substreams. Operations Research, 50(6), 1073--1075.

Mascagni, M. and Srinivasan, A. (2000) Algorithm 806: SPRNG: a scalable library for pseudorandom number generation. ACM Transactions on Mathematical Software, 26(3), 436--461.

Rossini, A., Tierney, L. and Li, N. (2007) Simple parallel statistical computing in R. Journal of Computational and Graphical Statistics, 16(2), 399--420.

Tierney, L., Rossini, A. and Li, N. (2009) snow: A parallel computing framework for the R system. International Journal of Parallel Programming, 37(1), 78--90.

See Also

makeCluster, clusterSetupRNG, setup, draw, SampleControl, VirtualSampleControl, SampleSetup

Examples

# these examples require at least a dual-core processor

# load packages and data
library(simFrame)
library(snow)
data(eusilc)

# start snow cluster
cl <- makeCluster(2, type = "SOCK")

# load package and data on workers
clusterEvalQ(cl, {
        library(simFrame)
        data(eusilc)
    })

# simple random sampling
srss <- clusterSetup(cl, eusilc, size = 20, k = 4)
draw(eusilc[, c("id", "eqIncome")], srss, i = 1)

# group sampling
gss <- clusterSetup(cl, eusilc, group = "hid", size = 10, k = 4)
draw(eusilc[, c("hid", "id", "eqIncome")], gss, i = 2)

# stratified sampling
stss <- clusterSetup(cl, eusilc, design = "region", 
    size = c(2, 5, 5, 3, 4, 5, 3, 5, 2), k = 4)
draw(eusilc[, c("id", "region", "eqIncome")], stss, i = 3)

# stop cluster
stopCluster(cl)
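
The Details section recommends random number streams for reproducibility, but the examples above do not set them up. The following sketch shows how a snow cluster could be seeded with L'Ecuyer streams via clusterSetupRNG before any samples are set up. It assumes that the packages snow and rlecuyer are installed and is skipped otherwise. The seed value is arbitrary, and the variable streams_reproducible is introduced here purely for illustration.

```r
# Sketch: initialize random number streams on a snow cluster.
# Assumption: snow and rlecuyer are installed; otherwise nothing is run.
streams_reproducible <- NA
if (requireNamespace("snow", quietly = TRUE) &&
    requireNamespace("rlecuyer", quietly = TRUE)) {
    library(snow)
    cl <- makeCluster(2, type = "SOCK")
    seed <- rep(12345, 6)    # arbitrary illustrative seed of length 6

    # give every worker its own stream, then draw some numbers
    clusterSetupRNG(cl, type = "RNGstream", seed = seed)
    first <- clusterCall(cl, runif, 3)

    # re-initialize with the same seed and draw again
    clusterSetupRNG(cl, type = "RNGstream", seed = seed)
    second <- clusterCall(cl, runif, 3)

    # compare the two sets of draws
    streams_reproducible <- identical(first, second)
    stopCluster(cl)
}
streams_reproducible
```

In a simulation, the call to clusterSetupRNG would be placed right after the cluster is started and before clusterSetup is called, so that the samples are drawn from the seeded streams.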