simFrame (version 0.5.4)

clusterRunSimulation: Run a simulation experiment on a cluster

Description

Generic function for running a simulation experiment on a cluster.

Usage

clusterRunSimulation(cl, x, setup, nrep, control,
                     contControl = NULL, NAControl = NULL,
                     design = character(), fun, ...,
                     SAE = FALSE)

Arguments

cl

a cluster as generated by makeCluster.

x

a data.frame (for design-based simulation or simulation based on real data) or a control object for data generation inheriting from "VirtualDataControl" (for model-based simulation or mixed simulation designs).

setup

an object of class "SampleSetup", containing previously set up samples, or a control class for setting up samples inheriting from "VirtualSampleControl".

nrep

a non-negative integer giving the number of repetitions of the simulation experiment (for model-based simulation, mixed simulation designs or simulation based on real data).

control

a control object of class "SimControl".

contControl

an object of a class inheriting from "VirtualContControl", controlling contamination in the simulation experiment.

NAControl

an object of a class inheriting from "VirtualNAControl", controlling the insertion of missing values in the simulation experiment.

design

a character vector specifying variables (columns) to be used for splitting the data into domains. The simulations, including contamination and the insertion of missing values (unless SAE=TRUE), are then performed on every domain.

fun

a function to be applied in each simulation run.

...

for runSimulation, additional arguments to be passed to fun. For runSim, arguments to be passed to runSimulation.

SAE

a logical indicating whether small area estimation will be used in the simulation experiment.

Value

An object of class "SimResults".

Methods

cl = "ANY", x = "ANY", setup = "ANY", nrep = "ANY", control = "missing"

convenience wrapper that allows the slots of control to be supplied as arguments

cl = "ANY", x = "data.frame", setup = "missing", nrep = "numeric", control = "SimControl"

run a simulation experiment based on real data with repetitions on a cluster.

cl = "ANY", x = "data.frame", setup = "SampleSetup", nrep = "missing", control = "SimControl"

run a design-based simulation experiment with previously set up samples on a cluster.

cl = "ANY", x = "data.frame", setup = "VirtualSampleControl", nrep = "missing", control = "SimControl"

run a design-based simulation experiment on a cluster.

cl = "ANY", x = "VirtualDataControl", setup = "missing", nrep = "numeric", control = "SimControl"

run a model-based simulation experiment with repetitions on a cluster.

cl = "ANY", x = "VirtualDataControl", setup = "VirtualSampleControl", nrep = "numeric", control = "SimControl"

run a simulation experiment using a mixed simulation design with repetitions on a cluster.

Details

Statistical simulation is embarrassingly parallel, so computational performance can be increased by parallel computing. Since version 0.5.0, parallel computing in simFrame has been implemented using the package parallel, which has been part of the R base distribution since version 2.14.0 and builds upon work done for the contributed packages multicore and snow. Note that all objects and packages required for the computations (including simFrame) need to be made available on every worker process unless the worker processes are created by forking (see makeCluster).
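The need to make objects available on the workers can be illustrated with a minimal sketch that uses only the parallel package. The object name shiftValue is hypothetical and chosen purely for illustration; the sketch assumes a PSOCK cluster with two workers can be started on the local machine.

```r
library(parallel)

# start two worker processes (no forking, so workers begin with a fresh session)
cl <- makeCluster(2, type = "PSOCK")

# hypothetical object that so far exists only on the master process
shiftValue <- 10

# the workers do not know the object yet
before <- unlist(clusterEvalQ(cl, exists("shiftValue")))

# copy the object from the current environment to every worker
clusterExport(cl, "shiftValue", envir = environment())
after <- unlist(clusterEvalQ(cl, exists("shiftValue")))

stopCluster(cl)

before  # FALSE on both workers
after   # TRUE on both workers
```

Packages are handled analogously with clusterEvalQ(cl, library(simFrame)), as in the Examples section.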

In order to prevent problems with random numbers and to ensure reproducibility, random number streams should be used. With parallel, random number streams can be created via the function clusterSetRNGStream().
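As a minimal sketch of why random number streams help with reproducibility, the following code resets the streams with the same seed and draws the same numbers twice. It uses only the parallel package and assumes two local PSOCK workers can be started; the seed value is arbitrary.

```r
library(parallel)

cl <- makeCluster(2, type = "PSOCK")

# give every worker its own random number stream, derived from one seed
clusterSetRNGStream(cl, iseed = 12345)
first <- clusterEvalQ(cl, rnorm(3))

# resetting the streams with the same seed reproduces the draws
clusterSetRNGStream(cl, iseed = 12345)
second <- clusterEvalQ(cl, rnorm(3))

stopCluster(cl)

identical(first, second)            # same draws after resetting the streams
identical(first[[1]], first[[2]])   # the two workers use different streams
```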

There are some requirements for slot fun of the control object control. The function must return a numeric vector, or a list with the two components values (a numeric vector) and add (additional results of any class, e.g., statistical models). Note that the latter is computationally slightly more expensive. A data.frame is passed to fun in every simulation run. The corresponding argument must be called x. If comparisons with the original data need to be made, e.g., for evaluating the quality of imputation methods, the function should have an argument called orig. If different domains are used in the simulation, the indices of the current domain can be passed to the function via an argument called domain.
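The two permitted return types can be sketched as follows. The functions are called by hand on toy data here, so the code illustrates only the expected shape of fun, not how simFrame invokes it. The column name value, the function names simVec and simList, and the mean imputation are assumptions made for illustration.

```r
# toy data: 'orig' is the complete data, 'x' contains missing values
set.seed(1)
orig <- data.frame(value = rnorm(20))
x <- orig
x$value[c(3, 7)] <- NA

# variant 1: return a named numeric vector
simVec <- function(x) {
    c(mean = mean(x$value, na.rm = TRUE),
      median = median(x$value, na.rm = TRUE))
}

# variant 2: return a list with components 'values' and 'add';
# the argument 'orig' allows comparisons with the original data
simList <- function(x, orig) {
    imputed <- x$value
    imputed[is.na(imputed)] <- mean(imputed, na.rm = TRUE)  # mean imputation
    rmse <- sqrt(mean((imputed - orig$value)^2))
    list(values = c(rmse = rmse), add = summary(imputed))
}

simVec(x)
str(simList(x, orig))
```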

For small area estimation, the following points have to be kept in mind. The slot design of control for splitting the data must be supplied and the slot SAE must be set to TRUE. However, the data are not actually split into the specified domains. Instead, the whole data set (sample) is passed to fun. Contamination and missing values are likewise added to the whole data (sample). Finally, the function must have a domain argument so that the current domain can be extracted from the whole data (sample).
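A function for small area estimation might be shaped as in the following sketch. The indices of the current domain are supplied by hand here, whereas in a simulation they would be passed via the domain argument. The data, the column names area and value, and the function name simSAE are hypothetical.

```r
# toy sample with two areas
set.seed(1)
sampleData <- data.frame(area = rep(c("A", "B"), each = 5),
                         value = rnorm(10))

# 'x' is the whole sample, 'domain' holds the indices of the current domain
simSAE <- function(x, domain) {
    current <- x[domain, , drop = FALSE]      # extract the current domain
    c(direct = mean(current$value),           # uses the current domain only
      overall = mean(x$value))                # uses the whole sample
}

# manual call for area "A"
domainA <- which(sampleData$area == "A")
simSAE(sampleData, domain = domainA)
```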

In every simulation run, fun is evaluated using try. Hence no results are lost if computations fail in any of the simulation runs.
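The effect of try can be illustrated in base R. This is a sketch of the general pattern, not the internal code of simFrame; the function sim and its deliberate failure in the second run are made up for illustration.

```r
# hypothetical simulation function that fails in the second run
sim <- function(i) {
    if (i == 2) stop("computation failed in run ", i)
    c(estimate = i^2)
}

# wrapping each run in try() keeps the loop going after an error
runs <- lapply(1:3, function(i) try(sim(i), silent = TRUE))

# failed runs are returned as objects of class "try-error"
failed <- vapply(runs, inherits, logical(1), what = "try-error")
failed

# results of the successful runs are still available
do.call(rbind, runs[!failed])
```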

References

Alfons, A., Templ, M. and Filzmoser, P. (2010) An Object-Oriented Framework for Statistical Simulation: The R Package simFrame. Journal of Statistical Software, 37(3), 1-36. doi:10.18637/jss.v037.i03.

L'Ecuyer, P., Simard, R., Chen, E. and Kelton, W. (2002) An Object-Oriented Random-Number Package with Many Long Streams and Substreams. Operations Research, 50(6), 1073-1075.

Rossini, A., Tierney, L. and Li, N. (2007) Simple Parallel Statistical Computing in R. Journal of Computational and Graphical Statistics, 16(2), 399-420.

Tierney, L., Rossini, A. and Li, N. (2009) snow: A Parallel Computing Framework for the R System. International Journal of Parallel Programming, 37(1), 78-90.

See Also

makeCluster, clusterSetRNGStream, runSimulation, "SimControl", "SimResults", simBwplot, simDensityplot, simXyplot

Examples

# NOT RUN {
## these examples require at least a dual core processor


## design-based simulation
data(eusilcP)  # load data

# start cluster
cl <- makeCluster(2, type = "PSOCK")

# load package and data on workers
clusterEvalQ(cl, {
    library(simFrame)
    data(eusilcP)
})

# set up random number stream
clusterSetRNGStream(cl, iseed = 12345)

# control objects for sampling and contamination
sc <- SampleControl(size = 500, k = 50)
cc <- DARContControl(target = "eqIncome", epsilon = 0.02,
    fun = function(x) x * 25)

# function for simulation runs
sim <- function(x) {
    c(mean = mean(x$eqIncome), trimmed = mean(x$eqIncome, trim = 0.02))
}

# export objects to workers
clusterExport(cl, c("sc", "cc", "sim"))

# run simulation on cluster
results <- clusterRunSimulation(cl, eusilcP,
    sc, contControl = cc, fun = sim)

# stop cluster
stopCluster(cl)

# explore results
head(results)
aggregate(results)
tv <- mean(eusilcP$eqIncome)  # true population mean
plot(results, true = tv)



## model-based simulation

# start cluster
cl <- makeCluster(2, type = "PSOCK")

# load package on workers
clusterEvalQ(cl, library(simFrame))

# set up random number stream
clusterSetRNGStream(cl, iseed = 12345)

# function for generating data
rgnorm <- function(n, means) {
    group <- sample(1:2, n, replace=TRUE)
    data.frame(group=group, value=rnorm(n) + means[group])
}

# control objects for data generation and contamination
means <- c(0, 0.25)
dc <- DataControl(size = 500, distribution = rgnorm,
    dots = list(means = means))
cc <- DCARContControl(target = "value",
    epsilon = 0.02, dots = list(mean = 15))

# function for simulation runs
sim <- function(x) {
    c(mean = mean(x$value),
        trimmed = mean(x$value, trim = 0.02),
        median = median(x$value))
}

# export objects to workers
clusterExport(cl, c("rgnorm", "means", "dc", "cc", "sim"))

# run simulation on cluster
results <- clusterRunSimulation(cl, dc, nrep = 100,
    contControl = cc, design = "group", fun = sim)

# stop cluster
stopCluster(cl)

# explore results
head(results)
aggregate(results)
plot(results, true = means)
# }