runSimulation: Run a Monte Carlo simulation given a data.frame of conditions and simulation functions

Description

This function runs a Monte Carlo simulation study given a set of predefined simulation functions, design conditions, and number of replications. Results can be saved as temporary files in case of interruptions and may be restored by re-running runSimulation, provided that the respective temp file can be found in the working directory. runSimulation supports parallel and cluster computing, global and local debugging, error handling (including fail-safe stopping when functions fail too often, even across nodes), provides bootstrap estimates of the sampling variability (optional), and tracking of error and warning messages. For convenience, all functions available in the R workspace are exported across all computational nodes so that they are more easily accessible (however, other R objects are not, and therefore must be passed to the fixed_objects input to become available across nodes). For a didactic presentation of the package refer to Sigal and Chalmers (2016; 10.1080/10691898.2016.1246953), and see the associated wiki on Github (https://github.com/philchalmers/SimDesign/wiki) for other tutorial material, examples, and applications of SimDesign to real-world simulations.

Usage

runSimulation(design, replications, generate, analyse, summarise,
  fixed_objects = NULL, packages = NULL, bootSE = FALSE,
  boot_draws = 1000L, filename = "SimDesign-results",
  seed = rint(nrow(design), min = 1L, max = 2147483647L), save = FALSE,
  save_results = FALSE, store_results = FALSE,
  warnings_as_errors = FALSE, save_seeds = FALSE, load_seed = NULL,
  parallel = FALSE, ncores = parallel::detectCores(), cl = NULL,
  MPI = FALSE, max_errors = 50L, as.factor = TRUE,
  save_generate_data = FALSE, save_details = list(), edit = "none",
  progress = TRUE, verbose = TRUE)
# S3 method for SimDesign
print(x, drop.extras = FALSE, drop.design = FALSE,
  format.time = TRUE, ...)
# S3 method for SimDesign
head(x, ...)
# S3 method for SimDesign
tail(x, ...)
# S3 method for SimDesign
summary(object, ...)
extract_results(object)
extract_error_seeds(object)
# S3 method for SimDesign
as.data.frame(x, ...)

Arguments

design

a data.frame object containing the Monte Carlo simulation conditions to be studied, where each row represents a unique condition and each column a factor to be varied

replications

number of replication to perform per condition (i.e., each row in design). Must be greater than 0

generate

user-defined data and parameter generating function. See Generate for details. Note that this argument may be omitted by the user if they wish to generate the data with the analyse step, but for real-world simulations this is generally not recommended

analyse

user-defined computation function which acts on the data generated from Generate (or, if generate was omitted, both generates and analyses the simulated data). See Analyse for details

summarise

optional (but highly recommended) user-defined summary function to be used after all the replications have completed within each design condition. Omitting this function will return a list of matrices (or a single matrix, if only one row in design is supplied) or, for more general objects (such as lists), a list containing the results returned form Analyse. Ommiting this function is only recommended for didactic purposes because it leaves out a large amount of information (e.g., try-errors, warning messages, etc), can witness memory related issues, and generally is not as flexible internally. See the save_results option for a more RAM friendly alternative to storing all the Generate-Analyse results in the working environment

fixed_objects

(optional) an object (usually a named list) containing additional user-defined objects that should remain fixed across conditions. This is useful when including long fixed vectors/matrices of population parameters, data that should be used across all conditions and replications (e.g., including a fixed design matrix for linear regression), or simply control constant global elements such as sample size

packages

a character vector of external packages to be used during the simulation (e.g., c('MASS', 'extraDistr', 'simsem') ). Use this input when parallel = TRUE or MPI = TRUE to use non-standard functions from additional packages, otherwise the functions must be made available by using explicit library or require calls within the provided simulation functions. Alternatively, functions can be called explicitly without attaching the package with the :: operator (e.g., extraDistr::rgumbel())

bootSE

logical; perform a non-parametric bootstrap to compute bootstrap standard error estimates for the respective meta-statistics computed by the Summarise function? When TRUE, bootstrap samples for each row in Design will be obtained after the generate-analyse steps have obtain the simulation results to be summarised so that standard errors for each statistic can be computed. To compute large-sample confidence intervals given the bootstrap SE estimates see SimBoot.

This option is useful to approximate how accurate the resulting meta-statistic estimates were, particularly if the number of replications was relatively low (e.g., less than 5000). If users prefer to obtain alternative bootstrap estimates then consider passing save_results = TRUE, reading the generate-analyse data into R via SimResults, and performing the bootstrap manually with function found in the external boot package

boot_draws

number of non-parametric bootstrap draws to sample for the summarise function after the generate-analyse replications are collected. Default is 1000

filename

(optional) the name of the .rds file to save the final simulation results to when save = TRUE. If the same file name already exists in the working directly at the time of saving then a new file will be generated instead and a warning will be thrown. This helps to avoid accidentally overwriting existing files. Default is 'SimDesign-results'

seed

a vector of integers to be used for reproducibility. The length of the vector must be equal the number of rows in design. This argument calls set.seed or clusterSetRNGStream for each condition, respectively, but will not be run when MPI = TRUE. Default randomly generates seeds within the range 1 to 2147483647 for each condition.

save

logical; save the simulation state and final results to the hard-drive? This is useful for simulations which require an extended amount of time. When TRUE, a temp file will be created in the working directory which allows the simulation state to be saved and recovered (in case of power outages, crashes, etc). As well, triggering this flag will save any fatal .Random.seed states when conditions unexpectedly crash (where each seed is stored row-wise in an external .rds file), which provides a much easier mechanism to debug the issue (see load_seed for details).

To recover your simulation at the last known location simply re-run the code you used to initially define the simulation and the external file will automatically be detected and read-in. Upon completion, the final results will be saved to the working directory, and the temp file will be removed. Default is FALSE

save_results

logical; save the results returned from Analyse to external .rds files located in the defined save_results_dirname directory/folder? Use this if you would like to keep track of the individual parameters returned from the analyses. Each saved object will contain a list of three elements containing the condition (row from design), results (as a list or matrix), and try-errors. When TRUE, a temp file will be used to track the simulation state (in case of power outages, crashes, etc). When TRUE, temporary files will also be saved to the working directory (in the same way as when save = TRUE). See SimResults for an example of how to read these .rds files back into R after the simulation is complete. Default is FALSE.

WARNING: saving results to your hard-drive can fill up space very quickly for larger simulations. Be sure to test this option using a smaller number of replications before the full Monte Carlo simulation is performed.

store_results

logical; store the complete tables of simulation results in the returned object? This is FALSE by default to help avoid RAM issues (see save_results as a more suitable alternative). To extract these results pass the returned object to extract_results, which will return a named list of all the simulation results for each condition

warnings_as_errors

logical; treat warning messages as errors during the simulation? Default is FALSE, therefore warnings are only collected and not used to restart the data generation step

save_seeds

logical; save the .Random.seed states prior to performing each replication into plain text files located in the defined save_seeds_dirname directory/folder? Use this if you would like to keep track of the simulation state within each replication and design condition. Primarily, this is useful for completely replicating any cell in the simulation if need be, especially when tracking down hard-to-find errors and bugs. As well, see the load_seed input to load a given .Random.seed to exactly replicate the generated data and analysis state (mostly useful for debugging). When TRUE, temporary files will also be saved to the working directory (in the same way as when save = TRUE). Default is FALSE

load_seed

a character object indicating which file to load from when the .Random.seeds have be saved (after a call with save_seeds = TRUE), or an integer vector indicating the actual .Random.seed values. E.g., load_seed = 'design-row-2/seed-1' will load the first seed in the second row of the design input, or explicilty passing the 626 long elements from .Random.seed (see extract_error_seed() to extract the seeds associated explicilty with errors during the simulation, where each column represents a unique seed). If the input is a character vector then it is important NOT to modify the design input object, otherwise the path may not point to the correct saved location, while if the input is an integer vector then it WILL be important to modify the design input in order to load this exact seed for the corresponding design row. Default is NULL

parallel

logical; use parallel processing from the parallel package over each unique condition?

ncores

number of cores to be used in parallel execution. Default uses all available

cluster object defined by makeCluster used to run code in parallel. If NULL and parallel = TRUE, a local cluster object will be defined which selects the maximum number cores available and will be stop the cluster when the simulation is complete. Note that supplying a cl object will automatically set the parallel argument to TRUE

MPI

logical; use the foreach package in a form usable by MPI to run simulation in parallel on a cluster? Default is FALSE

max_errors

the simulation will terminate when more than this number of consecutive errors are thrown in any given condition. The purpose of this is to indicate that something fatally problematic is likely going wrong in the generate-analyse phases and should be inspected. Default is 50

as.factor

logical; coerce the input design elements into factors when the simulation is complete? If the columns inputs are numeric then these will be treated as ordered. Default is TRUE

save_generate_data

logical; save the data returned from Generate to external .rds files located in the defined save_generate_data_dirname directory/folder? When TRUE, temporary files will also be saved to the working directory (in the same way as when save = TRUE). Default is FALSE

WARNING: saving data to your hard-drive can fill up space very quickly for larger simulations. Be sure to test this option using a smaller number of replications before the full Monte Carlo simulation is performed. It is generally recommended to leave this argument as FALSE because saving datasets will often consume a needless amount of disk space, and by-and-large saving data is not required for simulations.

save_details

a list pertaining to information regarding how and where files should be saved when the save, save_results, or save_generate_data flags are triggered.

safe: logical; trigger whether safe-saving should be performed. When TRUE files will never be overwritten accidentally, and where appropriate the program will either stop or generate new files with unique names. Default is TRUE
compname: name of the computer running the simulation. Normally this doesn't need to be modified, but in the event that a manual node breaks down while running a simulation the results from the temp files may be resumed on another computer by changing the name of the node to match the broken computer. Default is the result of evaluating unname(Sys.info()['nodename'])
out_rootdir: root directory to save all files to. Default uses the current working directory
tmpfilename: the name of the temporary .rds file when any of the save flags are used. This file will be read-in if it is in the working directory and the simulation will continue at the last point this file was saved (useful in case of power outages or broken nodes). Finally, this file will be deleted when the simulation is complete. Default is the system name (compname) appended to 'SIMDESIGN-TEMPFILE_'
save_results_dirname: a string indicating the name of the folder to save result objects to when save_results = TRUE. If a directory/folder does not exist in the current working directory then a unique one will be created automatically. Default is 'SimDesign-results_' with the associated compname appended
save_seeds_dirname: a string indicating the name of the folder to save .Random.seed objects to when save_seeds = TRUE. If a directory/folder does not exist in the current working directory then one will be created automatically. Default is 'SimDesign-seeds_' with the associated compname appended
save_generate_data_dirname: a string indicating the name of the folder to save data objects to when save_generate_data = TRUE. If a directory/folder does not exist in the current working directory then one will be created automatically. Within this folder nested directories will be created associated with each row in design. Default is 'SimDesign-generate-data_' with the compname appended

edit

a string indicating where to initiate a browser() call for editing and debugging. General options are 'none' (default) and 'all', which are used to disable debugging and to debug all the user defined functions, respectively. Specific options include: 'generate' to edit the data simulation function, 'analyse' to edit the computational function, and 'summarise' to edit the aggregation function.

Alternatively, users may place browser calls within the respective functions for debugging at specific lines (note: parallel computation flags will automatically be disabled when a browser() is detected)

progress

logical; display a progress bar for each simulation condition? This is useful when simulations conditions take a long time to run. Uses the pbapply package to display the progress. Default is FALSE

verbose

logical; print messages to the R console? Default is TRUE

SimDesign object returned from runSimulation

drop.extras

logical; don't print information about warnings, errors, simulation time, and replications? Default is FALSE

drop.design

logical; don't include information about the (potentially factorized) simulation design? This may be useful if you wish to cbind() the original design data.frame to the simulation results instead of using the auto-factorized version. Default is FALSE

format.time

logical; format SIM_TIME into a day/hour/min/sec character vector? Default is TRUE

...

additional arguments

object

SimDesign object returned from runSimulation

Value

a data.frame (also of class 'SimDesign') with the original design conditions in the left-most columns, simulation results and ERROR/WARNING's (if applicable) in the middle columns, and additional information (such as REPLICATIONS, SIM_TIME, COMPLETED, and SEED) in the right-most columns.

Saving data, results, seeds, and the simulation state

To conserve RAM, temporary objects (such as data generated across conditions and replications) are discarded; however, these can be saved to the hard-disk by passing the appropriate flags. For longer simulations it is recommended to use save = TRUE to temporarily save the simulation state, and to use the save_results flag to write the analysis results the to hard-disc.

The generated data can be saved by passing save_generate_data = TRUE, however it is often more memory efficient to use the save_seeds option instead to only save R's .Random.seed state instead (still allowing for complete reproducibility); individual .Random.seed terms may also be read in with the load_seed input to reproduce the exact simulation state at any given replication. Finally, providing a vector of seeds is also possible to ensure that each simulation condition is completely reproducible under the single/multi-core method selected.

The load_seed input will also accept an integer vector corresponding to the exact .Random.seed state. This is helpful because SimDesign also tracks these seeds for simulation conditions that threw errors, where these values can be extracted via the extract_error_seeds() function. The column names indicate the respective design row (first number), the order in which the errors were thrown (second number), and finally the error message string (coerced to a proper data.frame column name). After this data.frame object is extracted, individual columns can be passed to load_seed to replicate the exact error issue that appeared (note that the design object must be indexed manually to ensure that the correct design conditions is paired with this exact .Random.seed state).

Finally, when the Monte Carlo simulation is complete it is recommended to write the results to a hard-drive for safe keeping, particularly with the save and filename arguments provided (for reasons that are more obvious in the parallel computation descriptions below). Using the filename argument (along with save = TRUE) supplied is much safer than using something like saveRDS directly because files will never accidentally be overwritten, and instead a new file name will be created when a conflict arises; this type of safety is prevalent in many aspects of the package and helps to avoid many unrecoverable (yet surprisingly common) mistakes.

Resuming temporary results

In the event of a computer crash, power outage, etc, if save = TRUE was used then the original code used to execute runSimulation() need only be re-run to resume the simulation. The saved temp file will be read into the function automatically, and the simulation will continue one the condition where it left off before the simulation state was terminated.

A note on parallel computing

When running simulations in parallel (either with parallel = TRUE or MPI = TRUE) R objects defined in the global environment will generally not be visible across nodes. Hence, you may see errors such as Error: object 'something' not found if you try to use an object that is defined in the workspace but is not passed to runSimulation. To avoid this type or error, simply pass additional objects to the fixed_objects input (usually it's convenient to supply a named list of these objects). Fortunately, however, custom functions defined in the global environment are exported across nodes automatically. This makes it convenient when writing code because custom functions will always be available across nodes if they are visible in the R workspace. As well, note the packages input to declare packages which must be loaded via library() in order to make specific non-standard R functions available across nodes.

Details

The strategy for organizing the Monte Carlo simulation work-flow is to

1): Define a suitable design data.frame object containing fixed conditional information about the Monte Carlo simulations. This is often expedited by using the expand.grid function, and if necessary using the subset function to remove redundant or non-applicable rows
2): Define the three step functions to generate the data (Generate; see also https://CRAN.R-project.org/view=Distributions for a list of distributions in R), analyse the generated data by computing the respective parameter estimates, detection rates, etc (Analyse), and finally summarise the results across the total number of replications (Summarise). Note that these functions can be automatically generated by using the SimFunctions function.
3): Pass the above objects to the runSimulation function, and declare the number of replications to perform with the replications input. This function will accept a design data.frame object and will return a suitable data.frame object with the simulation results
4): Analyze the output from runSimulation, possibly using ANOVA techniques (SimAnova) and generating suitable plots and tables

More succinctly, the functions to be called follow the following form with the exact inputs required by the SimDesign package

Design <- expand.grid(...)
Generate <- function(condition, fixed_objects = NULL) {...}
Analyse <- function(condition, dat, fixed_objects = NULL) {...}
Summarise <- function(condition, results, fixed_objects = NULL) {...}
results <- runSimulation(design=Design, replications, generate=Generate, analyse=Analyse, summarise=Summarise)

The condition object represents a single row from the design object, indicating a unique Monte Carlo simulation condition. The condition object also contains two additional elements to help track the simulation's state: an ID variable, indicating the respective row number in the design object, and a REPLICATION element indicating the replication iteration number. Mainly, these are included to help with debugging, where users can easily locate the rth replication (e.g., REPLICATION == 500) within the jth row in the simulation design (e.g., ID == 2). The REPLICATION input is also useful when temporarily saving files to the hard-drive when calling external command line utilities.

For a skeleton version of the work-flow, which is often useful when initially defining a simulation, see SimFunctions. This function will write template simulation code to one/two files so that modifying the required functions and objects can begin immediately with minimal error. This means that you can focus on your Monte Carlo simulation immediately rather than worrying about the administrative code-work required to organize the simulation work-flow.

Additional information for each condition are also contained in the data.frame object returned by runSimulation: REPLICATIONS to indicate the number of Monte Carlo replications, SIM_TIME to indicate how long (in seconds) it took to complete all the Monte Carlo replications for each respective design condition, COMPLETED to indicate the date in which the given simulation condition completed, SEED for the integer values in the seed argument, columns containing the number of replications which had to be re-run due to errors (where the error messages represent the names of the columns prefixed with a ERROR: string), and columns containing the number of warnings prefixed with a WARNING: string. Finally, if bootSE = TRUE was included then the final right-most columns will contain the labels BOOT_SE. followed by the name of the associated meta-statistic defined in summarise().

Additional examples, presentation files, and tutorials can be found on the package wiki located at https://github.com/philchalmers/SimDesign/wiki.

References

Sigal, M. J., & Chalmers, R. P. (2016). Play it again: Teaching statistics with Monte Carlo simulation. Journal of Statistics Education, 24(3), 136-156. 10.1080/10691898.2016.1246953

Examples

Run this code

# NOT RUN {
#-------------------------------------------------------------------------------
# Example 1: Sampling distribution of mean

# This example demonstrate some of the simpler uses of SimDesign,
# particularly for classroom settings. The only factor varied in this simulation
# is sample size.

# skeleton functions to be saved and edited
SimFunctions()

#### Step 1 --- Define your conditions under study and create design data.frame

Design <- data.frame(N = c(10, 20, 30))

#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions

# help(Generate)
Generate <- function(condition, fixed_objects = NULL){
    dat <- with(condition, rnorm(N, 10, 5)) # distributed N(10, 5)
    dat
}

# help(Analyse)
Analyse <- function(condition, dat, fixed_objects = NULL){
    ret <- mean(dat) # mean of the sample data vector
    ret
}

# help(Summarise)
Summarise <- function(condition, results, fixed_objects = NULL){
    ret <- c(mu=mean(results), SE=sd(results)) # mean and SD summary of the sample means
    ret
}


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design

# run the simulation
Final <- runSimulation(design=Design, replications=1000,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
Final

# reproduce exact simulation
Final_rep <- runSimulation(design=Design, replications=1000, seed=Final$SEED,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
Final_rep


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Extras
# compare SEs estimates to the true SEs from the formula sigma/sqrt(N)
5 / sqrt(Design$N)

# To store the results from the analyse function either
#   a) omit a definition of of summarise(), or
#   b) pass save_results = TRUE to runSimulation() and read the results in with SimResults()

# e.g., the a) approach
results <- runSimulation(design=Design, replications=1000,
                       generate=Generate, analyse=Analyse)
str(results)
head(results[[1]])

# or b) approach
Final <- runSimulation(design=Design, replications=1000, save_results=TRUE,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
results <- SimResults(Final)
str(results)
head(results[[1]]$results)

# remove the saved results from the hard-drive if you no longer want them
SimClean(results = TRUE)




#-------------------------------------------------------------------------------
# Example 2: t-test and Welch test when varying sample size, group sizes, and SDs

# skeleton functions to be saved and edited
SimFunctions()

# }
# NOT RUN {
# in real-world simulations it's often better/easier to save
# these functions directly to your hard-drive with
SimFunctions('my-simulation')
# }
# NOT RUN {
#### Step 1 --- Define your conditions under study and create design data.frame

Design <- expand.grid(sample_size = c(30, 60, 90, 120),
                      group_size_ratio = c(1, 4, 8),
                      standard_deviation_ratio = c(.5, 1, 2))
dim(Design)
head(Design)

#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions

Generate <- function(condition, fixed_objects = NULL){
    N <- condition$sample_size      # alternatively, could use Attach() to make objects available
    grs <- condition$group_size_ratio
    sd <- condition$standard_deviation_ratio
    if(grs < 1){
        N2 <- N / (1/grs + 1)
        N1 <- N - N2
    } else {
        N1 <- N / (grs + 1)
        N2 <- N - N1
    }
    group1 <- rnorm(N1)
    group2 <- rnorm(N2, sd=sd)
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
    dat
}

Analyse <- function(condition, dat, fixed_objects = NULL){
    welch <- t.test(DV ~ group, dat)
    ind <- t.test(DV ~ group, dat, var.equal=TRUE)

    # In this function the p values for the t-tests are returned,
    #  and make sure to name each element, for future reference
    ret <- c(welch = welch$p.value, independent = ind$p.value)
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL){
    #find results of interest here (e.g., alpha < .1, .05, .01)
    ret <- EDR(results, alpha = .05)
    ret
}


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design

# first, test to see if it works
Final <- runSimulation(design=Design, replications=5, store_results=TRUE,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
head(Final)

# }
# NOT RUN {
# complete run with 1000 replications per condition
Final <- runSimulation(design=Design, replications=1000, parallel=TRUE,
                       generate=Generate, analyse=Analyse, summarise=Summarise)
head(Final, digits = 3)
View(Final)

## save final results to a file upon completion (not run)
runSimulation(design=Design, replications=1000, parallel=TRUE, save=TRUE, filename = 'mysim',
              generate=Generate, analyse=Analyse, summarise=Summarise)



## Debug the generate function. See ?browser for help on debugging
##   Type help to see available commands (e.g., n, c, where, ...),
##   ls() to see what has been defined, and type Q to quit the debugger
runSimulation(design=Design, replications=1000,
              generate=Generate, analyse=Analyse, summarise=Summarise,
              parallel=TRUE, edit='generate')

## Alternatively, place a browser() within the desired function line to
##   jump to a specific location
Summarise <- function(condition, results, fixed_objects = NULL){
    #find results of interest here (e.g., alpha < .1, .05, .01)
    ret <- EDR(results[,nms], alpha = .05)
    browser()
    ret
}

runSimulation(design=Design, replications=1000,
              generate=Generate, analyse=Analyse, summarise=Summarise,
              parallel=TRUE)




## EXTRA: To run the simulation on a MPI cluster, use the following setup on each node (not run)
# library(doMPI)
# cl <- startMPIcluster()
# registerDoMPI(cl)
# Final <- runSimulation(design=Design, replications=1000, MPI=TRUE, save=TRUE,
#                        generate=Generate, analyse=Analyse, summarise=Summarise)
# saveRDS(Final, 'mysim.rds')
# closeCluster(cl)
# mpi.quit()


## Similarly, run simulation on a network linked via ssh
##  (two way ssh key-paired connection must be possible between master and slave nodes)
##
## define IP addresses, including primary IP
# primary <- '192.168.2.20'
# IPs <- list(
#     list(host=primary, user='phil', ncore=8),
#     list(host='192.168.2.17', user='phil', ncore=8)
# )
# spec <- lapply(IPs, function(IP)
#                    rep(list(list(host=IP$host, user=IP$user)), IP$ncore))
# spec <- unlist(spec, recursive=FALSE)
#
# cl <- parallel::makeCluster(type='PSOCK', master=primary, spec=spec)
# Final <- runSimulation(design=Design, replications=1000, parallel = TRUE, save=TRUE,
#                        generate=Generate, analyse=Analyse, summarise=Summarise, cl=cl)

#~~~~~~~~~~~~~~~~~~~~~~~~
###### Post-analysis: Analyze the results via functions like lm() or SimAnova(), and create
###### tables(dplyr) or plots (ggplot2) to help visualize the results.
###### This is where you get to be a data analyst!

library(dplyr)
Final2 <- tbl_df(Final)
Final2 %>% summarise(mean(welch), mean(independent))
Final2 %>% group_by(standard_deviation_ratio, group_size_ratio) %>%
   summarise(mean(welch), mean(independent))

# quick ANOVA analysis method with all two-way interactions
SimAnova( ~ (sample_size + group_size_ratio + standard_deviation_ratio)^2, Final)

# or more specific ANOVAs
SimAnova(independent ~ (group_size_ratio + standard_deviation_ratio)^2,
    Final)

# make some plots
library(ggplot2)
library(reshape2)
welch_ind <- Final[,c('group_size_ratio', "standard_deviation_ratio",
    "welch", "independent")]
dd <- melt(welch_ind, id.vars = names(welch_ind)[1:2])

ggplot(dd, aes(factor(group_size_ratio), value)) + geom_boxplot() +
    geom_abline(intercept=0.05, slope=0, col = 'red') +
    geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
    geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
    facet_wrap(~variable)

ggplot(dd, aes(factor(group_size_ratio), value, fill = factor(standard_deviation_ratio))) +
    geom_boxplot() + geom_abline(intercept=0.05, slope=0, col = 'red') +
    geom_abline(intercept=0.075, slope=0, col = 'red', linetype='dotted') +
    geom_abline(intercept=0.025, slope=0, col = 'red', linetype='dotted') +
    facet_grid(variable~standard_deviation_ratio) +
    theme(legend.position = 'none')

# }
# NOT RUN {
# }

Run the code above in your browser using DataLab