elevate
is a high-level function to tune, train, and test an rtemis model by nested resampling, with optional preprocessing and decomposition of input features.
elevate(x, y = NULL, mod = "ranger", mod.params = list(),
.preprocess = NULL, .decompose = NULL, .resample = NULL,
weights = NULL, resampler = "strat.sub", n.resamples = 10,
n.repeats = 1, stratify.var = NULL, train.p = 0.8,
strat.n.bins = 4, target.length = NULL, seed = NULL,
res.index = NULL, res.group = NULL, bag.fn = median,
x.name = NULL, y.name = NULL, save.mods = TRUE, save.tune = TRUE,
cex = 1.4, col = "#18A3AC", bag.fitted = FALSE, n.cores = 1,
parallel.type = ifelse(.Platform$OS.type == "unix", "fork", "psock"),
print.plot = TRUE, plot.fitted = FALSE, plot.predicted = TRUE,
plot.theme = getOption("rt.fit.theme", "lightgrid"),
print.res.plot = FALSE, question = NULL, verbose = TRUE,
trace = 0, res.verbose = FALSE, headless = FALSE, outdir = NULL,
save.plots = FALSE, save.rt = ifelse(!is.null(outdir), TRUE, FALSE),
save.mod = TRUE, save.res = FALSE, ...)
Numeric vector or matrix / data frame of features, i.e. independent variables
Numeric vector of outcome, i.e. dependent variable
String: Learner to use. Options: see modSelect
Optional named list of parameters to be passed to mod. All parameters can also be passed as part of ...
Optional named list of parameters to be passed to preprocess. Set using rtset.preprocess, e.g. .preprocess = rtset.preprocess(impute = TRUE)
Optional named list of parameters to be used for decomposition / dimensionality reduction. Set using rtset.decompose, e.g. .decompose = rtset.decompose("ica", 12)
Optional named list of parameters to be passed to resample. NOTE: If set, this takes precedence over setting the individual resampling arguments
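As a sketch, assuming the rtemis helper rtset.resample (see ?rtset), the entire outer resampling could be configured through this single argument instead of the individual resampling arguments:

```r
# Sketch (assumes the rtemis helper rtset.resample; check ?rtset):
# configure 10-fold outer resampling in one argument
library(rtemis)
mod <- elevate(x, y,
               .resample = rtset.resample(resampler = "kfold",
                                          n.resamples = 10))
```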
Numeric vector: Weights for cases. For classification, weights takes precedence over ipw, therefore set weights = NULL if using ipw. Note: If weights are provided, ipw is not used. Leave NULL if setting ipw = TRUE. Default = NULL
String: Type of resampling to perform: "bootstrap", "kfold", "strat.boot", "strat.sub". Default = "strat.boot" for length(y) < 200, otherwise "strat.sub"
Integer: Number of training/testing sets required
Integer: Number of times the external resample should be repeated. This allows you to do, for example, 10 times 10-fold cross-validation. Default = 1. In most cases it makes sense to use 1 repeat of many resamples, e.g. 25 stratified subsamples.
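For instance, 10 times 10-fold cross-validation (100 outer models in total) would be requested as follows; a sketch assuming x and y are already defined:

```r
# Sketch: 10 repeats of 10-fold outer cross-validation
mod <- elevate(x, y,
               resampler = "kfold",
               n.resamples = 10,
               n.repeats = 10)
```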
Numeric vector: Used to stratify external sampling (if applicable). Defaults to outcome y
Float (0, 1): Fraction of cases to assign to training set for resampler = "strat.sub"
Integer: Number of groups to use for stratification for resampler = "strat.sub" / "strat.boot"
Integer: Number of cases for training set for resampler = "strat.boot". Default = length(y)
Integer: (Optional) Set seed for random number generator, in order to make output reproducible. See ?base::set.seed
List where each element is a vector of training set indices. Use this for manual or precalculated train/test splits
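As a sketch, a precalculated set of five random 80% training splits could be built with base R and passed as res.index (the sample size here is hypothetical):

```r
# Sketch: five manual 80% training splits for res.index
set.seed(2020)
n.cases <- 100   # hypothetical sample size; use NROW(x) in practice
res.index <- lapply(1:5, function(i) sample(n.cases, size = 0.8 * n.cases))
# mod <- elevate(x, y, res.index = res.index)
```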
Integer vector, length = length(y): Numbers define fold membership, e.g. for 10-fold on a dataset with 1000 cases, you could use res.group = rep(1:10, each = 100)
Function to use to average predictions if bag.fitted = TRUE. Default = median
String: Name of predictor dataset
String: Name of outcome
Logical: If TRUE, retain trained models in object, otherwise discard (save space if running many resamples). Default = TRUE
Logical: If TRUE, save the best.tune data frame for each resample (output of gridSearchLearn)
Float: cex parameter for elevate plot
Color for elevate plot
Logical: If TRUE, use all models to also get a bagged prediction on the full sample. To get a bagged prediction on new data using the same models, use predict.rtModCV
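A minimal sketch of that workflow, assuming new data x.new with the same features as x:

```r
# Sketch: train with bagged fitted values, then bag predictions on new data
# using the same resample-trained models
mod <- elevate(x, y, bag.fitted = TRUE)
predicted <- predict(mod, x.new)   # dispatches to predict.rtModCV
```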
Integer: Number of cores to use. Default = 1. Parallelization is likely already happening either in the inner (tuning) resampling or within the learner itself; don't parallelize the parallelization
String: "psock" or "fork". Default = "fork" on Unix-like systems, otherwise "psock"
Logical: if TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted
Logical: if TRUE, plot True (y) vs Fitted
Logical: if TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test
String: "zero", "dark", "box", "darkbox"
Logical: If TRUE, print model performance plot for each resample. Default = FALSE
String: the question you are attempting to answer with this model, in plain language.
Logical: If TRUE, print summary to screen.
Integer: (Not really used) Print additional information if > 0. Default = 0
Logical: Passed to resLearn, then passed to each individual learner's verbose argument
Logical: If TRUE, turn off all plotting.
String: Path where output should be saved
Logical: If TRUE, save plots to outdir
Logical: If TRUE and outdir is set, save all models to outdir
Logical: If TRUE, save all output as RDS file in outdir. save.mod is TRUE by default if an outdir is defined. If set to TRUE and no outdir is defined, outdir defaults to paste0("./s.", mod.name)
Logical: If TRUE, save the full output of each model trained on different resamples under subdirectories of outdir
Additional mod.params to be passed to the learner (will be concatenated with mod.params, so you can use either way to pass learner arguments)
Object of class rtModCV (Regression) or rtModCVclass (Classification), with elements including:
- the mean or aggregate error, as appropriate, for each repeat
- the mean error of all repeats, i.e. the mean of error.test.repeats
- if n.repeats > 1, the standard deviation of error.test.repeats
- the error for each resample, for each repeat
- Note on resampling: You can never use an outer resampling method with replacement if you will also be using an inner resampling (for tuning). The duplicated cases from the outer resampling may appear both in the training and testing sets of the inner resamples, leading to artificially decreased error.
- If there is an error while running either the outer or inner resamples in parallel, the error message returned by R will likely be unhelpful. Repeat the command after setting both inner and outer resample run to use a single core, which should provide an informative message.
# NOT RUN {
# Regression
x <- rnormmat(100, 50)      # 100 cases x 50 features
w <- rnorm(50)              # true coefficients
y <- x %*% w + rnorm(100)   # outcome with Gaussian noise, one value per case
mod <- elevate(x, y)
# Classification
data(Sonar, package = "mlbench")
mod <- elevate(Sonar)   # outcome is the last column of the data frame
# }