s.MLRF: Spark MLlib Random Forest [C, R]

Description

Train an MLlib Random Forest model on Spark

Usage

s.MLRF(x, y = NULL, x.test = NULL, y.test = NULL, upsample = FALSE,
  upsample.seed = NULL, n.trees = 500, max.depth = 30L,
  subsampling.rate = 1, min.instances.per.node = 1,
  feature.subset.strategy = "auto", max.bins = 32L, x.name = NULL,
  y.name = NULL, spark.master = "local", print.plot = TRUE,
  plot.fitted = NULL, plot.predicted = NULL,
  plot.theme = getOption("rt.fit.theme", "lightgrid"), question = NULL,
  verbose = TRUE, trace = 0, outdir = NULL,
  save.mod = ifelse(!is.null(outdir), TRUE, FALSE), ...)

Arguments

x

Vector, matrix or dataframe of training set features

y

Vector of outcomes

x.test

vector, matrix or dataframe of testing set features

y.test

vector of testing set outcomes

upsample

Logical: If TRUE, upsample cases to balance outcome classes (for Classification only). Caution: upsampling samples randomly with replacement if the majority class has more than twice as many cases as the class being upsampled, which introduces randomness.

upsample.seed

Integer: If provided, will be used to set the seed during upsampling. Default = NULL (random seed)
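The upsampling behavior described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the rtemis implementation: `upsample_sketch` is a hypothetical helper, and it assumes each minority class is topped up to the size of the largest class, sampling with replacement only when the number of cases to add exceeds the class size (that is, when the majority class is more than double the minority class).

```r
# Hypothetical helper, not part of rtemis: balance classes by upsampling.
upsample_sketch <- function(x, y, seed = NULL) {
  stopifnot(is.factor(y), NROW(x) == length(y))
  if (!is.null(seed)) set.seed(seed)  # mirrors the role of upsample.seed
  counts <- table(droplevels(y))      # ignore empty factor levels
  n.max <- max(counts)
  extra <- unlist(lapply(names(counts), function(cl) {
    idx <- which(y == cl)
    deficit <- n.max - length(idx)
    if (deficit == 0) return(integer(0))
    # Replacement is needed once the deficit exceeds the class size
    idx[sample.int(length(idx), deficit, replace = deficit > length(idx))]
  }))
  keep <- c(seq_along(y), extra)      # original cases plus the resampled ones
  list(x = x[keep, , drop = FALSE], y = y[keep])
}

x <- data.frame(a = rnorm(12), b = rnorm(12))
y <- factor(c(rep("neg", 9), rep("pos", 3)))
up <- upsample_sketch(x, y, seed = 2019)
table(up$y)  # both classes now have 9 cases
```

Passing the same `seed` reproduces the same resampled cases, which is the purpose of `upsample.seed`.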

n.trees

Integer. Number of trees to train

max.depth

Integer. Max depth of each tree

max.bins

Integer. Max N of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.

x.name

Character: Name for feature set

y.name

Character: Name for outcome

spark.master

Spark cluster URL or "local"

print.plot

Logical: if TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted

plot.fitted

Logical: if TRUE, plot True (y) vs Fitted

plot.predicted

Logical: if TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test

plot.theme

String: Plot theme, e.g. "zero", "dark", "box", "darkbox". Default = getOption("rt.fit.theme", "lightgrid")

question

String: the question you are attempting to answer with this model, in plain language.

verbose

Logical: If TRUE, print summary to screen.

trace

Integer: If higher than 0, will print more information to the console. Default = 0

outdir

Path to output directory. If defined, the Predicted vs. True plot (if available) is saved there, along with the full model output if save.mod is TRUE.

save.mod

Logical. If TRUE, save all output as an RDS file in outdir. save.mod is TRUE by default if an outdir is defined. If set to TRUE and no outdir is defined, outdir defaults to paste0("./s.", mod.name)
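The interaction between outdir and save.mod can be illustrated with a small base R sketch. `resolve_output` is a hypothetical helper written for this illustration, and the value "MLRF" for mod.name is an assumption; only the default expression for save.mod and the paste0("./s.", mod.name) fallback come from the documentation above.

```r
# Hypothetical helper, not part of rtemis: resolve outdir and save.mod defaults.
resolve_output <- function(outdir = NULL,
                           save.mod = !is.null(outdir),
                           mod.name = "MLRF") {  # assumed model name
  # If saving was requested without an output directory, fall back to ./s.<mod.name>
  if (save.mod && is.null(outdir)) outdir <- paste0("./s.", mod.name)
  list(outdir = outdir, save.mod = save.mod)
}

resolve_output()                    # nothing is saved, outdir stays NULL
resolve_output(outdir = "./out")    # save.mod becomes TRUE
resolve_output(save.mod = TRUE)     # outdir becomes "./s.MLRF"
```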

...

Additional arguments

type

"regression" for continuous outcome; "classification" for categorical outcome. "auto" will result in regression for numeric y and classification otherwise

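As an illustration of the "auto" rule for type, the sketch below resolves the model type from the class of y. `resolve_type` is a hypothetical helper, and it assumes the check amounts to is.numeric(y); the function may implement this differently.

```r
# Hypothetical helper, not part of rtemis: resolve "auto" to a model type.
resolve_type <- function(y, type = c("auto", "regression", "classification")) {
  type <- match.arg(type)
  if (type != "auto") return(type)  # an explicit choice is respected
  if (is.numeric(y)) "regression" else "classification"
}

resolve_type(c(2.1, 3.4, 5.0))                     # "regression"
resolve_type(factor(c("a", "b", "a")))             # "classification"
resolve_type(c(0, 1, 1), type = "classification")  # "classification"
```

Note that a numeric 0/1 outcome resolves to regression under "auto", so classification on such data would need a factor y or an explicit type.
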
Value

rtMod object

Details

Because of the overhead incurred by Spark, this function is only worthwhile for very large datasets on a Spark cluster, not on a regular local machine.
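A call might look like the sketch below, which uses only arguments listed in Usage. It assumes that rtemis (in a version that still provides s.MLRF) and a working Spark installation are available, so the call is guarded by a manual flag, `run.spark`, which is a name introduced here for illustration. The data are simulated, and spark.master = "local" is used purely to show the interface; per the note above, a local run is not where this function pays off.

```r
# Simulated regression data, split into training and testing sets
set.seed(2019)
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 - x$x2 + rnorm(n)
train <- seq_len(150)
x.train <- x[train, ]
y.train <- y[train]
x.test <- x[-train, ]
y.test <- y[-train]

# Set to TRUE only on a machine with rtemis and Spark installed (assumption)
run.spark <- FALSE

if (run.spark) {
  mod <- rtemis::s.MLRF(
    x.train, y.train,
    x.test = x.test, y.test = y.test,
    n.trees = 100,           # fewer trees than the default of 500, to keep the demo short
    max.depth = 10L,
    spark.master = "local",  # replace with a cluster URL for real workloads
    print.plot = FALSE
  )
  print(class(mod))
}
```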

See Also

Other Supervised Learning: s.ADABOOST, s.ADDTREE, s.BART, s.BAYESGLM, s.BRUTO, s.C50, s.CART, s.CTREE, s.DA, s.ET, s.EVTREE, s.GAM.default, s.GAM.formula, s.GAMSEL, s.GAM, s.GBM3, s.GBM, s.GLMNET, s.GLM, s.GLS, s.H2ODL, s.H2OGBM, s.H2ORF, s.IRF, s.KNN, s.LDA, s.LM, s.MARS, s.MXN, s.NBAYES, s.NLA, s.NLS, s.NW, s.POLYMARS, s.PPR, s.PPTREE, s.QDA, s.QRNN, s.RANGER, s.RFSRC, s.RF, s.SGD, s.SPLS, s.SVM, s.TFN, s.XGBLIN, s.XGB

Other Tree-based methods: s.ADABOOST, s.ADDTREE, s.BART, s.C50, s.CART, s.CTREE, s.ET, s.EVTREE, s.GBM3, s.GBM, s.H2OGBM, s.H2ORF, s.IRF, s.PPTREE, s.RANGER, s.RFSRC, s.RF, s.XGB
