Train an MLlib Random Forest model on Spark
s.MLRF(x, y = NULL, x.test = NULL, y.test = NULL, upsample = FALSE,
upsample.seed = NULL, n.trees = 500, max.depth = 30L,
subsampling.rate = 1, min.instances.per.node = 1,
feature.subset.strategy = "auto", max.bins = 32L, x.name = NULL,
y.name = NULL, spark.master = "local", print.plot = TRUE,
plot.fitted = NULL, plot.predicted = NULL,
plot.theme = getOption("rt.fit.theme", "lightgrid"), question = NULL,
verbose = TRUE, trace = 0, outdir = NULL,
save.mod = ifelse(!is.null(outdir), TRUE, FALSE), ...)
vector, matrix or dataframe of training set features
vector of outcomes
vector, matrix or dataframe of testing set features
vector of testing set outcomes
Logical: If TRUE, upsample cases to balance outcome classes (for Classification only). Caution: upsample will randomly sample with replacement if the length of the majority class is more than double the length of the class you are upsampling, thereby introducing randomness
Integer: If provided, will be used to set the seed during upsampling. Default = NULL (random seed)
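As a rough illustration of what seeded upsampling does, the base R sketch below resamples each smaller class with replacement until it matches the largest class. The helper `upsample_sketch` is hypothetical and is not the rtemis implementation; it only shows why fixing the seed makes the balanced training set reproducible.

```r
# Hypothetical helper (not part of rtemis): balance classes by drawing extra
# cases from each smaller class, with replacement, up to the largest class size.
upsample_sketch <- function(x, y, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  y <- droplevels(as.factor(y))
  target <- max(table(y))
  idx <- unlist(lapply(levels(y), function(lev) {
    i <- which(y == lev)
    extra <- target - length(i)
    # Keep every original case; draw only the shortfall with replacement
    c(i, i[sample.int(length(i), extra, replace = TRUE)])
  }))
  list(x = x[idx, , drop = FALSE], y = y[idx])
}

x <- data.frame(a = rnorm(12), b = rnorm(12))
y <- factor(c(rep("neg", 9), rep("pos", 3)))

up <- upsample_sketch(x, y, seed = 2019)
table(up$y)  # both classes now have 9 cases
```

Calling the helper twice with the same seed returns the same rows; with `seed = NULL` the extra cases differ from run to run.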
Integer. Number of trees to train
Integer. Max depth of each tree
Integer. Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
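To give a feel for what binning a continuous feature means, the sketch below discretizes a numeric vector into at most `max.bins` quantile-based bins using base R. This is an illustration only; the binning MLlib performs internally may differ in detail.

```r
# Illustration only: quantile-based binning in base R, not MLlib's own routine
set.seed(1)
feature <- rexp(1000)
max.bins <- 32L

# Bin edges at evenly spaced quantiles; unique() guards against tied edges
breaks <- unique(quantile(feature, probs = seq(0, 1, length.out = max.bins + 1L)))
binned <- cut(feature, breaks = breaks, include.lowest = TRUE)

nlevels(binned)      # number of bins actually formed (at most max.bins)
length(breaks) - 2L  # interior edges, i.e. candidate split thresholds
```

With more bins there are more candidate thresholds per feature, which is the sense in which a larger `max.bins` gives finer granularity at the cost of more computation.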
Character: Name for feature set
Character: Name for outcome
Spark cluster URL or "local"
Logical: if TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted
Logical: if TRUE, plot True (y) vs Fitted
Logical: if TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test
String: "zero", "dark", "box", "darkbox"
String: the question you are attempting to answer with this model, in plain language.
Logical: If TRUE, print summary to screen.
Integer: If higher than 0, will print more information to the console. Default = 0
Path to output directory. If defined, will save Predicted vs. True plot, if available, as well as full model output, if save.mod is TRUE
Logical. If TRUE, save all output as RDS file in outdir. save.mod is TRUE by default if an outdir is defined. If set to TRUE, and no outdir is defined, outdir defaults to paste0("./s.", mod.name)
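The interaction between outdir and save.mod can be sketched in a few lines of base R. The helper `resolve_output` is hypothetical and only mirrors the defaults described above; the value `"MLRF"` for `mod.name` is an assumption made for the example.

```r
# Hypothetical helper mirroring the documented defaults; not rtemis internals.
# mod.name = "MLRF" is an assumed value used only for illustration.
resolve_output <- function(outdir = NULL,
                           save.mod = ifelse(!is.null(outdir), TRUE, FALSE),
                           mod.name = "MLRF") {
  if (save.mod && is.null(outdir)) {
    outdir <- paste0("./s.", mod.name)
  }
  list(outdir = outdir, save.mod = save.mod)
}

resolve_output()                      # nothing saved, no output directory
resolve_output(outdir = "./results")  # save.mod becomes TRUE
resolve_output(save.mod = TRUE)       # outdir falls back to "./s.MLRF"
```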
Additional arguments
"regression" for continuous outcome; "classification" for categorical outcome. "auto" will result in regression for numeric y and classification otherwise
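The "auto" rule amounts to a single check on the class of y. The helper below is a hypothetical base R sketch of that rule, and the argument name `type` is an assumption for the example; neither is taken from rtemis.

```r
# Hypothetical helper (not part of rtemis) expressing the documented "auto" rule
resolve_type <- function(y, type = "auto") {
  if (type != "auto") return(type)
  if (is.numeric(y)) "regression" else "classification"
}

resolve_type(c(1.2, 3.4, 5.6))          # "regression"
resolve_type(factor(c("a", "b", "a")))  # "classification"
```

Note that under this rule an integer-coded class label such as `c(0L, 1L, 1L)` counts as numeric, so it would need to be converted to a factor first to be treated as a classification outcome.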
rtMod object
The overhead incurred by Spark means this should be used only for really large datasets on a Spark cluster, not on a regular local machine.
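A call might look like the sketch below, which uses only arguments from the usage signature on synthetic data. It assumes an rtemis version that exports s.MLRF, the sparklyr package, and a working Spark installation, so the Spark call sits behind a flag that is off by default.

```r
# Sketch only: assumes rtemis (a version that exports s.MLRF), sparklyr,
# and a working Spark installation. Set the flag to TRUE to actually run it.
run_spark_example <- FALSE

# Synthetic regression data: 500 cases, 5 numeric features
set.seed(2019)
x <- as.data.frame(matrix(rnorm(500 * 5), ncol = 5))
y <- x[[1]]^2 + 0.5 * x[[3]] + rnorm(500)  # numeric outcome
train <- sample(nrow(x), 400)

if (run_spark_example && requireNamespace("rtemis", quietly = TRUE)) {
  mod <- rtemis::s.MLRF(
    x = x[train, ], y = y[train],
    x.test = x[-train, ], y.test = y[-train],
    n.trees = 100, max.depth = 10L,
    spark.master = "local",  # or a Spark cluster URL
    print.plot = FALSE
  )
}
```

On data this small, a non-Spark learner from the same family, such as s.RANGER or s.RF, is likely the more practical choice.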
elevate for external cross-validation
Other Supervised Learning: s.ADABOOST, s.ADDTREE, s.BART, s.BAYESGLM, s.BRUTO, s.C50, s.CART, s.CTREE, s.DA, s.ET, s.EVTREE, s.GAM.default, s.GAM.formula, s.GAMSEL, s.GAM, s.GBM3, s.GBM, s.GLMNET, s.GLM, s.GLS, s.H2ODL, s.H2OGBM, s.H2ORF, s.IRF, s.KNN, s.LDA, s.LM, s.MARS, s.MXN, s.NBAYES, s.NLA, s.NLS, s.NW, s.POLYMARS, s.PPR, s.PPTREE, s.QDA, s.QRNN, s.RANGER, s.RFSRC, s.RF, s.SGD, s.SPLS, s.SVM, s.TFN, s.XGBLIN, s.XGB
Other Tree-based methods: s.ADABOOST, s.ADDTREE, s.BART, s.C50, s.CART, s.CTREE, s.ET, s.EVTREE, s.GBM3, s.GBM, s.H2OGBM, s.H2ORF, s.IRF, s.PPTREE, s.RANGER, s.RFSRC, s.RF, s.XGB