Train a Generalized Linear Model for Regression or Classification (i.e. Logistic Regression) using stats::glm. If outcome y has more than two classes, Multinomial Logistic Regression is performed using nnet::multinom.
s.GLM(x, y = NULL, x.test = NULL, y.test = NULL, x.name = NULL,
y.name = NULL, family = NULL, interactions = FALSE,
nway.interactions = 0, covariate = NULL, class.method = NULL,
weights = NULL, ipw = TRUE, ipw.type = 2, upsample = FALSE,
upsample.seed = NULL, intercept = TRUE, polynomial = FALSE,
poly.d = 3, poly.raw = FALSE, print.plot = TRUE,
plot.fitted = NULL, plot.predicted = NULL,
plot.theme = getOption("rt.fit.theme", "lightgrid"),
na.action = na.exclude, removeMissingLevels = TRUE,
question = NULL, rtclass = NULL, verbose = TRUE, trace = 0,
outdir = NULL, save.mod = ifelse(!is.null(outdir), TRUE, FALSE), ...)
x: Numeric vector or matrix / data frame of features, i.e. independent variables
y: Numeric vector of outcome, i.e. dependent variable
x.test: Numeric vector or matrix / data frame of testing set features. Columns must correspond to columns in x
y.test: Numeric vector of testing set outcome
x.name: Character: Name for feature set
y.name: Character: Name for outcome
family: Error distribution and link function. See stats::family
interactions: Logical: If TRUE, include all pairwise interactions: formula = y ~ .*.
nway.interactions: Integer: Include n-way interactions. This integer defines the n in formula = y ~ .^n (see the sketch after this argument list)
covariate: String: Name of column to be included as interaction term in formula; must be a factor
class.method: String: Define "logistic" or "multinom" for classification. The only purpose of this is so you can try nnet::multinom instead of glm for binary classification
weights: Numeric vector: Weights for cases. For Classification, weights take precedence over ipw, therefore set weights = NULL if using ipw. Default = NULL
ipw: Logical: If TRUE, apply inverse probability weighting (for Classification only). Note: If weights are provided, ipw is not used. Default = TRUE
ipw.type: Integer {0, 1, 2}: 1: class.weights as in 0, divided by max(class.weights); 2: class.weights as in 0, divided by min(class.weights). Default = 2
upsample: Logical: If TRUE, upsample cases to balance outcome classes (for Classification only). Caution: upsample will randomly sample with replacement if the length of the majority class is more than double the length of the class you are upsampling, thereby introducing randomness
upsample.seed: Integer: If provided, will be used to set the seed during upsampling. Default = NULL (random seed)
intercept: Logical: If TRUE, fit an intercept term. Default = TRUE
polynomial: Logical: If TRUE, run lm on poly(x, poly.d) (creates orthogonal polynomials)
poly.d: Integer: Degree of polynomial. Default = 3
poly.raw: Logical: If TRUE, use raw polynomials. The default, which should not really be changed, is FALSE
print.plot: Logical: If TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted
plot.fitted: Logical: If TRUE, plot True (y) vs Fitted
plot.predicted: Logical: If TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test
plot.theme: String: "zero", "dark", "box", "darkbox"
na.action: How to handle missing values. See ?na.fail
removeMissingLevels: Logical: If TRUE, finds factors in x.test that contain levels not present in x and substitutes them with NA. Such levels would otherwise cause an error, no predictions would be made, and s.GLM would end prematurely
question: String: The question you are attempting to answer with this model, in plain language
rtclass: String: Class type to use. "S3", "S4", "RC", "R6"
verbose: Logical: If TRUE, print summary to screen
trace: Integer: If higher than 0, will print more information to the console. Default = 0
outdir: Path to output directory. If defined, will save Predicted vs. True plot, if available, as well as full model output, if save.mod is TRUE
save.mod: Logical: If TRUE, save all output as RDS file in outdir. save.mod is TRUE by default if an outdir is defined; if set to TRUE and no outdir is defined, outdir defaults to paste0("./s.", mod.name)
...: Additional arguments
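The interaction and polynomial options map onto standard R formula and poly() constructs. The following is a minimal illustrative sketch using plain stats::glm and stats::lm on made-up data; the data frame and column names are hypothetical and not part of s.GLM:

# Hypothetical data for illustration only
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
df$y <- df$a + df$b * df$c + rnorm(50)

# interactions = TRUE corresponds to all pairwise interactions:
fit.pairwise <- glm(y ~ .*., family = gaussian, data = df)

# nway.interactions = 3 corresponds to interactions up to 3-way:
fit.3way <- glm(y ~ .^3, family = gaussian, data = df)

# polynomial = TRUE with poly.d = 3 corresponds to fitting on
# orthogonal polynomials of the features (raw controlled by poly.raw):
fit.poly <- lm(y ~ poly(a, 3, raw = FALSE), data = df)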
A common problem with glm arises when the testing set contains a predictor with more levels than the same predictor in the training set, resulting in an error. This can happen when training on resamples of a data set, especially after stratifying against a different outcome, and results in an error and no prediction. s.GLM automatically finds such cases and substitutes levels present in x.test and not in x with NA.
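A minimal sketch of this problem and of the NA substitution, using plain stats::glm; the data, column, and level names are made up for illustration:

# Training data with factor levels "a" and "b" only (hypothetical)
dat.train <- data.frame(y = rnorm(6), grp = factor(c("a", "b", "a", "b", "a", "b")))
# Testing data contains an additional, unseen level "c"
dat.test <- data.frame(grp = factor(c("a", "b", "c")))

fit <- glm(y ~ grp, data = dat.train)
# predict(fit, newdata = dat.test)  # would error: factor grp has new level "c"

# Substituting the unseen level with NA, as s.GLM does, allows prediction
# for the remaining cases; the affected case gets an NA prediction:
dat.test$grp <- factor(dat.test$grp, levels = levels(dat.train$grp))
predict(fit, newdata = dat.test)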
See also: elevate for external cross-validation.
Other Supervised Learning: s.ADABOOST, s.ADDTREE, s.BART, s.BAYESGLM, s.BRUTO, s.C50, s.CART, s.CTREE, s.DA, s.ET, s.EVTREE, s.GAM.default, s.GAM.formula, s.GAMSEL, s.GAM, s.GBM3, s.GBM, s.GLMNET, s.GLS, s.H2ODL, s.H2OGBM, s.H2ORF, s.IRF, s.KNN, s.LDA, s.LM, s.MARS, s.MLRF, s.MXN, s.NBAYES, s.NLA, s.NLS, s.NW, s.POLYMARS, s.PPR, s.PPTREE, s.QDA, s.QRNN, s.RANGER, s.RFSRC, s.RF, s.SGD, s.SPLS, s.SVM, s.TFN, s.XGBLIN, s.XGB
Other Interpretable models: s.ADDTREE, s.C50, s.CART, s.GLMNET
# NOT RUN {
x <- rnorm(100)
y <- .6 * x + 12 + rnorm(100)/2
mod <- s.GLM(x, y)
# }
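The example above covers Regression. As a rough sketch of the Classification case described at the top, assuming only that s.GLM accepts a data frame of features with a factor outcome as documented above (iris ships with base R; the object names are hypothetical):

# Binary factor outcome -> logistic regression via stats::glm
dat <- iris[iris$Species != "setosa", ]
mod.bin <- s.GLM(dat[, 1:4], factor(dat$Species))
# Three-class outcome -> multinomial regression via nnet::multinom
mod.multi <- s.GLM(iris[, 1:4], iris$Species)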