Train a Generalized Linear Model for Regression or Classification (i.e. Logistic Regression) using stats::glm. If outcome y has more than two classes, Multinomial Logistic Regression is performed using nnet::multinom.
s.GLM(x, y = NULL, x.test = NULL, y.test = NULL, x.name = NULL,
y.name = NULL, family = NULL, interactions = FALSE,
nway.interactions = 0, covariate = NULL, class.method = NULL,
weights = NULL, ipw = TRUE, ipw.type = 2, upsample = FALSE,
upsample.seed = NULL, intercept = TRUE, polynomial = FALSE,
poly.d = 3, poly.raw = FALSE, print.plot = TRUE,
plot.fitted = NULL, plot.predicted = NULL,
plot.theme = getOption("rt.fit.theme", "lightgrid"),
na.action = na.exclude, removeMissingLevels = TRUE,
question = NULL, rtclass = NULL, verbose = TRUE, trace = 0,
outdir = NULL, save.mod = ifelse(!is.null(outdir), TRUE, FALSE), ...)
x: Numeric vector or matrix / data frame of features, i.e. independent variables
y: Numeric vector of outcome, i.e. dependent variable
x.test: Numeric vector or matrix / data frame of testing set features. Columns must correspond to columns in x
y.test: Numeric vector of testing set outcome
x.name: Character: Name for feature set
y.name: Character: Name for outcome
family: Error distribution and link function. See stats::family
interactions: Logical: If TRUE, include all pairwise interactions: formula = y ~ .*.
nway.interactions: Integer: Include n-way interactions. This integer defines the n in formula = y ~ .^n (see the sketch after this argument list)
covariate: String: Name of column to be included as interaction term in formula; must be a factor
class.method: String: Define "logistic" or "multinom" for classification. The only purpose of this is so you can try nnet::multinom instead of glm for binary classification
weights: Numeric vector: Weights for cases. For Classification, weights take precedence over ipw, therefore set weights = NULL if using ipw. Default = NULL
ipw: Logical: If TRUE, apply inverse probability weighting (for Classification only). Note: If weights are provided, ipw is not used. Default = TRUE
ipw.type: Integer {0, 1, 2}: 1: class.weights as in 0, divided by max(class.weights); 2: class.weights as in 0, divided by min(class.weights). Default = 2
upsample: Logical: If TRUE, upsample cases to balance outcome classes (for Classification only). Caution: upsample will randomly sample with replacement if the length of the majority class is more than double the length of the class you are upsampling, thereby introducing randomness
upsample.seed: Integer: If provided, will be used to set the seed during upsampling. Default = NULL (random seed)
intercept: Logical: If TRUE, fit an intercept term. Default = TRUE
polynomial: Logical: If TRUE, run lm on poly(x, poly.d) (creates orthogonal polynomials)
poly.d: Integer: Degree of polynomial. Default = 3
poly.raw: Logical: If TRUE, use raw polynomials. The default, which should not really be changed, is FALSE
print.plot: Logical: If TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted
plot.fitted: Logical: If TRUE, plot True (y) vs Fitted
plot.predicted: Logical: If TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test
plot.theme: String: "zero", "dark", "box", "darkbox"
na.action: How to handle missing values. See ?na.fail
removeMissingLevels: Logical: If TRUE, finds factors in x.test that contain levels not present in x and substitutes them with NA. Such levels would otherwise cause an error, no predictions would be made, and s.GLM would end prematurely
question: String: The question you are attempting to answer with this model, in plain language
rtclass: String: Class type to use. "S3", "S4", "RC", "R6"
verbose: Logical: If TRUE, print summary to screen
trace: Integer: If higher than 0, will print more information to the console. Default = 0
outdir: Path to output directory. If defined, will save Predicted vs. True plot, if available, as well as full model output, if save.mod is TRUE
save.mod: Logical: If TRUE, save all output as RDS file in outdir. save.mod is TRUE by default if an outdir is defined; if set to TRUE and no outdir is defined, outdir defaults to paste0("./s.", mod.name)
...: Additional arguments
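The interaction and polynomial options map onto standard R formula and poly() constructs. The following is a minimal illustrative sketch using plain stats::glm and stats::lm on made-up data; the data frame and column names are hypothetical and not part of s.GLM:

# Hypothetical data for illustration only
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
df$y <- df$a + df$b * df$c + rnorm(50)

# interactions = TRUE corresponds to all pairwise interactions:
fit.pairwise <- glm(y ~ .*., family = gaussian, data = df)

# nway.interactions = 3 corresponds to interactions up to 3-way:
fit.3way <- glm(y ~ .^3, family = gaussian, data = df)

# polynomial = TRUE with poly.d = 3 corresponds to fitting on
# orthogonal polynomials of the features (raw controlled by poly.raw):
fit.poly <- lm(y ~ poly(a, 3, raw = FALSE), data = df)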
A common problem with glm arises when the testing set contains a predictor with more levels than the same predictor in the training set, resulting in an error. This can happen when training on resamples of a data set, especially after stratifying against a different outcome, and results in an error and no prediction. s.GLM automatically finds such cases and substitutes levels present in x.test and not in x with NA.
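A minimal sketch of this problem and of the NA substitution, using plain stats::glm; the data, column, and level names are made up for illustration:

# Training data with factor levels "a" and "b" only (hypothetical)
dat.train <- data.frame(y = rnorm(6), grp = factor(c("a", "b", "a", "b", "a", "b")))
# Testing data contains an additional, unseen level "c"
dat.test <- data.frame(grp = factor(c("a", "b", "c")))

fit <- glm(y ~ grp, data = dat.train)
# predict(fit, newdata = dat.test)  # would error: factor grp has new level "c"

# Substituting the unseen level with NA, as s.GLM does, allows prediction
# for the remaining cases; the affected case gets an NA prediction:
dat.test$grp <- factor(dat.test$grp, levels = levels(dat.train$grp))
predict(fit, newdata = dat.test)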
See also: elevate for external cross-validation.
Other Supervised Learning: s.ADABOOST, s.ADDTREE, s.BART, s.BAYESGLM, s.BRUTO, s.C50, s.CART, s.CTREE, s.DA, s.ET, s.EVTREE, s.GAM.default, s.GAM.formula, s.GAMSEL, s.GAM, s.GBM3, s.GBM, s.GLMNET, s.GLS, s.H2ODL, s.H2OGBM, s.H2ORF, s.IRF, s.KNN, s.LDA, s.LM, s.MARS, s.MLRF, s.MXN, s.NBAYES, s.NLA, s.NLS, s.NW, s.POLYMARS, s.PPR, s.PPTREE, s.QDA, s.QRNN, s.RANGER, s.RFSRC, s.RF, s.SGD, s.SPLS, s.SVM, s.TFN, s.XGBLIN, s.XGB
Other Interpretable models: s.ADDTREE, s.C50, s.CART, s.GLMNET
# NOT RUN {
x <- rnorm(100)
y <- .6 * x + 12 + rnorm(100)/2
mod <- s.GLM(x, y)
# }
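The example above covers Regression. As a rough sketch of the Classification case described at the top, assuming only that s.GLM accepts a data frame of features with a factor outcome as documented above (iris ships with base R; the object names are hypothetical):

# Binary factor outcome -> logistic regression via stats::glm
dat <- iris[iris$Species != "setosa", ]
mod.bin <- s.GLM(dat[, 1:4], factor(dat$Species))
# Three-class outcome -> multinomial regression via nnet::multinom
mod.multi <- s.GLM(iris[, 1:4], iris$Species)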