h2o.glm: H2O: Generalized Linear Models

Description

Fit a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.

Usage

## Default method:
h2o.glm(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1e-5, epsilon = 1e-4, 
  standardize = TRUE, prior, tweedie.p = ifelse(family == 'tweedie', 1.5, 
  as.numeric(NA)), thresholds, iter.max, higher_accuracy, lambda_search, version = 2)

## Import to a ValueArray object:
h2o.glm.VA(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1e-5, epsilon = 1e-4, 
  standardize = TRUE, prior, tweedie.p = ifelse(family == 'tweedie', 1.5, 
  as.numeric(NA)), thresholds = ifelse(family == 'binomial', seq(0, 1, 0.01), 
  as.numeric(NA)))

## Import to a FluidVecs object:
h2o.glm.FV(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1e-5, epsilon = 1e-4, 
  standardize = TRUE, prior, tweedie.p = ifelse(family == 'tweedie', 1.5, 
  as.numeric(NA)), iter.max = 100, higher_accuracy = FALSE, lambda_search = FALSE)

Arguments

x

A vector containing the names of the predictors in the model.

y

The name of the response variable in the model.

data

An H2OParsedDataVA (version = 1) or H2OParsedData (version = 2) object containing the variables in the model.

family

A description of the error distribution and corresponding link function to be used in the model. Currently, Gaussian, binomial, Poisson, gamma, and Tweedie are supported. When a model is specified as Tweedie, users must also specify the appropriate Tweedi

nfolds

(Optional) Number of folds for cross-validation. The default is 10.

alpha

(Optional) The elastic-net mixing parameter, which must be in [0,1]. The penalty is defined to be $$P(\alpha,\beta) = \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 = \sum_j \left[\frac{1-\alpha}{2}\beta_j^2 + \alpha|\beta_j|\right]$$ so alpha=1 is the lasso

lambda

The shrinkage parameter, which multiplies $P(\alpha,\beta)$ in the objective. The larger lambda is, the more the coefficients are shrunk toward zero (and each other).
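
As a rough illustration of how alpha and lambda interact, the base R sketch below evaluates the penalty $P(\alpha,\beta)$ from the alpha entry for a small coefficient vector and then scales it by lambda. The helper name `elastic_net_penalty` and the coefficient values are hypothetical and chosen only for illustration; this is not H2O's internal code.

```r
# Hypothetical helper: elastic-net penalty P(alpha, beta) as defined under `alpha`
elastic_net_penalty <- function(beta, alpha) {
  stopifnot(alpha >= 0, alpha <= 1)
  sum((1 - alpha) / 2 * beta^2 + alpha * abs(beta))
}

beta <- c(0.5, -2, 0)  # made-up coefficients for illustration

p_ridge <- elastic_net_penalty(beta, alpha = 0)    # (1/2) * sum(beta^2)
p_lasso <- elastic_net_penalty(beta, alpha = 1)    # sum(abs(beta))
p_mixed <- elastic_net_penalty(beta, alpha = 0.5)  # midpoint of the two, since P is linear in alpha

# lambda multiplies the penalty in the objective
lambda <- 1e-5
penalty_term <- lambda * p_mixed
```

Because the penalty is linear in alpha, intermediate values of alpha interpolate between the pure squared-L2 and pure L1 penalties.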

epsilon

(Optional) Number indicating the cutoff for determining if a coefficient is zero.

standardize

(Optional) Logical value indicating whether the data should be standardized (set to mean = 0, variance = 1) before running GLM.
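
For intuition, standardization to mean 0 and variance 1 can be sketched in base R with `scale()`. The matrix below is made up, and the sketch assumes the sample variance (divisor n - 1) that `scale()` and `var()` use; the exact convention H2O applies internally is not stated here.

```r
# Made-up predictor matrix for illustration
x <- cbind(age = c(50, 60, 70, 80), psa = c(1.2, 3.4, 10.1, 25.3))

# scale() centers each column and divides it by its standard deviation
x_std <- scale(x, center = TRUE, scale = TRUE)

col_means <- colMeans(x_std)       # approximately 0 for every column
col_vars  <- apply(x_std, 2, var)  # approximately 1 for every column
```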

prior

(Optional) Prior probability of class 1. Only used if family = "binomial". When omitted, prior will default to the frequency of class 1 in the response column.

tweedie.p

(Optional) The index of the power variance function for the Tweedie distribution. Only used if family = "tweedie".

thresholds

(Optional) Degree to which to weight the sensitivity (the proportion of correctly classified 1s) and specificity (the proportion of correctly classified 0s). The default option is joint optimization for the overall classification rate. Changing this will
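
To make the two quantities concrete, the base R sketch below computes sensitivity and specificity at a single cutoff and then over the grid `seq(0, 1, 0.01)` that appears as the default in the `h2o.glm.VA` usage. The data and the helper name `sens_spec` are hypothetical, and the rule "predict 1 when the probability is at least the threshold" is an assumption rather than H2O's documented behavior.

```r
# Made-up predicted probabilities and true 0/1 labels
prob   <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.20)
actual <- c(0,    0,    1,    1,    1,    0)

# Hypothetical helper: sensitivity and specificity at one cutoff
sens_spec <- function(prob, actual, threshold) {
  pred <- as.integer(prob >= threshold)  # assumed classification rule
  c(sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

at_half <- sens_spec(prob, actual, 0.5)

# One row per threshold, columns "sensitivity" and "specificity"
thresholds <- seq(0, 1, 0.01)
rates <- t(sapply(thresholds, function(t) sens_spec(prob, actual, t)))
```

Lowering the threshold raises sensitivity at the cost of specificity, which is the trade-off this argument controls.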

iter.max

(Optional) Maximum number of iterations allowed.

higher_accuracy

(Optional) A logical value indicating whether to use line search. This will cause the algorithm to run slower, so generally, it should only be set to TRUE if GLM does not converge otherwise.

lambda_search

(Optional) A logical value indicating whether to conduct a search over the space of lambda values, starting from lambda_max. When this is set to TRUE, lambda will be interpreted as lambda_min.

version

(Optional) The version of GLM to run. If version = 1, this will run the more stable ValueArray implementation, while version = 2 runs the faster but still beta-stage FluidVecs implementation.

Value

An object of class H2OGLMModelVA (version = 1) or H2OGLMModel (version = 2) with slots key, data, model and xval. The slot model is a list of the following components:

coefficients

A named vector of the coefficients estimated in the model.

rank

The numeric rank of the fitted linear model.

family

The family of the error distribution.

deviance

The deviance of the fitted model.

aic

Akaike's Information Criterion for the final computed model.

null.deviance

The deviance for the null model.

iter

Number of algorithm iterations to compute the model.

df.residual

The residual degrees of freedom.

df.null

The residual degrees of freedom for the null model.

y

The response variable in the model.

x

A vector of the predictor variable(s) in the model.

auc

Area under the curve.

training.err

Average training error.

threshold

Best threshold.

confusion

Confusion matrix.

The slot xval is a list of H2OGLMModel or H2OGLMModelVA objects representing the cross-validation models. Each of these objects itself has xval equal to an empty list.

Details

IMPORTANT: Currently, to run GLM with version = 1, you must import data to a ValueArray object using h2o.importFile.VA, h2o.importFolder.VA or one of their variants. To run with version = 2, you must import data to a FluidVecs object using h2o.importFile.FV, h2o.importFolder.FV or one of their variants.

Examples

library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)

# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostate.hex = h2o.importURL(localH2O, path = paste("https://raw.github.com", 
  "0xdata/h2o/master/smalldata/logreg/prostate.csv", sep = "/"), key = "prostate.hex")
h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data = prostate.hex, family = "binomial", 
  nfolds = 10, alpha = 0.5)
# Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
myX = setdiff(colnames(prostate.hex), c("ID", "DPROS", "DCAPS", "VOL"))
h2o.glm(y = "VOL", x = myX, data = prostate.hex, family = "gaussian", nfolds = 5, alpha = 0.1)