h2o (version 2.8.1.1)

h2o.glm: H2O: Generalized Linear Models

Description

Fit a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.

Usage

h2o.glm(x, y, data, key = "", family, link, nfolds = 0, alpha = 0.5, nlambda = -1, 
  lambda.min.ratio = -1, lambda = 1e-5, epsilon = 1e-4, standardize = TRUE, 
  prior, variable_importances = FALSE, use_all_factor_levels = FALSE, tweedie.p =
  ifelse(family == 'tweedie', 1.5, as.numeric(NA)), iter.max = 100, 
  higher_accuracy = FALSE, lambda_search = FALSE, return_all_lambda = FALSE, 
  max_predictors = -1)

Arguments

x
A vector containing the names of the predictors in the model.
y
The name of the response variable in the model.
data
An H2OParsedData object containing the variables in the model.
key
(Optional) The unique hex key assigned to the resulting model. If none is given, a key will automatically be generated.
family
A description of the error distribution and corresponding link function to be used in the model. Currently, Gaussian, binomial, Poisson, gamma, and Tweedie are supported. When a model is specified as Tweedie, users must also specify the appropriate Tweedie variance power via tweedie.p.
link
(Optional) The link function relates the linear predictor to the distribution function. The default is the canonical link for the specified family. Supported links: gaussian: identity, log, inverse; binomial: logit, log; poisson: log, identity.
nfolds
(Optional) Number of folds for cross-validation.
alpha
(Optional) The elastic-net mixing parameter, which must be in [0,1]. The penalty is defined to be $$P(\alpha,\beta) = (1-\alpha)/2||\beta||_2^2 + \alpha||\beta||_1 = \sum_j [(1-\alpha)/2 \beta_j^2 + \alpha|\beta_j|]$$ so alpha = 1 is the lasso penalty and alpha = 0 is the ridge penalty.
nlambda
The number of lambda values when performing a search.
lambda.min.ratio
Smallest value for lambda, as a fraction of lambda.max, the entry value, i.e. the smallest value of lambda for which all coefficients in the model are zero.
lambda
The shrinkage parameter, which multiplies $P(\alpha,\beta)$ in the objective. The larger lambda is, the more the coefficients are shrunk toward zero (and each other).
epsilon
(Optional) Number indicating the cutoff for determining if a coefficient is zero.
standardize
(Optional) Logical value indicating whether the data should be standardized (set to mean = 0, variance = 1) before running GLM.
prior
(Optional) Prior probability of class 1. Only used if family = "binomial". When omitted, prior will default to the frequency of class 1 in the response column.
variable_importances
(Optional) A logical value indicating whether variable importances should be computed for the input features. NOTE: If use_all_factor_levels is off, the importance of the base factor level will NOT be shown.
use_all_factor_levels
(Optional) A logical value indicating whether all factor levels should be used as predictors. By default, the first factor level is skipped from the possible set of predictors. Set this flag to TRUE if you want to use all of the levels; doing so typically requires sufficient regularization.
tweedie.p
(Optional) The index of the power variance function for the tweedie distribution. Only used if family = "tweedie".
iter.max
(Optional) Maximum number of iterations allowed.
higher_accuracy
(Optional) A logical value indicating whether to use line search. This will cause the algorithm to run slower, so generally, it should only be set to TRUE if GLM does not converge otherwise.
lambda_search
(Optional) A logical value indicating whether to conduct a search over the space of lambda values, starting from lambda_max. When this is set to TRUE, lambda will be interpreted as lambda_min.
return_all_lambda
(Optional) A logical value indicating whether to return every model built during the lambda search. Only used if lambda_search = TRUE. If return_all_lambda = FALSE, then only the model corresponding to the optimal lambda will be returned.
max_predictors
(Optional) When lambda_search = TRUE, the algorithm will stop training if the number of predictors exceeds this value. Ignored when lambda_search = FALSE or max_predictors = -1.
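
The interaction between alpha, lambda, and lambda_search can be illustrated with a short sketch (a minimal example, assuming the prostate.hex frame has already been imported as in the Examples section below):

# alpha = 1 is the pure lasso penalty; alpha = 0 is pure ridge.
lasso.fit = h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
                    data = prostate.hex, family = "binomial", alpha = 1)

# With lambda_search = TRUE, lambda is reinterpreted as lambda_min and the
# solver works down from lambda_max; nlambda controls the grid size.
search.fit = h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
                     data = prostate.hex, family = "binomial", alpha = 0.5,
                     lambda_search = TRUE, nlambda = 20, lambda = 1e-4)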

Value

  • An object of class H2OGLMModel with slots key, data, model and xval. The slot model is a list of the following components:
  • coefficients: A named vector of the coefficients estimated in the model.
  • rank: The numeric rank of the fitted linear model.
  • family: The family of the error distribution.
  • deviance: The deviance of the fitted model.
  • aic: Akaike's Information Criterion for the final computed model.
  • null.deviance: The deviance for the null model.
  • iter: Number of algorithm iterations to compute the model.
  • df.residual: The residual degrees of freedom.
  • df.null: The residual degrees of freedom for the null model.
  • y: The response variable in the model.
  • x: A vector of the predictor variable(s) in the model.
  • auc: Area under the curve.
  • training.err: Average training error.
  • threshold: Best threshold.
  • confusion: Confusion matrix.
  • The slot xval is a list of H2OGLMModel objects representing the cross-validation models. (Each of these objects has xval equal to an empty list.)
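
The components above live in the model slot and can be pulled out with standard S4 accessors (a sketch; my.fit is a hypothetical model fitted by a binomial h2o.glm call such as the one in the Examples):

my.fit@model$coefficients   # named vector of estimated coefficients
my.fit@model$aic            # AIC of the final model
my.fit@model$auc            # AUC (binomial models only)
my.fit@model$confusion      # confusion matrix at the best threshold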

See Also

h2o.importFile, h2o.importFolder, h2o.importHDFS, h2o.importURL, h2o.uploadFile

Examples

# -- CRAN examples begin --
library(h2o)
localH2O = h2o.init()

# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(localH2O, path = prostatePath, key = "prostate.hex")
h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data = prostate.hex, 
        family = "binomial", nfolds = 0, alpha = 0.5, lambda_search = FALSE, 
        use_all_factor_levels = FALSE, variable_importances = FALSE, higher_accuracy = FALSE)
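
# The call above returns an H2OGLMModel; assigning the result lets you
# inspect the components listed under Value (a sketch; prostate.glm is a
# hypothetical name for the same model fitted above).
prostate.glm = h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"),
                       data = prostate.hex, family = "binomial")
print(prostate.glm@model$coefficients)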

# Run GLM of VOL ~ CAPSULE + AGE + RACE + PSA + GLEASON
myX = setdiff(colnames(prostate.hex), c("ID", "DPROS", "DCAPS", "VOL"))
h2o.glm(y = "VOL", x = myX, data = prostate.hex, family = "gaussian", nfolds = 0, alpha = 0.1,
        lambda_search = FALSE, use_all_factor_levels = FALSE, variable_importances = FALSE, 
        higher_accuracy = FALSE)
# -- CRAN examples end --
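
# Predictions on new data can be obtained with h2o.predict (a sketch,
# reusing prostate.hex and myX from the CRAN examples above; here the
# training frame doubles as the scoring frame for illustration).
prostate.glm = h2o.glm(y = "VOL", x = myX, data = prostate.hex,
                       family = "gaussian")
pred.hex = h2o.predict(prostate.glm, newdata = prostate.hex)
head(pred.hex)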

# GLM variable importance
# Also see:
#   https://github.com/0xdata/h2o/blob/master/R/tests/testdir_demos/runit_demo_VI_all_algos.R
data.hex = h2o.importFile(
  localH2O,
  path = "https://raw.github.com/0xdata/h2o/master/smalldata/bank-additional-full.csv",
  key = "data.hex")
myX = 1:20
myY = "y"
my.glm = h2o.glm(x = myX, y = myY, data = data.hex, family = "binomial",
                 standardize = TRUE, use_all_factor_levels = TRUE,
                 higher_accuracy = TRUE, lambda_search = TRUE,
                 return_all_lambda = TRUE, variable_importances = TRUE)
best_model = my.glm@best_model
n_coeff = abs(my.glm@models[[best_model]]@model$normalized_coefficients)
VI = abs(n_coeff[-length(n_coeff)])
glm.VI = VI[order(VI, decreasing = TRUE)]
print(glm.VI)
