Usage
h2o.glm(x, y, training_frame, model_id, validation_frame = NULL,
  ignore_const_cols = TRUE, max_iterations = 50, beta_epsilon = 0,
  solver = c("IRLSM", "L_BFGS"), standardize = TRUE,
  family = c("gaussian", "binomial", "poisson", "gamma", "tweedie",
  "multinomial"), link = c("family_default", "identity", "logit", "log",
  "inverse", "tweedie"), tweedie_variance_power = NaN,
  tweedie_link_power = NaN, alpha = 0.5, prior = NULL, lambda = 1e-05,
  lambda_search = FALSE, nlambdas = -1, lambda_min_ratio = -1,
  nfolds = 0, fold_column = NULL, fold_assignment = c("AUTO", "Random",
  "Modulo"), keep_cross_validation_predictions = FALSE,
  beta_constraints = NULL, offset_column = NULL, weights_column = NULL,
  intercept = TRUE, max_active_predictors = -1, objective_epsilon = -1,
  gradient_epsilon = -1, non_negative = FALSE, compute_p_values = FALSE,
  remove_collinear_columns = FALSE, max_runtime_secs = 0,
  missing_values_handling = c("MeanImputation", "Skip"))
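As a rough illustration of how the signature above might be called, the sketch below wraps h2o.glm in a small helper. The helper name fit_binomial_glm and the data.frame df in the commented call are hypothetical; only arguments listed in the Usage are passed, and the chosen values are examples rather than recommendations.

```r
# Hypothetical wrapper (not part of h2o): fits an elastic-net logistic
# regression using only arguments listed in the Usage above.
fit_binomial_glm <- function(train, predictors, response) {
  h2o::h2o.glm(
    x = predictors,
    y = response,
    training_frame = train,
    family = "binomial",
    alpha = 0.5,           # equal mix of L1 and L2 penalties
    lambda_search = TRUE,  # search over lambda rather than using one value
    nfolds = 5             # 5-fold cross-validation; no validation_frame given
  )
}

# Example call, assuming a running H2O cluster and a data.frame `df`
# with a two-level factor column "y":
#   h2o::h2o.init()
#   train <- h2o::as.h2o(df)
#   model <- fit_binomial_glm(train, setdiff(names(df), "y"), "y")
```

Defining the wrapper does not contact a cluster; h2o is only needed when the function is actually called.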
Arguments
x
A vector containing the names or indices of the predictor variables to use in building the GLM model.
y
A character string or index that represents the response variable in the model.
training_frame
An H2OFrame object containing the variables in the model.
model_id
(Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
validation_frame
An H2OFrame object containing the validation data for the model. Defaults to NULL.
ignore_const_cols
A logical value indicating whether or not to ignore all the constant columns in the training frame.
max_iterations
A non-negative integer specifying the maximum number of iterations.
beta_epsilon
A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations.
Defines the convergence criterion for h2o.glm.
solver
A character string specifying the solver used: either IRLSM (supports more features) or L_BFGS (scales better for datasets with many columns).
standardize
A logical value indicating whether the numeric predictors should be standardized to have a mean of 0 and a variance of 1 prior to training the models.
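The transformation described for standardize can be sketched in base R. This is only an illustration of "mean 0, variance 1"; it assumes the sample standard deviation (n - 1 denominator) used by sd(), which may differ from what h2o.glm computes internally, and the vector x is made-up data.

```r
# Made-up numeric predictor
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Center to mean 0 and scale to unit variance
z <- (x - mean(x)) / sd(x)

mean(z)  # approximately 0 (up to floating-point error)
var(z)   # approximately 1
```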
family
A character string specifying the distribution of the model: gaussian, binomial, poisson, gamma, tweedie, multinomial.
link
A character string specifying the link function. The default is the canonical link for the family. The supported links for each of the family specifications are:
"gaussian": "identity", "log"
tweedie_variance_power
A numeric specifying the power for the variance function when family = "tweedie".
tweedie_link_power
A numeric specifying the power for the link function when family = "tweedie".
alpha
A numeric in [0, 1] specifying the elastic-net mixing parameter.
The elastic-net penalty is defined to be:
$$P(\alpha,\beta) = \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 = \sum_j \left[\frac{1-\alpha}{2}\beta_j^2 + \alpha|\beta_j|\right]$$
making alpha = 1 the lasso penalty and alpha = 0 the ridge penalty.
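The penalty formula above can be checked numerically with a few lines of base R. The function elastic_net_penalty and the coefficient vector beta are hypothetical names introduced here for illustration; they are not part of the h2o package.

```r
# Elastic-net penalty P(alpha, beta) as defined above (illustrative helper,
# not part of the h2o package).
elastic_net_penalty <- function(alpha, beta) {
  stopifnot(alpha >= 0, alpha <= 1)
  l2 <- sum(beta^2)     # squared L2 norm
  l1 <- sum(abs(beta))  # L1 norm
  (1 - alpha) / 2 * l2 + alpha * l1
}

beta <- c(0.5, -2, 0)           # made-up coefficients
elastic_net_penalty(1, beta)    # only the L1 term remains: 2.5
elastic_net_penalty(0, beta)    # only the L2 term remains: 2.125
elastic_net_penalty(0.5, beta)  # mix of both terms: 2.3125
```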
prior
(Optional) A numeric specifying the prior probability of class 1 in the response when family = "binomial". The default prior is the observational frequency of class 1. Must be in the open interval (0, 1), or NULL (no prior).
lambda
A non-negative shrinkage parameter for the elastic-net, which multiplies $P(\alpha,\beta)$ in the objective function. When lambda = 0, no elastic-net penalty is applied and ordinary generalized linear models are fit.
lambda_search
A logical value indicating whether to conduct a search over the space of lambda values, starting from lambda max. When the search is enabled, the given lambda is interpreted as lambda min.
nlambdas
The number of lambda values to use when lambda_search = TRUE.
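To give a feel for how nlambdas and lambda_min_ratio could shape a lambda search, here is a minimal sketch that builds a log-spaced sequence from lambda max down to lambda max times the ratio. Log spacing is an assumption borrowed from common practice in regularization-path software; this page does not state how h2o.glm spaces its values. The helper lambda_path and the value lambda_max = 1 are hypothetical.

```r
# Hypothetical helper: a log-spaced lambda sequence (assumed spacing,
# not confirmed by this documentation).
lambda_path <- function(lambda_max, lambda_min_ratio, nlambdas) {
  exp(seq(log(lambda_max),
          log(lambda_max * lambda_min_ratio),
          length.out = nlambdas))
}

# Five values from 1 down to 1e-4, each a factor of 10 apart
path <- lambda_path(lambda_max = 1, lambda_min_ratio = 1e-4, nlambdas = 5)
path
```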
lambda_min_ratio
Smallest value for lambda as a fraction of lambda.max. By default, if the number of observations is greater than the number of variables, then lambda_min_ratio = 0.0001; if the number of observations is less than the number of variables the
nfolds
(Optional) Number of folds for cross-validation. If nfolds >= 2, then validation_frame must remain empty.
fold_column
(Optional) Column with cross-validation fold index assignment per observation.
fold_assignment
Cross-validation fold assignment scheme, used if fold_column is not specified. Must be "AUTO", "Random" or "Modulo".
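The difference between a modulo-style and a random fold assignment can be sketched in base R. This assumes that "Modulo" means assigning each row to fold (row index mod nfolds) and that "Random" means drawing folds uniformly at random; the exact behaviour inside h2o.glm is not described on this page. All variable names are illustrative.

```r
n <- 10      # number of observations (made up)
nfolds <- 3

# Modulo-style: folds cycle deterministically through 0, 1, 2, 0, 1, 2, ...
modulo_folds <- (seq_len(n) - 1) %% nfolds

# Random-style: each row gets a fold drawn uniformly from 0..(nfolds - 1)
set.seed(42)
random_folds <- sample.int(nfolds, n, replace = TRUE) - 1

table(modulo_folds)  # fold sizes differ by at most one
```

A vector like either of these could also be stored as a column of the training frame and named in fold_column to fix the assignment explicitly.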
keep_cross_validation_predictions
Whether to keep the predictions of the cross-validation models.
beta_constraints
A data.frame or H2OParsedData object with the columns ["names",
"lower_bounds", "upper_bounds", "beta_given", "rho"], where each row corresponds to a predictor
in the GLM. "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the low
offset_column
Specify the offset column.
weights_column
Specify the weights column.
intercept
Logical, include constant term (intercept) in the model.
max_active_predictors
(Optional) Convergence criterion for the number of predictors when using an L1 penalty.
objective_epsilon
Convergence criterion. Converge if the relative change in the objective function is below this threshold.
gradient_epsilon
Convergence criterion. Converge if the l-infinity norm of the gradient is below this threshold.
non_negative
Logical, restrict coefficients to be non-negative.
compute_p_values
(Optional) Logical, compute p-values, only allowed with IRLSM solver and no regularization. May fail if there are collinear predictors.
remove_collinear_columns
(Optional) Logical, valid only with no regularization. If set, collinear columns are automatically ignored (their coefficients will be 0).
max_runtime_secs
Maximum allowed runtime in seconds for model training. Use 0 to disable.
missing_values_handling
(Optional) Controls handling of missing values. Can be either "MeanImputation" or "Skip". MeanImputation replaces missing values with the mean for numeric columns and the most frequent level for categorical columns; Skip ignores observations with any missing value. Applied both
...
(Currently Unimplemented)
coefficients.