h2o.naiveBayes: Naive Bayes Model in H2O

Description

Compute naive Bayes probabilities on an H2O dataset.

Usage

h2o.naiveBayes(x, y, training_frame, validation_frame = NULL, model_id,
  ignore_const_cols = TRUE, laplace = 0, threshold = 0.001, eps = 0,
  nfolds = 0, fold_column = NULL,
  fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"), seed,
  keep_cross_validation_predictions = FALSE,
  keep_cross_validation_fold_assignment = FALSE,
  compute_metrics = TRUE, max_runtime_secs = 0)

Arguments

x

A vector containing the names or indices of the predictor variables to use in building the model.

y

The name or index of the response variable. If the data does not contain a header, this is the column index number starting at 0, and increasing from left to right. The response must be a categorical variable with at least two levels.

training_frame

An H2OFrame object containing the variables in the model.

validation_frame

An H2OFrame object used to validate the model. It should contain the same variables as the training frame. Defaults to NULL.

model_id

(Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

ignore_const_cols

A logical value indicating whether or not to ignore all the constant columns in the training frame.

laplace

A non-negative number controlling Laplace smoothing. The default of 0 disables smoothing.

threshold

The minimum standard deviation to use for observations without enough data. Must be at least 1e-10.

eps

A threshold cutoff to deal with numeric instability. Must be non-negative (the default is 0).

nfolds

(Optional) Number of folds for cross-validation.

fold_column

(Optional) Column with the cross-validation fold index assignment per observation.

fold_assignment

Cross-validation fold assignment scheme, used if fold_column is not specified. Must be "AUTO", "Random", "Modulo", or "Stratified". For classification problems, the "Stratified" option stratifies the folds based on the response variable.

seed

Seed for random numbers (affects sampling).

keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

compute_metrics

A logical value indicating whether model metrics should be computed. Set to FALSE to reduce the runtime of the algorithm.

max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.
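
To make the effect of the laplace argument concrete, the following base R sketch applies the textbook additive-smoothing formula to a single categorical predictor. The helper smoothed_cond_prob and the toy vectors are hypothetical and are not part of the h2o package; H2O's internal computation may differ in detail. The point is only that a positive smoothing value keeps an unseen level from receiving a conditional probability of exactly zero.

```r
# Textbook additive (Laplace) smoothing for one categorical predictor.
# `smoothed_cond_prob` is a hypothetical helper, not an h2o function.
smoothed_cond_prob <- function(x, y, laplace = 0) {
  x <- factor(x)
  y <- factor(y)
  counts <- table(y, x)   # rows: response classes, columns: predictor levels
  # P(x = level | y = class), with `laplace` added to every cell count
  (counts + laplace) / (rowSums(counts) + laplace * nlevels(x))
}

# Toy data (made up for illustration): no "rep" row has vote == "n"
vote  <- c("y", "y", "n", "y", "y", "y")
party <- c("dem", "dem", "dem", "rep", "rep", "rep")

p0 <- smoothed_cond_prob(vote, party, laplace = 0)  # unseen level gets probability 0
p1 <- smoothed_cond_prob(vote, party, laplace = 1)  # unseen level gets a small probability

print(p0)
print(p1)
```

Without smoothing, a single zero conditional probability forces the whole product for that class to zero, which is why a small positive laplace value is often used when some level and class combinations are rare.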

Value

Returns an object of class H2OBinomialModel if the response has two categorical levels, and H2OMultinomialModel otherwise.

Details

The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.
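
The behaviour described above can be sketched in base R for numeric predictors. This is a minimal illustration under stated assumptions, not H2O's implementation: fit_gaussian_nb and predict_gaussian_nb are hypothetical helpers, the toy data is made up, and flooring every standard deviation at threshold is a simplification of what the threshold argument does.

```r
# Hypothetical helpers illustrating the Details section; not h2o code.
fit_gaussian_nb <- function(X, y, threshold = 0.001) {
  keep <- stats::complete.cases(X) & !is.na(y)  # skip rows containing any NA
  X <- X[keep, , drop = FALSE]
  y <- factor(y[keep])
  list(
    levels = levels(y),
    prior  = as.numeric(table(y)) / length(y),
    # classes x predictors matrices of per-class means and standard deviations
    mean   = sapply(X, function(col) tapply(col, y, mean)),
    sd     = sapply(X, function(col) pmax(tapply(col, y, sd), threshold, na.rm = TRUE))
  )
}

predict_gaussian_nb <- function(model, newrow) {
  log_post <- log(model$prior)
  for (j in colnames(model$mean)) {
    v <- newrow[[j]]
    if (is.na(v)) next  # a missing predictor is omitted from the product
    log_post <- log_post + dnorm(v, model$mean[, j], model$sd[, j], log = TRUE)
  }
  post <- exp(log_post - max(log_post))  # subtract the max for numerical stability
  setNames(post / sum(post), model$levels)
}

# Toy data: the last training row has an NA and is dropped when fitting
train <- data.frame(x1 = c(1.0, 1.2, 0.8, 3.0, 3.2, NA),
                    x2 = c(5.0, 6.0, 5.5, 1.0, 2.0, 1.5))
label <- c("a", "a", "a", "b", "b", "b")

model <- fit_gaussian_nb(train, label)
probs <- predict_gaussian_nb(model, list(x1 = NA_real_, x2 = 5.5))
print(model$prior)
print(probs)
```

Because the sixth training row is skipped, the class priors are computed from five rows rather than six. The prediction for a row with x1 missing uses only the prior and the density of x2.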

Examples


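library(h2o)
# Start or connect to a local H2O cluster
h2o.init()
# Locate the house votes dataset shipped with the h2o package and upload it
votesPath <- system.file("extdata", "housevotes.csv", package = "h2o")
votes.hex <- h2o.uploadFile(path = votesPath, header = TRUE)
# Column 1 is the response, columns 2 to 17 are the predictors
h2o.naiveBayes(x = 2:17, y = 1, training_frame = votes.hex, laplace = 3)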