h2o.naiveBayes: Compute naive Bayes probabilities on an H2O dataset.

Description

The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.

Usage

h2o.naiveBayes(
  x,
  y,
  training_frame,
  model_id = NULL,
  nfolds = 0,
  seed = -1,
  fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
  fold_column = NULL,
  keep_cross_validation_models = TRUE,
  keep_cross_validation_predictions = FALSE,
  keep_cross_validation_fold_assignment = FALSE,
  validation_frame = NULL,
  ignore_const_cols = TRUE,
  score_each_iteration = FALSE,
  balance_classes = FALSE,
  class_sampling_factors = NULL,
  max_after_balance_size = 5,
  laplace = 0,
  threshold = 0.001,
  min_sdev = 0.001,
  eps = 0,
  eps_sdev = 0,
  min_prob = 0.001,
  eps_prob = 0,
  compute_metrics = TRUE,
  max_runtime_secs = 0,
  export_checkpoints_dir = NULL,
  gainslift_bins = -1,
  auc_type = c("AUTO", "NONE", "MACRO_OVR", "WEIGHTED_OVR", "MACRO_OVO", "WEIGHTED_OVO")
)

Value

an object of class H2OBinomialModel if the response has two categorical levels, and H2OMultinomialModel otherwise.

Arguments

x: (Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
y: The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, then a regression model will be trained, otherwise it will train a classification model.
training_frame: Id of the training data frame.
model_id: Destination id for this model; auto-generated if not specified.
nfolds: Number of folds for K-fold cross-validation (0 to disable or >= 2). Defaults to 0.
seed: Seed for random numbers (affects certain parts of the algo that are stochastic and those might or might not be enabled by default). Defaults to -1 (time-based random number).
fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.
fold_column: Column with cross-validation fold index assignment per observation.
keep_cross_validation_models: Logical. Whether to keep the cross-validation models. Defaults to TRUE.
keep_cross_validation_predictions: Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.
keep_cross_validation_fold_assignment: Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE.
validation_frame: Id of the validation data frame.
ignore_const_cols: Logical. Ignore constant columns. Defaults to TRUE.
score_each_iteration: Logical. Whether to score during each iteration of model training. Defaults to FALSE.
balance_classes: Logical. Balance training data class counts via over/under-sampling (for imbalanced data). Defaults to FALSE.
class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
max_after_balance_size: Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Defaults to 5.0.
laplace: Laplace smoothing parameter Defaults to 0.
threshold: This argument is deprecated, use `min_sdev` instead. The minimum standard deviation to use for observations without enough data. Must be at least 1e-10.
min_sdev: The minimum standard deviation to use for observations without enough data. Must be at least 1e-10.
eps: This argument is deprecated, use `eps_sdev` instead. A threshold cutoff to deal with numeric instability, must be positive.
eps_sdev: A threshold cutoff to deal with numeric instability, must be positive.
min_prob: Min. probability to use for observations with not enough data.
eps_prob: Cutoff below which probability is replaced with min_prob.
compute_metrics: Logical. Compute metrics on training data Defaults to TRUE.
max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
export_checkpoints_dir: Automatically export generated models to this directory.
gainslift_bins: Gains/Lift table number of bins. 0 means disabled.. Default value -1 means automatic binning. Defaults to -1.
auc_type: Set default multinomial AUC type. Must be one of: "AUTO", "NONE", "MACRO_OVR", "WEIGHTED_OVR", "MACRO_OVO", "WEIGHTED_OVO". Defaults to AUTO.

Examples

Run this code

if (FALSE) {
h2o.init()
votes_path <- system.file("extdata", "housevotes.csv", package = "h2o")
votes <- h2o.uploadFile(path = votes_path, header = TRUE)
h2o.naiveBayes(x = 2:17, y = 1, training_frame = votes, laplace = 3)
}

Run the code above in your browser using DataLab