h2o.prcomp: Principal Components Analysis

Description

Performs principal components analysis on an H2O data frame. The decomposition is computed by the method selected in pca_method, for example a singular value decomposition of the Gram matrix or the power method.

Usage

h2o.prcomp(training_frame, x, k, model_id, ignore_const_cols = TRUE, max_iterations = 1000, transform = c("NONE", "DEMEAN", "DESCALE", "STANDARDIZE"), pca_method = c("GramSVD", "Power", "Randomized", "GLRM"), use_all_factor_levels = FALSE, compute_metrics = TRUE, impute_missing = FALSE, seed, max_runtime_secs = 0)

Arguments

training_frame

An H2OFrame object containing the variables in the model.

x

(Optional) A vector containing the data columns on which SVD operates.

k

The number of principal components to be computed. This must be between 1 and min(ncol(training_frame), nrow(training_frame)) inclusive.

model_id

(Optional) The unique hex key assigned to the resulting model. Automatically generated if none is provided.

ignore_const_cols

A logical value indicating whether or not to ignore all the constant columns in the training frame.

max_iterations

The maximum number of iterations allowed in each power iteration loop. Must be between 1 and 1e6 inclusive. Defaults to 1000.

transform

A character string that indicates how the training data should be transformed before running PCA. Possible values are "NONE" for no transformation, "DEMEAN" for subtracting the mean of each column, "DESCALE" for dividing each column by its standard deviation, "STANDARDIZE" for demeaning and descaling, and "NORMALIZE" for demeaning and dividing each column by its range (max - min).
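The transformations described above can be sketched in base R on a small matrix. This is only an illustration and not H2O's implementation. The matrix X and the helper names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```r
# Base-R sketch of the column transforms (not H2O's implementation).
# Column a holds 1..4 and column b holds 10..40.
X <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), ncol = 2,
            dimnames = list(NULL, c("a", "b")))

col_means  <- colMeans(X)
col_sds    <- apply(X, 2, sd)   # assumption: sample standard deviation
col_ranges <- apply(X, 2, function(v) max(v) - min(v))

demean      <- sweep(X, 2, col_means, "-")        # like "DEMEAN"
descale     <- sweep(X, 2, col_sds, "/")          # like "DESCALE"
standardize <- sweep(demean, 2, col_sds, "/")     # like "STANDARDIZE"
normalize   <- sweep(demean, 2, col_ranges, "/")  # like "NORMALIZE"
```

After standardizing, every column has mean 0 and standard deviation 1, so columns measured on very different scales contribute comparably to the components.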

pca_method

A character string that indicates how PCA should be calculated. Possible values are "GramSVD" for distributed computation of the Gram matrix followed by a local SVD using the JAMA package, "Power" for computation of the SVD using the power iteration method, "Randomized" for an approximate SVD obtained by projecting onto a random subspace (see References), and "GLRM" for fitting a generalized low rank model with an l2 loss function (no regularization) and solving for the SVD using local matrix algebra.
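The idea behind the "GramSVD" option can be illustrated in base R: form the small Gram matrix of the demeaned data, then decompose it locally. This is a minimal single-machine sketch, not H2O's distributed implementation. The simulated data and variable names are hypothetical, and the result is compared against stats::prcomp only as a sanity check.

```r
# Sketch: PCA via an SVD of the Gram matrix (single machine, base R).
set.seed(42)
X  <- matrix(rnorm(200), nrow = 50, ncol = 4)   # hypothetical data
Xc <- scale(X, center = TRUE, scale = FALSE)    # demean each column

G <- crossprod(Xc)   # Gram matrix t(Xc) %*% Xc, only 4 x 4
s <- svd(G)          # local SVD of the small matrix

components <- s$v                           # principal directions (sign is arbitrary)
std_dev    <- sqrt(s$d / (nrow(Xc) - 1))    # standard deviation of each component

# Reference result from base R for comparison
ref <- prcomp(X, center = TRUE, scale. = FALSE)
```

The Gram matrix has one row and column per variable regardless of the number of observations, which is why the decomposition step can run locally even when the data frame itself is large.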

use_all_factor_levels

(Optional) A logical value indicating whether all factor levels should be included in each categorical column expansion. If FALSE, the indicator column corresponding to the first factor level of every categorical variable will be dropped. Defaults to FALSE.
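The effect of dropping the first factor level can be previewed with base R's model.matrix. This is an analogy only, assuming R's default treatment contrasts. The data frame df is hypothetical, and H2O's own expansion and level ordering may differ.

```r
# Sketch: indicator-column expansion of a categorical variable in base R.
df <- data.frame(color = factor(c("red", "green", "blue", "green")))

# All levels kept: one indicator column per level
all_levels <- model.matrix(~ color - 1, data = df)

# First level dropped: treatment contrasts omit the baseline level ("blue");
# the intercept column is removed to leave only the indicators
dropped <- model.matrix(~ color, data = df)[, -1, drop = FALSE]

colnames(all_levels)
colnames(dropped)
```

With all levels kept, the indicator columns of one variable always sum to 1, so they are linearly dependent once a constant is present. Dropping one level removes that redundancy.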

compute_metrics

(Optional) A logical value indicating whether to compute metrics on the training data, which requires additional calculation time. Only used if pca_method = "GLRM". Defaults to TRUE.

impute_missing

(Optional) A logical value indicating whether missing values should be imputed with the mean of the corresponding column. This is necessary if too many entries are NA when using methods like GramSVD. Defaults to FALSE.
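Mean imputation of the kind described above can be sketched in base R as follows. The data frame X and the helper impute_mean are hypothetical and only illustrate the idea. This is not how H2O performs the imputation internally.

```r
# Sketch: replace each NA with the mean of its column (base R).
X <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 8))

impute_mean <- function(col) {
  # Column mean is computed from the observed (non-missing) entries only
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}

X_imputed <- as.data.frame(lapply(X, impute_mean))
```

Because each missing entry is replaced by its column mean, the column means are unchanged, while the column variances shrink slightly.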

seed

(Optional) Random seed used to initialize the right singular vectors at the beginning of each power method iteration.
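To show why a seed and an iteration cap matter for the power method, here is a minimal base-R sketch of power iteration for the leading eigenvector of a small symmetric matrix. The function power_iteration, its tol argument, and the matrix G are hypothetical and are not part of the h2o API.

```r
# Sketch: power iteration with a seeded random start (base R).
power_iteration <- function(G, max_iterations = 1000, tol = 1e-10, seed = 1) {
  set.seed(seed)                 # makes the random starting vector reproducible
  v <- rnorm(ncol(G))
  v <- v / sqrt(sum(v^2))        # normalize to unit length
  for (i in seq_len(max_iterations)) {
    w <- as.vector(G %*% v)      # multiply by the matrix
    w <- w / sqrt(sum(w^2))      # renormalize
    converged <- sqrt(sum((w - v)^2)) < tol
    v <- w
    if (converged) break         # stop early once the vector stabilizes
  }
  list(vector = v, value = sum(v * (G %*% v)), iterations = i)
}

G   <- matrix(c(4, 1, 1, 3), nrow = 2)   # small symmetric positive definite matrix
fit <- power_iteration(G, seed = 123)
```

Running the function twice with the same seed gives the same starting vector and therefore the same result, and max_iterations bounds the work if convergence is slow.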

max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Value

Returns an object of class H2ODimReductionModel.

References

N. Halko, P.G. Martinsson, J.A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., Survey and Review section, Vol. 53, No. 2, pp. 217-288, June 2011. http://arxiv.org/abs/0909.4061

Examples

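library(h2o)
h2o.init()  # start or connect to a local H2O cluster
# Load the Australia sample data shipped with the h2o package
ausPath <- system.file("extdata", "australia.csv", package = "h2o")
australia.hex <- h2o.uploadFile(path = ausPath)
# Compute 8 principal components on standardized columns
h2o.prcomp(training_frame = australia.hex, k = 8, transform = "STANDARDIZE")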