Perform k-means clustering on a Spark DataFrame.
ml_kmeans(x, centers, iter.max = 100, features = tbl_vars(x),
  compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options(), ...)

x: An object coercible to a Spark DataFrame (typically, a tbl_spark).
centers: The number of cluster centers to compute.
iter.max: The maximum number of iterations to use.
features: The names of the features (terms) to use in the model fit.
compute.cost: Whether to compute the cost for the k-means model using Spark's computeCost.
tolerance: The convergence tolerance for iterative algorithms.
ml.options: Optional arguments used to affect the model generated. See ml_options for more details.
...: Optional arguments. The data argument can be used to specify the data to be used when x is a formula; this allows calls of the form ml_linear_regression(y ~ x, data = tbl), and is especially useful in conjunction with do.
An ml_model object of class kmeans, with overloaded print, fitted, and predict methods.
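A minimal usage sketch, assuming a local Spark installation is available via spark_connect (the connection master, the choice of the built-in iris data, and the two feature columns are illustrative, not part of this page; note that copy_to replaces dots in column names with underscores):

```r
library(sparklyr)

# Hypothetical local connection; adjust master/config for your cluster.
sc <- spark_connect(master = "local")

# Copy the built-in iris data to Spark (columns become e.g. Petal_Length).
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# Fit k-means with 3 centers on two petal features.
model <- ml_kmeans(
  iris_tbl,
  centers  = 3,
  features = c("Petal_Length", "Petal_Width")
)

print(model)      # cluster centers and model summary
predict(model)    # cluster assignments for the training data

spark_disconnect(sc)
```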
Bahmani et al., Scalable K-Means++, VLDB 2012
For information on how Spark k-means clustering is implemented, please see http://spark.apache.org/docs/latest/mllib-clustering.html#k-means.
Other Spark ML routines: ml_als_factorization,
ml_decision_tree,
ml_generalized_linear_regression,
ml_gradient_boosted_trees,
ml_lda, ml_linear_regression,
ml_logistic_regression,
ml_multilayer_perceptron,
ml_naive_bayes,
ml_one_vs_rest, ml_pca,
ml_random_forest,
ml_survival_regression