Perform k-means clustering on a Spark DataFrame.
ml_kmeans(x, centers, iter.max = 100, features = tbl_vars(x),
compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options(), ...)
x: An object coercible to a Spark DataFrame (typically, a tbl_spark).

centers: The number of cluster centers to compute.

iter.max: The maximum number of iterations to use.

features: The names of the features (terms) to use for the model fit.

compute.cost: Whether to compute the cost for the k-means model using Spark's computeCost.

tolerance: The convergence tolerance for iterative algorithms.

ml.options: Optional arguments used to affect the model generated. See ml_options for more details.

...: Optional arguments. The data argument can be used to specify the data to be used when x is a formula; this allows calls of the form ml_linear_regression(y ~ x, data = tbl), and is especially useful in conjunction with do.
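As a sketch of the formula interface described above (this assumes a live local Spark connection via sparklyr; the dataset and column names are illustrative, and note that copy_to replaces dots in column names with underscores):

```r
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Cluster on two columns via the formula interface;
# the `data` argument supplies the Spark table.
model <- ml_kmeans(~ Petal_Length + Petal_Width,
                   data = iris_tbl, centers = 3)

spark_disconnect(sc)
```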
An ml_model object of class kmeans with overloaded print, fitted, and predict functions.
Bahmani et al., Scalable K-Means++, VLDB 2012
For information on how Spark k-means clustering is implemented, please see http://spark.apache.org/docs/latest/mllib-clustering.html#k-means.
Other Spark ML routines: ml_als_factorization, ml_decision_tree, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_lda, ml_linear_regression, ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_one_vs_rest, ml_pca, ml_random_forest, ml_survival_regression
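A minimal end-to-end sketch of fitting and using the model, assuming a local Spark installation and using the built-in iris dataset (column names are as rewritten by copy_to, which replaces dots with underscores):

```r
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Fit k-means with 3 cluster centers on two feature columns
model <- ml_kmeans(iris_tbl, centers = 3,
                   features = c("Petal_Length", "Petal_Width"))

print(model)               # overloaded print: shows the fitted cluster centers
fitted(model)              # cluster assignments for the training data
predict(model, iris_tbl)   # assignments for new data (here, the same table)

spark_disconnect(sc)
```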