Creates a specification of a recipe step that will partition numeric variables according to k-medoids clustering and select the cluster medoids.
step_kmedoids(
  recipe,
  ...,
  k = 5,
  center = TRUE,
  scale = TRUE,
  method = c("pam", "clara"),
  metric = "euclidean",
  optimize = FALSE,
  num_samp = 50,
  samp_size = 40 + 2 * k,
  replace = TRUE,
  prefix = "KMedoids",
  role = "predictor",
  skip = FALSE,
  id = recipes::rand_id("kmedoids")
)

# S3 method for step_kmedoids
tunable(x, ...)
Function step_kmedoids creates a new step whose class is of the same name and inherits from step_sbf, adds it to the sequence of existing steps (if any) in the recipe, and returns the updated recipe. The tidy method returns a tibble with columns terms (selectors or variables selected), cluster (cluster assignments), selected (logical indicator of selected cluster medoids), silhouette (silhouette values), and name (names of the selected variables).
recipe
recipe object to which the step will be added.
...
one or more selector functions to choose which variables will be used to compute the components. See selections for more details. These are not currently used by the tidy method.
k
number of k-medoids clusterings of the variables. The value of k is constrained to be between 1 and one less than the number of original variables.
center, scale
logicals indicating whether to mean center and median absolute deviation scale the original variables prior to cluster partitioning, or functions or names of functions for the centering and scaling; not applied to selected variables.
method
character string specifying one of the clustering methods provided by the cluster package. The clara (clustering large applications) method is an extension of pam (partitioning around medoids) designed to handle large datasets.
metric
character string specifying the distance metric for calculating dissimilarities between observations as "euclidean", "manhattan", or "jaccard" (clara only).
optimize
logical indicator or 0:5 integer level specifying optimization for the pam clustering method.
num_samp
number of sub-datasets to sample for the clara clustering method.
samp_size
number of cases to include in each sub-dataset.
replace
logical indicating whether to replace the original variables.
prefix
if the original variables are not replaced, the selected variables are added to the dataset with the character string prefix added to their names; otherwise, the original variable names are retained.
role
analysis role that added step variables should be assigned. By default, they are designated as model predictors.
skip
logical indicating whether to skip the step when the recipe is baked. While all operations are baked when prep is run, some operations may not be applicable to new data (e.g. processing outcome variables). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
id
unique character string to identify the step.
x
step_kmedoids object.
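As a rough illustration of the default preprocessing described for center and scale, the base R sketch below mean-centers each predictor and divides it by its median absolute deviation. It is not the step's internal code: using stats::mad with its default constant is an assumption made for illustration.

```r
# Illustrative sketch only; assumes stats::mad() with default settings
# approximates the "median absolute deviation" scaling described above.
x <- as.matrix(attitude[, -1])        # numeric predictors (rating dropped)

centers <- colMeans(x)                # per-variable means
scales <- apply(x, 2, stats::mad)     # per-variable median absolute deviations

# Subtract the means, then divide by the MADs, column by column.
x_std <- sweep(sweep(x, 2, centers, "-"), 2, scales, "/")

# Column means are zero (up to rounding) after centering.
round(colMeans(x_std), 10)
```

According to the argument description, this transformation applies only to the data used for clustering; the selected variables themselves are returned without it.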
K-medoids clustering partitions variables into k groups such that the dissimilarity between the variables and their assigned cluster medoids is minimized. Cluster medoids are then returned as a set of k variables.
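The idea can be sketched directly with cluster::pam by clustering the transposed data, so that variables rather than observations are the objects being grouped. This is a minimal sketch under assumptions, not the implementation used by the step: the standardization via scale() and stats::mad, and the names x_std, fit, medoid_vars, and selected, are illustrative choices.

```r
library(cluster)

x <- as.matrix(attitude[, -1])        # numeric predictors (rating dropped)

# Assumed preprocessing: mean-center and scale by median absolute deviation.
x_std <- scale(x, center = colMeans(x), scale = apply(x, 2, stats::mad))

# Transpose so that each row is a variable; pam() then partitions variables.
fit <- pam(t(x_std), k = 3, metric = "euclidean")

fit$clustering                        # cluster assignment of each variable
medoid_vars <- colnames(x)[fit$id.med]
medoid_vars                           # one medoid variable per cluster

fit$silinfo$widths[, "sil_width"]     # silhouette value of each variable

# Keep the medoid variables on their original (unstandardized) scale.
selected <- x[, fit$id.med, drop = FALSE]
```

The three columns in selected play the role of the k returned variables; which variables end up as medoids depends on the data and on the preprocessing chosen.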
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. Wiley.
Reynolds, A., Richards, G., de la Iglesia, B., & Rayward-Smith, V. (2006). Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms, 5, 475-504.
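library(recipes)
library(MachineShop)  # provides step_kmedoids()

# Define the recipe and add the k-medoids step with 3 clusters
rec <- recipe(rating ~ ., data = attitude)
kmedoids_rec <- rec %>%
  step_kmedoids(all_predictors(), k = 3)

# Estimate the step on the training data and apply it
kmedoids_prep <- prep(kmedoids_rec, training = attitude)
kmedoids_data <- bake(kmedoids_prep, attitude)

pairs(kmedoids_data, lower.panel = NULL)

# Summaries of the step before and after prep()
tidy(kmedoids_rec, number = 1)
tidy(kmedoids_prep, number = 1)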