KMeansTrainer: K-Means Trainer

Description

Trains a k-means clustering model in R.

Arguments

Public fields

clusters: the number of clusters
batch_size: the size of the mini batches
num_init: number of times the algorithm will be run with different centroid seeds
max_iters: the maximum number of clustering iterations
init_fraction: fraction of the data to use for initializing the centroids (applies if initializer is kmeans++ or optimal_init). Should be a float number between 0.0 and 1.0.
initializer: the method of initialization. One of optimal_init, quantile_init, kmeans++ or random.
early_stop_iter: the number of further iterations to run after the best within-cluster sum of squared error has been calculated
verbose: either TRUE or FALSE, indicating whether progress is printed during clustering
centroids: a matrix of initial cluster centroids. The number of rows of the centroids matrix should equal the number of clusters, and the number of columns should equal the number of columns of the data
tol: a float number. If, at any iteration (iteration > 1 and iteration < max_iters), "tol" is greater than the squared norm of the centroids, then kmeans has converged
tol_optimal_init: tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart from each other the centroids are.
seed: integer value for the random number generator (RNG)
model: for internal use
max_clusters: either a single numeric value or a contiguous or non-contiguous numeric vector specifying the cluster search space
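
The shape requirement on centroids can be checked before the trainer is constructed. The following is a minimal base R sketch, not part of the package: the toy matrix X, the variable names, and the choice of sampling data rows as starting centroids are all assumptions made for illustration.

```r
set.seed(1)

# Toy data (illustrative only): 300 rows, 4 feature columns
X <- rbind(matrix(rnorm(600, mean = 2), ncol = 4),
           matrix(rnorm(600, mean = -1), ncol = 4))
k <- 2  # intended number of clusters

# Pick k rows of the data as starting centroids (one possible choice)
init_centroids <- X[sample(nrow(X), k), , drop = FALSE]

# One row per cluster, one column per feature
stopifnot(nrow(init_centroids) == k,
          ncol(init_centroids) == ncol(X))
```

A matrix built this way would then be supplied as centroids = init_centroids together with clusters = k.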

Methods

Public methods

Method `new()`

Usage

KMeansTrainer$new(
  clusters,
  batch_size = 10,
  num_init = 1,
  max_iters = 100,
  init_fraction = 1,
  initializer = "kmeans++",
  early_stop_iter = 10,
  verbose = FALSE,
  centroids = NULL,
  tol = 1e-04,
  tol_optimal_init = 0.3,
  seed = 1,
  max_clusters = NA
)

Arguments

clusters: the number of clusters

batch_size

the size of the mini batches

num_init

number of times the algorithm will be run with different centroid seeds

max_iters

the maximum number of clustering iterations

init_fraction

fraction of the data to use for initializing the centroids (applies if initializer is kmeans++ or optimal_init). Should be a float number between 0.0 and 1.0.

initializer

the method of initialization. One of optimal_init, quantile_init, kmeans++ or random.

early_stop_iter

the number of further iterations to run after the best within-cluster sum of squared error has been calculated

verbose

either TRUE or FALSE, indicating whether progress is printed during clustering

centroids

a matrix of initial cluster centroids. The number of rows of the centroids matrix should equal the number of clusters, and the number of columns should equal the number of columns of the data

tol

a float number. If, at any iteration (iteration > 1 and iteration < max_iters), "tol" is greater than the squared norm of the centroids, then kmeans has converged

tol_optimal_init

tolerance value for the 'optimal_init' initializer. The higher this value is, the farther apart from each other the centroids are.

seed

integer value for the random number generator (RNG)

max_clusters

either a single numeric value or a contiguous or non-contiguous numeric vector specifying the cluster search space
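
The role of tol can be illustrated with a small base R sketch. It assumes that "the squared norm of the centroids" refers to the squared norm of the change in the centroids between two consecutive iterations; that reading, and the helper name has_converged, are assumptions and have not been checked against the underlying implementation.

```r
# Hypothetical helper: TRUE when the centroids moved by less than `tol`
has_converged <- function(old_centroids, new_centroids, tol = 1e-4) {
  # Squared norm of the centroid shift, summed over all entries
  shift <- sum((new_centroids - old_centroids)^2)
  shift < tol
}

old <- matrix(c(0, 0,
                5, 5), nrow = 2, byrow = TRUE)
small_move <- old + 1e-3  # every entry moves by 0.001
large_move <- old + 0.5   # every entry moves by 0.5

has_converged(old, small_move)  # squared shift is about 4e-6, below tol
has_converged(old, large_move)  # squared shift is 1, above tol
```

Under this reading, a larger tol stops the iterations earlier, and a smaller tol demands that the centroids settle more tightly before stopping.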

Details

Create a new `KMeansTrainer` object.

Returns

A `KMeansTrainer` object.

Examples

data <- rbind(replicate(20, rnorm(1e4, 2)),
             replicate(20, rnorm(1e4, -1)),
             replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)

Method `fit()`

Usage

KMeansTrainer$fit(X, y = NULL, find_optimal = FALSE)

Arguments

X: data.frame or matrix containing features

y

ignored; kept only for consistency with superml's standard method signature

find_optimal

logical, whether to find the optimal number of clusters automatically

Details

Trains the KMeansTrainer model

Returns

NULL

Examples

data <- rbind(replicate(20, rnorm(1e4, 2)),
             replicate(20, rnorm(1e4, -1)),
             replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)
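
The same call can be made with find_optimal = TRUE, in which case max_clusters defines the search space. The sketch below is hedged: it uses only the arguments documented on this page, runs only if superml is installed, and makes no claim about what the search reports or how the fitted object changes afterwards. The smaller toy matrix X is an assumption chosen to keep the run short.

```r
set.seed(42)

# Toy data (illustrative only): three groups, 4 feature columns
X <- rbind(matrix(rnorm(2000, mean = 2), ncol = 4),
           matrix(rnorm(2000, mean = -1), ncol = 4),
           matrix(rnorm(2000, mean = 5), ncol = 4))

if (requireNamespace("superml", quietly = TRUE)) {
  # max_clusters = 6 sets the cluster search space, as in the examples above
  km_model <- superml::KMeansTrainer$new(clusters = 2, batch_size = 30,
                                         max_clusters = 6)
  km_model$fit(X, find_optimal = TRUE)
}
```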

Method `predict()`

Usage

KMeansTrainer$predict(X)

Arguments

X: data.frame or matrix

Details

Returns the prediction on test data

Returns

a vector of predictions

Examples

data <- rbind(replicate(20, rnorm(1e4, 2)),
             replicate(20, rnorm(1e4, -1)),
             replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)
predictions <- km_model$predict(data)
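
Conceptually, k-means prediction assigns each row to its nearest centroid. The base R sketch below shows that idea on a tiny hand-made example; assign_clusters is a hypothetical helper written for this illustration, not the package's implementation, and the label numbering it produces is not guaranteed to match what predict() returns.

```r
# Hypothetical helper: index of the nearest centroid for each row of X
assign_clusters <- function(X, centroids) {
  X <- as.matrix(X)
  centroids <- as.matrix(centroids)
  # Squared Euclidean distances, rows of X by rows of centroids:
  # |x|^2 + |c|^2 - 2 * x . c
  d2 <- outer(rowSums(X^2), rowSums(centroids^2), "+") -
    2 * tcrossprod(X, centroids)
  # Column with the smallest distance in each row
  max.col(-d2, ties.method = "first")
}

X <- matrix(c(0.0,  0.0,
              0.2, -0.1,
              5.0,  5.0,
              4.8,  5.1), ncol = 2, byrow = TRUE)
centroids <- matrix(c(0, 0,
                      5, 5), ncol = 2, byrow = TRUE)

assign_clusters(X, centroids)  # 1 1 2 2
```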

Method `clone()`

The objects of this class are cloneable with this method.

Usage

KMeansTrainer$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Details

Trains an unsupervised k-means clustering algorithm. It borrows the mini-batch k-means function from the ClusterR package, which is written in C++, hence it is quite fast.
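
For orientation, the mini-batch k-means function in ClusterR can also be called directly. The sketch below assumes that the trainer's arguments correspond to the same-named arguments of ClusterR::MiniBatchKmeans; that correspondence is an assumption, and the block runs only if ClusterR is installed. The toy matrix X is illustrative.

```r
set.seed(1)

# Toy data (illustrative only): two groups, 4 feature columns
X <- rbind(matrix(rnorm(2000, mean = 2), ncol = 4),
           matrix(rnorm(2000, mean = -1), ncol = 4))

if (requireNamespace("ClusterR", quietly = TRUE)) {
  # Direct call to the mini-batch k-means routine
  mb <- ClusterR::MiniBatchKmeans(X, clusters = 2, batch_size = 30,
                                  num_init = 1, max_iters = 100,
                                  initializer = "kmeans++", seed = 1)
  # One centroid row per cluster, one column per feature
  print(dim(mb$centroids))
}
```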

Examples

## ------------------------------------------------
## Method `KMeansTrainer$new`
## ------------------------------------------------

data <- rbind(replicate(20, rnorm(1e4, 2)),
             replicate(20, rnorm(1e4, -1)),
             replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)

## ------------------------------------------------
## Method `KMeansTrainer$fit`
## ------------------------------------------------

data <- rbind(replicate(20, rnorm(1e4, 2)),
             replicate(20, rnorm(1e4, -1)),
             replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)

## ------------------------------------------------
## Method `KMeansTrainer$predict`
## ------------------------------------------------

data <- rbind(replicate(20, rnorm(1e4, 2)),
             replicate(20, rnorm(1e4, -1)),
             replicate(20, rnorm(1e4, 5)))
km_model <- KMeansTrainer$new(clusters=2, batch_size=30, max_clusters=6)
km_model$fit(data, find_optimal = FALSE)
predictions <- km_model$predict(data)
