kmpp: Initialization of cluster prototypes using the K-means++ algorithm

Description

Initializes the cluster prototypes matrix using the K-means++ algorithm proposed by Arthur and Vassilvitskii (2007).

Usage

kmpp(x, k)

Arguments

x

a numeric vector, data frame or matrix.

k

an integer specifying the number of clusters.

Value

an object of class ‘inaparc’, which is a list consisting of the following items:

v

a numeric matrix containing the initial cluster prototypes.

ctype

a string representing the type of centroid used to build the prototype matrix. Its value is ‘obj’ for this function because the cluster prototypes are data objects selected by the algorithm.

call

a string containing the matched function call that generates this ‘inaparc’ object.

Details

K-means++ (Arthur & Vassilvitskii, 2007) is usually reported as an efficient approximation algorithm that overcomes the poor clustering sometimes produced by the standard K-means algorithm. It merges MacQueen's second method with the ‘Maximin’ method to initialize the cluster prototypes (Ji et al., 2015), selecting data objects that are far away from each other in a probabilistic manner. The first cluster prototype (center) is chosen at random from the data objects. Each remaining prototype is then drawn from the data objects, where object \(x'\) is selected with probability \(md(x')^2/\sum_{i=1}^{n} md(x_i)^2\). Here \(n\) is the number of data objects and \(md(x)\) is the minimum distance between the data object \(x\) and the previously selected prototypes.
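The seeding rule described above can be sketched in a few lines of base R. This is only an illustration, not the code used by kmpp: the function name kmpp_sketch is hypothetical, Euclidean distance is assumed, and the data are assumed to contain at least k distinct rows.

```r
# Illustrative K-means++ seeding. kmpp_sketch is a hypothetical name and
# this is not the inaparc implementation. Assumes Euclidean distance and
# at least k distinct rows in x.
kmpp_sketch <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k >= 1, k <= n)
  idx <- integer(k)
  # First prototype: one data object drawn uniformly at random.
  idx[1] <- sample.int(n, 1)
  # md2[i]: squared distance from object i to its nearest chosen prototype.
  md2 <- rowSums(sweep(x, 2, x[idx[1], ])^2)
  for (j in seq_len(k - 1) + 1) {
    # Draw the next prototype with probability md2 / sum(md2);
    # sample.int() normalizes the weights itself.
    idx[j] <- sample.int(n, 1, prob = md2)
    # Keep, for each object, the distance to its nearest prototype so far.
    md2 <- pmin(md2, rowSums(sweep(x, 2, x[idx[j], ])^2))
  }
  # Return the selected data objects as a k-row prototype matrix.
  x[idx, , drop = FALSE]
}

set.seed(1)
v <- kmpp_sketch(iris[, 1:4], k = 5)
print(v)
```

Objects that have already been selected have a minimum distance of zero, so they receive zero probability and are not drawn again.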

The function kmpp is an implementation of the K-means++ initialization algorithm based on the code ‘k-meansp2.R’, authored by M. Sugiyama. Its distance computations are vectorized, which reduces execution time.
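As a rough illustration of what vectorized distance computation can look like in R, the sketch below computes the squared distances from every data object to one prototype, first with an explicit loop and then with a single matrix expression. Both helper names are hypothetical and neither is taken from the package source. Squared Euclidean distance is assumed.

```r
# Squared Euclidean distances from every row of x to one prototype p.
# dist2_loop and dist2_vec are hypothetical helpers for this comparison.
dist2_loop <- function(x, p) {
  out <- numeric(nrow(x))
  # One distance per iteration.
  for (i in seq_len(nrow(x))) out[i] <- sum((x[i, ] - p)^2)
  out
}

dist2_vec <- function(x, p) {
  # Subtract p from every row at once, then sum the squares per row.
  rowSums(sweep(x, 2, p)^2)
}

x <- as.matrix(iris[, 1:4])
p <- x[1, ]
# Both versions should agree up to floating point tolerance.
stopifnot(isTRUE(all.equal(dist2_loop(x, p), unname(dist2_vec(x, p)))))
```

The vectorized form pushes the per-object work into compiled matrix routines, which is typically faster than an interpreted loop on larger data sets.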

References

Arthur, D. & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proc. of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027-1035. url:http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

Sugiyama, M. ‘mahito-sugiyama/k-meansp2.R’. url:https://gist.github.com/mahito-sugiyama/ef54a3b17fff4629f106

Examples

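# NOT RUN {
library(inaparc)
data(iris)
# Select 5 initial prototypes from the four numeric columns of iris
res <- kmpp(x=iris[,1:4], k=5)
# v holds the matrix of initial cluster prototypes
v <- res$v
print(v)
# }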
