hcm: Hard C-Means Clustering

Description

Partitions a numeric data set using the Hard C-Means (HCM) clustering algorithm, also known as K-Means, proposed by MacQueen (1967). The function hcm extends the basic kmeans with additional input arguments and output values so that its clustering results can be compared with those of fuzzy and possibilistic algorithms. For instance, in addition to the Euclidean distance, hcm supports other distance metrics such as the squared Euclidean distance and the squared Chord distance.

Usage

hcm(x, centers, dmetric="euclidean", pw=2, alginitv="kmpp",  
   nstart=1, iter.max=1000, con.val=1e-9, stand=FALSE, numseed)

Arguments

x

a numeric vector, data frame or matrix.

centers

an integer specifying the number of clusters or a numeric matrix containing the initial cluster centers.

dmetric

a string for the distance metric. The default is euclidean for the squared Euclidean distances. See get.dmetrics for the alternative options.

pw

a number for the power used in the Minkowski distance calculation. The default is 2. It is used only if dmetric is minkowski.

alginitv

a string for the initialization of cluster prototypes matrix. The default is kmpp for K-means++ initialization method (Arthur & Vassilvitskii, 2007). For the list of alternative options see get.algorithms.

nstart

an integer for the number of starts for clustering. The default is 1.

iter.max

an integer for the maximum number of iterations allowed. The default is 1000.

con.val

a number for the convergence value between the iterations. The default is 1e-09.

stand

a logical flag to standardize data. Its default value is FALSE. If its value is TRUE, the data matrix x is standardized.

numseed

a seeding number to set the seed of R's random number generator.
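As a hedged illustration of how these arguments combine, the call below uses only the arguments listed above. It assumes the ppclust package is installed, and the specific values (three clusters, five starts, 100 iterations, seed 123) are arbitrary choices for the iris data rather than recommendations.

```r
# Sketch only: runs when ppclust is installed; argument values are illustrative
if (requireNamespace("ppclust", quietly = TRUE)) {
  x <- iris[, -5]
  res <- ppclust::hcm(x, centers = 3, nstart = 5, iter.max = 100,
                      stand = TRUE, numseed = 123)
  print(res$best.start)  # index of the start with the lowest objective value
  print(res$func.val)    # objective function value reached by each start
  print(res$csize)       # number of objects in each cluster
}
```

Passing an integer to centers leaves the choice of initial prototypes to alginitv, whereas passing a matrix fixes them explicitly.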

Value

an object of class ‘ppclust’, which is a list consisting of the following items:

x

a numeric matrix containing the processed data set.

v

a numeric matrix containing the final cluster prototypes (centers of clusters).

u

a numeric matrix containing the hard membership degrees of the data objects.

d

a numeric matrix containing the distances of objects to the final cluster prototypes.

k

an integer for the number of clusters.

cluster

a numeric vector containing the cluster labels of the data objects.

csize

a numeric vector containing the number of objects in the clusters.

best.start

an integer for the index of the start with the minimum objective function value.

iter

an integer vector for the number of iterations in each start of the algorithm.

func.val

a numeric vector for the objective function values of each start of the algorithm.

comp.time

a numeric vector for the execution time of each start of the algorithm.

wss

a numeric vector containing the within-cluster sum of squares for each cluster.

bwss

a number for the between-cluster sum of squares.

tss

a number for the total sum of squares.

twss

a number for the total within-cluster sum of squares.

stand

a logical value. TRUE indicates that the data set x contains the standardized values of the raw data.

algorithm

a string for the name of the partitioning algorithm. It is ‘HCM’ for this function.

call

a string for the matched function call generating this ‘ppclust’ object.
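The sums of squares listed above are related in the same way as in standard K-means output: the total sum of squares splits into a within-cluster part and a between-cluster part. The base R sketch below checks that identity on the iris data. It assumes squared Euclidean distances and cluster means as prototypes; the variable names are illustrative, and the labels come from stats::kmeans only to keep the block free of package dependencies.

```r
x <- as.matrix(iris[, -5])
set.seed(1)
cl <- kmeans(x, centers = 3)$cluster      # any hard cluster labels would do

sizes   <- as.vector(table(cl))           # number of objects per cluster
centers <- rowsum(x, cl) / sizes          # k x p matrix of cluster means
grand   <- colMeans(x)                    # overall mean of the data

# Within-cluster sum of squares, one value per cluster
within_ss <- sapply(seq_len(nrow(centers)), function(j) {
  sum(sweep(x[cl == j, , drop = FALSE], 2, centers[j, ])^2)
})
total_within_ss <- sum(within_ss)

# Between-cluster sum of squares: size-weighted spread of the cluster means
between_ss <- sum(sizes * rowSums(sweep(centers, 2, grand)^2))

# Total sum of squares around the overall mean
total_ss <- sum(sweep(x, 2, grand)^2)

all.equal(total_ss, total_within_ss + between_ss)
```

With other distance metrics, or with prototypes that are not cluster means, this decomposition is not guaranteed to hold.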

Details

The Hard C-Means (HCM) clustering algorithm (or K-means) partitions a data set into k groups, called clusters. The objective function of HCM is:

\(J_{HCM}(\mathbf{X}; \mathbf{V})=\sum\limits_{i=1}^n \sum\limits_{j=1}^k u_{ij} \, d^2(\vec{x}_i, \vec{v}_j)\)

See ppclust-package for the details about the terms in the above equation of \(J_{HCM}\).

The update equation for membership degrees is:

\(u_{ij} = \left\{ \begin{array}{rl} 1 & \text{if } d^2(\vec{x}_i, \vec{v}_j) = \min_{1\leq l\leq k} d^2(\vec{x}_i, \vec{v}_l) \\ 0 & \text{otherwise} \end{array} \right. \)

The update equation for cluster prototypes is:

\(\vec{v}_{j} =\frac{\sum\limits_{i=1}^n u_{ij} \vec{x}_i}{\sum\limits_{i=1}^n u_{ij}} \;\;; {1\leq j\leq k}\)
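The two update equations can be sketched in base R as an alternating loop. This is a minimal illustration under stated assumptions, not the ppclust implementation: the function name hcm_sketch, the squared Euclidean distance, the first-index tie-breaking rule and the stopping rule on the change in the objective are all choices made here for the example.

```r
# Minimal HCM sketch in base R (hypothetical helper, not part of ppclust)
hcm_sketch <- function(x, v, iter.max = 100, con.val = 1e-9) {
  x <- as.matrix(x)
  v <- as.matrix(v)
  n <- nrow(x)
  k <- nrow(v)
  obj_prev <- Inf
  for (iter in seq_len(iter.max)) {
    # n x k matrix of squared Euclidean distances d^2(x_i, v_j)
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, v[j, ])^2),
                 numeric(n))
    # Membership update: u_ij = 1 for the nearest prototype, 0 otherwise
    cluster <- max.col(-d2, ties.method = "first")
    u <- matrix(0, n, k)
    u[cbind(seq_len(n), cluster)] <- 1
    # Objective: membership-weighted sum of squared distances
    obj <- sum(u * d2)
    # Prototype update: mean of the objects assigned to each cluster
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        v[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    # Stop when the objective no longer changes noticeably
    if (abs(obj_prev - obj) < con.val) break
    obj_prev <- obj
  }
  list(v = v, u = u, cluster = cluster, obj = obj, iter = iter)
}

x  <- as.matrix(iris[, -5])
v0 <- x[c(1, 51, 101), ]   # starting prototypes picked by hand for illustration
res <- hcm_sketch(x, v0)
table(res$cluster)
```

A cluster that receives no objects keeps its previous prototype in this sketch; a production implementation may handle empty clusters differently.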

References

Arthur, D. & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding, in Proc. of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, p. 1027-1035. <http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf>

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symp. on Mathematical Statistics and Probability, Berkeley, Univ. of California Press, 1: 281-297. <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.308.8619&rep=rep1&type=pdf>

Examples

# NOT RUN {
# Load dataset iris 
data(iris)
x <- iris[,-5]

# Initialize the prototype matrix using K-means++
v <- inaparc::kmpp(x, k=3)$v

# Run HCM with the initial prototypes
res.hcm <- hcm(x, centers=v)

# Print, summarize and plot the clustering result
res.hcm$cluster
summary(res.hcm$cluster)
plot(x, col=res.hcm$cluster, pch=16)
# }