
Perform k-means clustering on a data matrix.
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                     "MacQueen"), trace = FALSE)
# S3 method for kmeans
fitted(object, method = c("centers", "classes"), ...)
x
numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
centers
either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
iter.max
the maximum number of iterations allowed.
nstart
if centers is a number, how many random sets should be chosen?
algorithm
character: may be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm.
object
an R object of class "kmeans", typically the result ob of ob <- kmeans(..).
method
character: may be abbreviated. "centers" causes fitted to return cluster centers (one for each input point) and "classes" causes fitted to return a vector of class assignments.
trace
logical or integer number, currently only used in the default method ("Hartigan-Wong"): if positive (or true), tracing information on the progress of the algorithm is produced. Higher values may produce more tracing information.
...
not used.
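The two ways of specifying centers, and the abbreviation of algorithm and method, can be sketched as follows. The toy data, the seed, and the names fit_k, fit_init, fit_mq and init are assumptions made for illustration only.

```r
set.seed(42)  # assumed seed, used only to make the sketch reproducible

# Hypothetical toy data: two well-separated groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# centers given as a number: random rows of x serve as initial centres,
# and nstart = 10 tries 10 such random sets, keeping the best fit
fit_k <- kmeans(x, centers = 2, nstart = 10)

# centers given as a matrix: its (distinct) rows are the initial centres
init <- rbind(c(0, 0), c(3, 3))
fit_init <- kmeans(x, centers = init, iter.max = 20, algorithm = "Lloyd")

# algorithm may be abbreviated; "Mac" is matched to "MacQueen"
fit_mq <- kmeans(x, centers = 2, algorithm = "Mac")

# fitted() returns one centre per input row, or the class assignments
dim(fitted(fit_k, method = "centers"))
head(fitted(fit_k, method = "classes"))
```

Supplying a matrix fixes the starting point, so nstart is only relevant when centers is a number.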
kmeans returns an object of class "kmeans" which has a print and a fitted method. It is a list with at least the following components:
cluster
A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers
A matrix of cluster centres.
totss
The total sum of squares.
withinss
Vector of within-cluster sum of squares, one component per cluster.
tot.withinss
Total within-cluster sum of squares, i.e. sum(withinss).
betweenss
The between-cluster sum of squares, i.e. totss - tot.withinss.
size
The number of points in each cluster.
iter
The number of (outer) iterations.
ifault
integer: indicator of a possible algorithm problem (for experts).
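As a brief illustration of how the returned components relate to each other, the following sketch fits a model to simulated data and checks the two identities stated above. The data, the seed, and the object name cl are assumptions chosen for the example.

```r
set.seed(1)  # assumed seed, for reproducibility of the sketch only

# Hypothetical data: two groups of 30 points in two dimensions
x <- rbind(matrix(rnorm(60, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 2, sd = 0.3), ncol = 2))
cl <- kmeans(x, centers = 2, nstart = 5)

names(cl)            # components of the returned list
table(cl$cluster)    # cluster sizes, matching cl$size
cl$centers           # one row per cluster centre

# tot.withinss is the sum of the per-cluster within sums of squares
all.equal(cl$tot.withinss, sum(cl$withinss))

# betweenss is what remains of totss after removing tot.withinss
all.equal(cl$betweenss, cl$totss - cl$tot.withinss)

# share of the total sum of squares accounted for by the clustering
cl$betweenss / cl$totss
```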
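The data given by x are clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centres is minimized. At the minimum, all cluster centres are at the mean of their Voronoi sets (the set of data points which are nearest to the cluster centre).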
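The quantity that k-means minimizes, the sum of squared distances from each point to its assigned centre, can be recomputed by hand and compared with the fitted object. This is a minimal sketch on simulated data; the helper wss and the object names are hypothetical and not part of the stats package.

```r
set.seed(2)  # assumed seed, for reproducibility of the sketch only

# Hypothetical data: two groups of 30 points in two dimensions
x <- rbind(matrix(rnorm(60, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 2, sd = 0.3), ncol = 2))
cl <- kmeans(x, centers = 2, nstart = 5)

# Hypothetical helper: sum of squared distances from every row of x
# to the centre of the cluster it is assigned to
wss <- function(x, centers, cluster) {
  sum((x - centers[cluster, , drop = FALSE])^2)
}

wss(x, cl$centers, cl$cluster)  # agrees with cl$tot.withinss
cl$tot.withinss

# At the solution, each centre is the mean of the points assigned to it
means <- apply(x, 2, function(col) tapply(col, cl$cluster, mean))
all.equal(unname(means), unname(cl$centers))
```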
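The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan–Wong algorithm generally does a better job than either of those, but trying several random starts (nstart > 1) is often recommended. In rare cases, when some of the points (rows of x) are extremely close, the algorithm may not converge in the “Quick-Transfer” stage, signalling a warning (and returning ifault = 4). Slight rounding of the data may be advisable in that case.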
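The effect of several random starts, and one possible way of reacting to ifault = 4, can be sketched as below. The wrapper kmeans_checked and its digits argument are hypothetical names introduced for illustration; they are not part of the stats package, and the rounding precision is an arbitrary assumption.

```r
set.seed(3)  # assumed seed, for reproducibility of the sketch only

# Hypothetical data: two overlapping groups, deliberately fitted with 5 clusters
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# A single random start versus the best of 25 random starts
one  <- kmeans(x, centers = 5, nstart = 1)
many <- kmeans(x, centers = 5, nstart = 25)
c(one = one$tot.withinss, many = many$tot.withinss)

# Hypothetical wrapper: if the first fit reports ifault == 4,
# refit on slightly rounded data, as the text above suggests.
# Note that suppressWarnings() hides every warning from the first fit,
# not only the Quick-Transfer one.
kmeans_checked <- function(x, centers, digits = 6, ...) {
  fit <- suppressWarnings(kmeans(x, centers, ...))
  if (!is.null(fit$ifault) && fit$ifault == 4L) {
    fit <- kmeans(round(x, digits), centers, ...)
  }
  fit
}

fit <- kmeans_checked(x, 5, nstart = 25)
fit$ifault
```

Comparing tot.withinss across fits is the usual way to judge whether extra random starts found a better partition.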
For ease of programmatic exploration, k = 1 is allowed, notably returning the center and withinss.
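Except for the Lloyd–Forgy method, k clusters will always be returned if a number is specified. If an initial matrix of centres is supplied, it is possible that no point will be closest to one or more centres, which is currently an error for the Hartigan–Wong method.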
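The single-cluster case and the effect of a poorly chosen matrix of initial centres can be explored with a short sketch. The data, the seed, and the names one, bad_init and res are assumptions for illustration; the second part simply captures whatever the call produces rather than relying on a particular message.

```r
set.seed(4)  # assumed seed, for reproducibility of the sketch only

# Hypothetical data: 30 points in two dimensions
x <- matrix(rnorm(60), ncol = 2)

# A single cluster: the centre is the vector of column means and
# withinss is the total sum of squares about that mean
one <- kmeans(x, centers = 1)
one$centers
colMeans(x)
one$withinss
sum(scale(x, scale = FALSE)^2)

# Initial centres supplied as a matrix, with one centre placed far away
# from every point, so that no point is closest to it.
# tryCatch() returns the condition object if the call fails.
bad_init <- rbind(c(0, 0), c(100, 100))
res <- tryCatch(kmeans(x, centers = bad_init), error = function(e) e)
class(res)
```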
Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 21, 768–769.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108. doi:10.2307/2346830.
Lloyd, S. P. (1957, 1982). Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory, 28, 128–137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
# NOT RUN {
require(graphics)
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
# sum of squares
ss <- function(x) sum(scale(x, scale = FALSE)^2)
## cluster centers "fitted" to each obs.:
fitted.x <- fitted(cl); head(fitted.x)
resid.x <- x - fitted(cl)
## Equalities : ----------------------------------
cbind(cl[c("betweenss", "tot.withinss", "totss")], # the same two columns
      c(ss(fitted.x), ss(resid.x), ss(x)))
stopifnot(all.equal(cl$totss, ss(x)),
          all.equal(cl$tot.withinss, ss(resid.x)),
          ## these three are the same:
          all.equal(cl$betweenss, ss(fitted.x)),
          all.equal(cl$betweenss, cl$totss - cl$tot.withinss),
          ## and hence also
          all.equal(ss(x), ss(fitted.x) + ss(resid.x))
          )
kmeans(x,1)$withinss # trivial one-cluster, (its W.SS == ss(x))
## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(x, 5, nstart = 25))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)
# }