k-means cluster analysis without the memory overhead, and possibly in parallel using shared memory.
bigkmeans(x, centers, iter.max = 10, nstart = 1, dist = "euclid")
An object of class kmeans, just as produced by kmeans.
x: a big.matrix object.
centers: a scalar denoting the number of clusters, or for k clusters, a k by ncol(x) matrix.
iter.max: the maximum number of iterations.
nstart: number of random starts, to be done in parallel if there is a registered backend (see below).
dist: the distance function. Can be "euclid" or "cosine".
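A minimal sketch of the two ways to supply centers, assuming the bigmemory and biganalytics packages are installed. The simulated data and the object names (m, x, start, fit_k, fit_m) are illustrative only and are not part of the package.

```r
library(bigmemory)     # provides big.matrix and as.big.matrix()
library(biganalytics)  # provides bigkmeans()

set.seed(1)
# Illustrative data: two well-separated groups of 100 rows, 2 columns.
m <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 10), ncol = 2))
x <- as.big.matrix(m)

# centers given as a scalar: the number of clusters.
fit_k <- bigkmeans(x, centers = 2, iter.max = 10, nstart = 1)

# centers given as a k by ncol(x) matrix of starting centers.
start <- rbind(c(0, 0), c(10, 10))
fit_m <- bigkmeans(x, centers = start)

class(fit_k)           # the result is a kmeans object
table(fit_m$cluster)   # cluster membership counts
```

Passing dist = "cosine" to the same calls would select the cosine distance instead of the default "euclid".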
The real benefit is the lack of memory overhead compared to the standard kmeans function. Part of the overhead from kmeans() stems from the way it looks for unique starting centers, and could be improved upon. The bigkmeans() function works on either regular R matrix objects or on big.matrix objects. In either case, it requires no extra memory beyond the data itself and a vector recording the cluster memberships, whereas kmeans() makes at least two extra copies of the data. The overhead of kmeans() is even worse if multiple starts (nstart>1) are used. If nstart>1 and you are using bigkmeans() in parallel, a vector of cluster memberships will need to be stored for each worker, which could be memory-intensive for large data. This isn't a problem if you are running the multiple starts sequentially.
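The following sketch contrasts parallel and sequential multiple starts. It assumes the doParallel package as the registered backend, which is only one possible choice and is not named in this page; the data and object names are illustrative.

```r
library(bigmemory)
library(biganalytics)
library(doParallel)   # assumption: one possible foreach backend

set.seed(1)
# Illustrative data: 500 rows, 4 columns.
x <- as.big.matrix(matrix(rnorm(2000), ncol = 4))

# Parallel: with a backend registered, the nstart random starts are
# spread over the workers; each worker holds its own membership vector.
registerDoParallel(cores = 2)
fit_par <- bigkmeans(x, centers = 3, nstart = 4)

# Sequential: the starts run one after another, so only one
# membership vector is needed at a time.
foreach::registerDoSEQ()
fit_seq <- bigkmeans(x, centers = 3, nstart = 4)

length(fit_par$cluster)  # one membership per row of x
```

Because the starting centers are random, the two fits need not agree exactly.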
Unless you have a really big data set (where a single run of kmeans not only burns memory but takes more than a few seconds), use of parallel computing for multiple random starts is unlikely to be much faster than running the starts sequentially.
Only the algorithm by MacQueen is used here.
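When comparing results against the standard kmeans() function, it may help to request the same algorithm there, since kmeans() defaults to Hartigan-Wong. A small sketch in base R, with illustrative data and names:

```r
set.seed(1)
# Illustrative data: two groups of 50 rows, 2 columns.
m <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

# Request MacQueen explicitly rather than the Hartigan-Wong default.
fit_mq <- kmeans(m, centers = 2, iter.max = 10, algorithm = "MacQueen")

fit_mq$size     # number of rows assigned to each cluster
fit_mq$centers  # fitted cluster centers
```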