
Description:

Computes a "clara" object, a list representing a clustering of the data into k clusters.
Usage:

clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
      sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE,
      keep.data = medoids.x, rngR = FALSE, pamLike = FALSE, correct.d = TRUE)
Arguments:

x: data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed.
k: integer, the number of clusters. It is required that 0 < k < n where n = nrow(x).
metric: character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are "euclidean" and "manhattan". Euclidean distances are root sum-of-squares of differences, and Manhattan distances are the sum of absolute differences.
stand: logical, indicating if the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.
samples: integer, number of samples to be drawn from the dataset. The default, 5, is rather small for historical (and now back-compatibility) reasons; we recommend setting samples an order of magnitude larger.
sampsize: integer, number of observations in each sample. sampsize should be larger than the number of clusters (k) and at most the number of observations (n = nrow(x)).
trace: integer indicating a trace level for diagnostic output during the algorithm.
medoids.x: logical indicating if the medoids should be returned, identically to some rows of the input data x. If FALSE, keep.data must be false as well; the medoid indices, i.e., the row numbers of the medoids, are still returned (the i.med component), and the algorithm saves space by needing one less copy of x.
keep.data: logical indicating if the (scaled, if stand is true) data should be kept in the result. Setting this to FALSE saves memory (and hence time), but disables clusplot()ing of the result. Use medoids.x = FALSE to save even more memory.
rngR: logical indicating if R's random number generator should be used instead of the primitive clara()-builtin one. If true, each call to clara() returns a different result -- though only slightly different in good situations.
pamLike: logical indicating if the “swap” phase (see pam, in C code) should use the same algorithm as pam(). Note that from Kaufman and Rousseeuw's description this should always have been true, but as the original Fortran code and the subsequent port to C have always contained a small one-letter change (a typo, according to Martin Maechler) with respect to PAM, the default pamLike = FALSE has been chosen to remain back compatible rather than “PAM compatible”.
correct.d: logical or integer indicating that---only in the case of NAs present in x---the correct distance computation should be used instead of the wrong formula which was present in the original Fortran code and in use up to early 2016. Because the new correct formula is not back compatible, for the time being a warning is signalled in this case unless the user explicitly specifies correct.d.
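The standardization applied when stand = TRUE can be reproduced by hand; the following small sketch (plain R, independent of clara() itself) centers each column by its mean and scales it by its mean absolute deviation:

```r
## Reproduce the stand = TRUE standardization by hand (illustrative data):
## subtract each column's mean, divide by its mean absolute deviation.
x <- cbind(a = c(1, 2, 3, 10), b = c(0, 5, 5, 10))
std <- apply(x, 2, function(v) (v - mean(v)) / mean(abs(v - mean(v))))
std[, "a"]  # column a has mean 4 and mean absolute deviation 3
```

Note that this divides by the mean absolute deviation, not by the standard deviation that scale() uses by default.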
Value:

an object of class "clara" representing the clustering. See clara.object for details.
Details:

clara is fully described in chapter 3 of Kaufman and Rousseeuw (1990). Compared to other partitioning methods such as pam, it can deal with much larger datasets. Internally, this is achieved by considering sub-datasets of fixed size (sampsize) such that the time and storage requirements become linear in n, rather than quadratic.
Each sub-dataset is partitioned into k clusters using the same algorithm as in pam. Once k representative objects have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid.
The mean (equivalent to the sum) of the dissimilarities of the observations to their closest medoid is used as a measure of the quality of the clustering. The sub-dataset for which the mean (or sum) is minimal is retained. A further analysis is carried out on the final partition.
Each sub-dataset is forced to contain the medoids obtained from the best sub-dataset until then. Randomly drawn observations are added to this set until sampsize has been reached.
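The strategy described above can be sketched in a few lines of R. This is a conceptual illustration only, not the C implementation used by clara(): it calls pam() from the cluster package on each sample, assumes Euclidean distances, and omits clara()'s NA handling and standardization.

```r
library(cluster)  # for pam()

## Conceptual sketch of CLARA: run PAM on several small samples and keep
## the medoids with the smallest mean dissimilarity over the *full* data.
clara_sketch <- function(x, k, samples = 5,
                         sampsize = min(nrow(x), 40 + 2 * k)) {
  n <- nrow(x)
  best <- list(obj = Inf, med = NULL)
  for (s in seq_len(samples)) {
    ## each sample is forced to contain the best medoids found so far
    idx <- union(best$med, sample(n, sampsize))[seq_len(sampsize)]
    med <- idx[pam(x[idx, , drop = FALSE], k)$id.med]
    ## Euclidean distance of every observation to each candidate medoid
    d <- sapply(med, function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)))
    obj <- mean(apply(d, 1, min))  # quality measure for this sample
    if (obj < best$obj)
      best <- list(obj = obj, med = med, clustering = apply(d, 1, which.min))
  }
  best
}
```

With well-separated data the winning medoids typically stabilize after a few samples; the linear cost in n comes from only ever running PAM on sampsize observations.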
See Also:

agnes for background and references; clara.object, pam, partition.object, plot.partition.
Examples:
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2, samples=50)
clarax
clarax$clusinfo
## using pamLike=TRUE gives the same (apart from the 'call'):
all.equal(clarax[-8],
          clara(x, 2, samples=50, pamLike = TRUE)[-8])
plot(clarax)
## `xclara' is an artificial data set with 3 clusters of 1000 bivariate
## objects each.
data(xclara)
(clx3 <- clara(xclara, 3))
## "better" number of samples
cl.3 <- clara(xclara, 3, samples=100)
## but that did not change the result here:
stopifnot(cl.3$clustering == clx3$clustering)
## Plot similar to Figure 5 in Struyf et al (1996)
## Not run:
plot(clx3, ask = TRUE)
## End(Not run)

## Not run:
## Try 100 times *different* random samples -- for reliability:
nSim <- 100
nCl <- 3 # = no.classes
set.seed(421)# (reproducibility)
cl <- matrix(NA,nrow(xclara), nSim)
for(i in 1:nSim)
  cl[,i] <- clara(xclara, nCl, medoids.x = FALSE, rngR = TRUE)$cluster
tcl <- apply(cl,1, tabulate, nbins = nCl)
## those that are not always in same cluster (5 out of 3000 for this seed):
(iDoubt <- which(apply(tcl,2, function(n) all(n < nSim))))
if(length(iDoubt)) { # (not for all seeds)
  tabD <- tcl[,iDoubt, drop=FALSE]
  dimnames(tabD) <- list(cluster = paste(1:nCl), obs = format(iDoubt))
  t(tabD) # how many times in which clusters
}
## End(Not run)