silhouette: Compute or Extract Silhouette Information from Clustering

Description

Compute silhouette information according to a given clustering in $k$ clusters.

Usage

silhouette(x, ...)
## S3 method for class 'default':
silhouette(x, dist, dmatrix, ...)
## S3 method for class 'partition':
silhouette(x, ...)
sortSilhouette(object, ...)
## S3 method for class 'silhouette':
summary(object, FUN = mean, ...)
## S3 method for class 'silhouette':
plot(x, nmax.lab = 40, max.strlen = 5,
     main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
     col = "gray",  do.col.sort = length(col) > 1, border = 0,
     cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)

Arguments

x

an object of appropriate class; for the default method, an integer vector with $k$ different integer cluster codes or a list with such an x$clustering component. Note that silhouette statistics are only defined if

dist

a dissimilarity object inheriting from class dist or coercible to one. If not specified, dmatrix must be.

dmatrix

a symmetric dissimilarity matrix ($n \times n$), specified instead of dist, which can be more efficient.

object

an object of class silhouette.

...

further arguments passed to and from methods.

FUN

function used to summarize silhouette widths.

nmax.lab

integer indicating the number of labels that is considered too large for single-name labeling of the silhouette plot.

max.strlen

positive integer giving the length to which strings are truncated in silhouette plot labeling.

main, sub, xlab

arguments to title; have a sensible non-NULL default here.

col, border, cex.names

arguments passed to barplot(); note that the default used to be col = heat.colors(n), border = par("fg") instead. col can also be a color vector of length $k$ for clusterwise coloring.

do.col.sort

logical indicating if the colors col should be sorted "along" the silhouette; this is useful for casewise or clusterwise coloring.

do.n.k

logical indicating if $n$ and $k$ "title text" should be written.

do.clus.stat

logical indicating if cluster sizes and averages should be written to the right of the silhouettes.
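
As a rough illustration of the dist, dmatrix and FUN arguments described above, the following sketch calls the default method both ways on the ruspini data and then summarizes with the median instead of the mean. The object names (cl, d, sil_dist, sil_dmat, med) are chosen here for illustration only.

```r
library(cluster)   # provides pam(), silhouette() and the ruspini data

data(ruspini)
cl <- pam(ruspini, 4)$clustering   # integer cluster codes, one per observation
d  <- dist(ruspini)                # Euclidean dissimilarities

# The same silhouette, specified once via 'dist' and once via 'dmatrix'
sil_dist <- silhouette(cl, dist = d)
sil_dmat <- silhouette(cl, dmatrix = as.matrix(d))
all.equal(as.numeric(sil_dist[, "sil_width"]),
          as.numeric(sil_dmat[, "sil_width"]))

# Clusterwise medians instead of means
med <- summary(sil_dist, FUN = median)
med$clus.avg.widths
```

Supplying dmatrix is mainly of interest when the full symmetric matrix already exists, since it avoids converting a dist object.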

Value

silhouette() returns an object, sil, of class silhouette, which is an $n \times 3$ matrix with attributes. For each observation i, sil[i,] contains the cluster to which i belongs as well as the neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal), and the silhouette width $s(i)$ of the observation. The colnames correspondingly are c("cluster", "neighbor", "sil_width").

summary(sil) returns an object of class summary.silhouette, a list with components

si.summary

numerical summary of the individual silhouette widths $s(i)$.

clus.avg.widths

numeric (rank 1) array of clusterwise means of silhouette widths where mean = FUN is used.

avg.width

the total mean FUN(s) where s are the individual silhouette widths.

clus.sizes

table of the $k$ cluster sizes.

call

if available, the call creating sil.

Ordered

logical identical to attr(sil, "Ordered"), see below.

sortSilhouette(sil) orders the rows of sil as in the silhouette plot, by cluster (increasingly) and decreasing silhouette width $s(i)$. attr(sil, "Ordered") is a logical indicating if sil is ordered as by sortSilhouette(). In that case, rownames(sil) will contain case labels or numbers, and attr(sil, "iOrd") the ordering index vector.
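
The returned objects can be inspected roughly as follows. This is a minimal sketch on the ruspini data that uses the default method, so that the result is not yet ordered; the variable names (pr4, sil, widths, ssil, sorted) are illustrative only.

```r
library(cluster)   # provides pam(), silhouette() and the ruspini data

data(ruspini)
pr4 <- pam(ruspini, 4)
sil <- silhouette(pr4$clustering, dist(ruspini))

dim(sil)                      # n rows, 3 columns
colnames(sil)                 # "cluster", "neighbor", "sil_width"
widths <- sil[, "sil_width"]  # individual silhouette widths s(i)

ssil <- summary(sil)
ssil$clus.sizes               # table of the k cluster sizes
ssil$clus.avg.widths          # clusterwise means of s(i)
ssil$avg.width                # FUN applied to all s(i), here the mean

sorted <- sortSilhouette(sil)
attr(sorted, "Ordered")       # TRUE once sorted
head(attr(sorted, "iOrd"))    # ordering index vector
```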

Details

For each observation i, the silhouette width $s(i)$ is defined as follows: Put $a(i)$ = average dissimilarity between i and all other points of the cluster to which i belongs. For all other clusters $C$, put $d(i,C)$ = average dissimilarity of i to all observations of $C$. The smallest of these $d(i,C)$ is $b(i) := \min_C d(i,C)$, and can be seen as the dissimilarity between i and its "neighbor" cluster, i.e., the nearest one to which it does not belong. Finally, $$s(i) := \frac{b(i) - a(i)}{\max(a(i), b(i))}.$$
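
To make the definition concrete, here is a small base-R sketch that computes $s(i)$ directly from $a(i)$ and $b(i)$. The helper manual_silhouette() is hypothetical and not part of the cluster package; setting $s(i) = 0$ for an observation that is alone in its cluster is an assumption made here, and the toy data are invented for illustration.

```r
# Hypothetical helper (not from the cluster package): silhouette widths
# computed literally from the definitions of a(i), b(i) and s(i).
manual_silhouette <- function(clustering, dmat) {
  dmat <- as.matrix(dmat)
  n    <- length(clustering)
  ids  <- sort(unique(clustering))
  s    <- numeric(n)
  for (i in seq_len(n)) {
    own    <- clustering == clustering[i]
    own[i] <- FALSE                       # "all other points" of i's cluster
    if (!any(own)) {                      # singleton cluster: assume s(i) = 0
      s[i] <- 0
      next
    }
    a <- mean(dmat[i, own])               # a(i)
    others <- ids[ids != clustering[i]]
    d_iC <- vapply(others,
                   function(C) mean(dmat[i, clustering == C]),
                   numeric(1))            # d(i, C) for every other cluster C
    b <- min(d_iC)                        # b(i)
    s[i] <- (b - a) / max(a, b)           # s(i)
  }
  s
}

# Toy data: two well-separated groups on a line
x  <- c(0, 1, 2, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
s  <- manual_silhouette(cl, dist(x))
round(s, 3)
```

For the point at 1, $a(i) = 1$ and $b(i) = 10$, so $s(i) = 9/10 = 0.9$, which is what the sketch returns for the second observation.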

Observations with a large $s(i)$ (almost 1) are very well clustered; a small $s(i)$ (around 0) means that the observation lies between two clusters; and observations with a negative $s(i)$ are probably placed in the wrong cluster.
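
One way to act on this interpretation is to list the observations with negative silhouette width together with their current and neighbor clusters. The sketch below does this for a two-group agnes() clustering of the ruspini data; the names sil, w and suspects are illustrative, and whether any observation is flagged depends on the clustering.

```r
library(cluster)   # provides agnes(), silhouette() and the ruspini data

data(ruspini)
sil <- silhouette(cutree(agnes(ruspini), k = 2), dist(ruspini))
w   <- sil[, "sil_width"]

# Observations that are, on average, closer to their neighbor cluster
suspects <- which(w < 0)
suspects
sil[w < 0, c("cluster", "neighbor"), drop = FALSE]
```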

References

Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53-65.

chapter 2 of Kaufman, L. and Rousseeuw, P.J. (1990), see the references in plot.agnes.

Examples

library(cluster)  # provides pam(), agnes(), daisy(), silhouette() and ruspini
data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot

## Default method: cluster codes plus a dissimilarity object
si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax.lab = 80, cex.names = 0.6)

## Silhouette plots for k = 2, ..., 6 on one page
par(mfrow = c(3,2), oma = c(0,0, 3, 0))
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main"))

## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                  daisy(ruspini))
plot(si3, nmax.lab = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax.lab = 80, cex.names = 0.5)
