dbs: Density-based silhouette information methods

Description

Computes the density-based silhouette information of clustered data. Two methods are associated with this function: the first takes two arguments, the matrix of data and the vector of cluster labels; the second applies to objects of pdfCluster-class.

Usage

# S4 method for matrix
dbs(x, clusters, h.funct="h.norm", hmult=1, prior, ...)
# S4 method for pdfCluster
dbs(x, h.funct="h.norm", hmult = 1, prior = 
   as.vector(table(x@cluster.cores)/sum(table(x@cluster.cores))), 
   stage=NULL, ...)

Value

An object of class "dbs", with slots:

call: The matched call.
x: The matrix of clustered data points.
prior: The vector of prior probabilities of belonging to the groups.
dbs: A vector reporting the density-based silhouette information of the clustered data.
clusters: Cluster labels of grouped data.
noc: Number of clusters.
stage: If argument x of dbs is a pdfCluster-class object, this slot provides the stage of the classification at which the dbs is computed.

See dbs-class for more details.

Arguments

x: A matrix of data points partitioned by any density-based clustering method or an object of pdfCluster-class.
clusters: Cluster labels of grouped data. This argument need not be set when x is a pdfCluster-class object.
h.funct: Function to estimate the smoothing parameters. Default is h.norm.
hmult: Shrink factor to be multiplied by the smoothing parameters. Default value is 1.
prior: Vector of prior probabilities of belonging to the groups. When x is of pdfCluster-class, default value is set proportional to the cluster cores cardinalities. Otherwise, equal prior probabilities are given to the clusters by default.
stage: When x is a pdfCluster-class object, this is the stage of classification of low-density data at which the dbs has to be computed. Default value is the number of stages of the procedure. Set it to 0 if the dbs has to be computed at cluster cores only.
...: Further arguments to be passed to methods (see dbs-methods) or arguments to kepdf. See details below.
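
The default prior for pdfCluster-class objects, as written in the Usage section, is the share of core points falling in each cluster. A small base R illustration of that expression is sketched here; the label vector `cores` is hypothetical, and the use of NA for points outside any cluster core is an assumption of this sketch.

```r
# Hypothetical cluster-core labels for 10 points (NA = point not in any core;
# this coding is an assumption made for illustration).
cores <- c(1, 1, 2, NA, 2, 2, 3, NA, 1, 2)

# Same expression as the default 'prior' in the Usage section,
# with 'cores' standing in for x@cluster.cores.
# table() drops NA by default, so only core points are counted.
prior <- as.vector(table(cores) / sum(table(cores)))
prior
# 0.375 0.500 0.125
```

Clusters with larger cores therefore receive proportionally larger prior weight, whereas the matrix method gives every cluster the same weight by default.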

Methods

signature(x = "matrix", clusters = "numeric")

Computes the density-based silhouette information for objects partitioned according to any density-based clustering method.

signature(x = "pdfCluster", clusters = "missing")

Computes the density-based silhouette information for objects of class "pdfCluster".

Details

This function provides diagnostics for a clustering produced by any density-based clustering method. The dbs information is a modification of the silhouette information aimed at evaluating cluster quality in a density-based framework. It is based on estimates of the posterior probabilities that the data points belong to the clusters, and it may be used to measure the quality of the allocation of data to the clusters. High values of $\hat{dbs}$ are evidence of a good-quality clustering.

Define $$ \hat{\tau}_m(x_i)=\frac{\pi_{m} \hat{f}(x_i|x_i \in m)}{\sum_{k=1}^M \pi_{k}\hat{f}(x_i|x_i \in k)} \quad m=1,\ldots,M, $$

where $\pi_{m}$ is the prior probability of group $m$ and $\hat{f}(x_i|x_i \in m)$ is a density estimate at $x_i$, computed with function kepdf using only the data points in $m$. Density estimation is performed with fixed bandwidths h, as evaluated by function h.funct, possibly multiplied by the shrink factor hmult.
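
To make the definition concrete, the posterior probabilities can be sketched in base R for univariate data. This is an illustration under stated assumptions, not the package's implementation: `kde_gauss`, `h_ref` and `posterior_tau` are hypothetical helpers, a plain Gaussian kernel stands in for kepdf, and a normal-reference bandwidth computed separately within each group stands in for h.funct.

```r
# Univariate Gaussian kernel density estimate at points 'at', built from 'obs'.
kde_gauss <- function(at, obs, h) {
  sapply(at, function(a) mean(dnorm((a - obs) / h)) / h)
}

# Normal-reference bandwidth for one dimension (assumed stand-in for h.funct).
h_ref <- function(obs) sd(obs) * (4 / (3 * length(obs)))^(1 / 5)

# Posterior probabilities: tau[i, m] estimates P(point i belongs to group m).
posterior_tau <- function(x, clusters, prior = NULL, hmult = 1) {
  groups <- sort(unique(clusters))
  M <- length(groups)
  if (is.null(prior)) prior <- rep(1 / M, M)   # equal priors by default
  # Column m holds the density of every point, estimated from group m only.
  dens <- sapply(groups, function(g) {
    obs <- x[clusters == g]
    kde_gauss(x, obs, hmult * h_ref(obs))
  })
  num <- sweep(dens, 2, prior, `*`)            # pi_m * f_hat(x_i | group m)
  num / rowSums(num)                           # normalise over the M groups
}

set.seed(1)
x <- c(rnorm(30, mean = 0), rnorm(30, mean = 5))
clusters <- rep(1:2, each = 30)
tau <- posterior_tau(x, clusters)
head(round(tau, 3))
```

Each row of `tau` sums to one, and for well-separated groups the entry for the allocated group is close to one.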

The density-based silhouette information of $x_i$, the $i^{th}$ row of the data matrix x, is defined as follows: $$ \hat{dbs}_i=\frac{\log\left(\frac{\hat{\tau}_{m_{0}}(x_i)}{\hat{\tau}_{m_{1}}(x_i)}\right)}{{\max}_{x_i }\left| \log\left(\frac{\hat{\tau}_{m_{0}}(x_i)}{\hat{\tau}_{m_{1}}(x_i)}\right)\right|}, $$ where $m_0$ is the group to which $x_i$ has been allocated and $m_1$ is the group for which $\hat{\tau}_m(x_i)$ is maximum over $m\neq m_0$.

Note: when there exists $x_j$ such that $\hat{\tau}_{m_{1}}(x_j)$ is zero, $\hat{dbs}_j$ is forced to 1 and ${\max}_{x_i }\left| \log\left(\frac{\hat{\tau}_{m_{0}}(x_i)}{\hat{\tau}_{m_{1}}(x_i)}\right)\right|$ is computed by excluding $x_j$ from the data matrix x.
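
The definition and the note on zero posteriors can be sketched together in base R. The function below is a hypothetical helper written for illustration, not the package's code; it assumes `tau` is an n x M matrix of posterior probabilities and `alloc` gives, for each row, the column index of the allocated group.

```r
dbs_from_tau <- function(tau, alloc) {
  n <- nrow(tau)
  idx <- cbind(seq_len(n), alloc)
  tau0 <- tau[idx]                    # posterior of the allocated group m0
  tau_other <- tau
  tau_other[idx] <- -Inf              # mask m0 so the max runs over m != m0
  tau1 <- apply(tau_other, 1, max)    # largest posterior among other groups
  log_ratio <- log(tau0 / tau1)
  zero <- tau1 == 0                   # points covered by the note above
  out <- numeric(n)
  # Normalise by the largest absolute log ratio, excluding the zero cases.
  out[!zero] <- log_ratio[!zero] / max(abs(log_ratio[!zero]))
  out[zero] <- 1                      # dbs forced to 1 when tau_{m1} is zero
  out
}

# Toy posteriors for five points and two groups (made-up numbers).
tau <- rbind(c(0.9, 0.1), c(0.6, 0.4), c(0.2, 0.8), c(1.0, 0.0), c(0.3, 0.7))
alloc <- c(1, 1, 2, 1, 1)
d <- dbs_from_tau(tau, alloc)
round(d, 3)
```

In this toy input the fifth point is allocated to group 1 although group 2 has the higher posterior, so its value is negative; the fourth point has a zero competing posterior and is set to 1.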

See Menardi (2011) for a detailed treatment.

References

Menardi, G. (2011) Density-based Silhouette diagnostics for clustering methods. Statistics and Computing, 21, 295-308.

Examples

# load the package that provides dbs, pdfCluster and the wine data
library(pdfCluster)

#example 1: no groups in data
#random generation of group labels
set.seed(54321)
x <- rnorm(50)
groups <- sample(1:2, 50, replace = TRUE)
groups
dsil <- dbs(x = as.matrix(x), clusters=groups)
dsil
summary(dsil)
plot(dsil, labels=TRUE, lwd=6)

#example 2: wine data
# load data
data(wine)

# select a subset of variables
x <- wine[, c(2,5,8)]

#clustering
cl <- pdfCluster(x)
 
dsil <- dbs(cl)
plot(dsil)