pdfCluster: Clustering via nonparametric density estimation

Description

Cluster analysis is performed by the density-based procedures described in Azzalini and Torelli (2007) and Menardi and Azzalini (2014), and summarized in Azzalini and Menardi (2014).

Usage

# S4 method for numeric
pdfCluster(x, graphtype, hmult, Q = "QJ", lambda = 0.1, 
   grid.pairs = 10, n.grid = min(round((5 + sqrt(NROW(x))) * 4), NROW(x)), ...)
# S4 method for matrix
pdfCluster(x, graphtype, hmult, Q = "QJ", lambda = 0.1, 
	grid.pairs = 10, n.grid = min(round((5 + sqrt(NROW(x))) * 4), NROW(x)), ...)
# S4 method for data.frame
pdfCluster(x, graphtype, hmult, Q = "QJ", lambda = 0.1, 
	grid.pairs = 10, n.grid = min(round((5 + sqrt(NROW(x))) * 4), NROW(x)), ...)
	  
# S4 method for pdfCluster
pdfCluster(x, graphtype, hmult, Q, lambda = 0.1, 
	grid.pairs, n.grid = min(round((5 + sqrt(NROW(x@x))) * 4), NROW(x@x)), ...)

Value

An S4 object of pdfCluster-class with slots:

call

The matched call.

x

The matrix of data input. If a vector of data is provided as input, a one-column matrix is returned as output.

pdf

An object of class list providing information about the density estimate. It includes:

kernel character vector defining the kernel function used to estimate the density;
bwtype character vector defining if a fixed or an adaptive kernel estimator has been used;
par list of components defining the parameters used in density estimation;
estimate vector of density estimates evaluated at the data points.

See kepdf for further details.

nc

An object of class list defining details about the identification of the connected regions. It includes:

nc number of connected sets for each density level set.
p vector of level sets, giving the proportions of data with estimated density below a threshold.
id group label of each point at different sections of the density estimate. Negative values of id mean that the estimated density is below the considered threshold.
pq for each p gives the corresponding quantile q of the values of the density.

graph

An object of class list defining details about the graph built to find the connected sets of high density regions. Its length depends on the value of its first element:

type either "unidimensional", "delaunay" or "pairs", defines the procedure used to set edges among the observations. In the last case only, the list includes also the following elements:
comp.area a list containing the vector area and the matrix pairs.ord. The element \(i\) of vector area is the measure of the maximum valley in the density function linking the observations having row position as given in column \(i\) of pairs.ord.
lambda tolerance threshold.

cluster.cores

A vector with the same length as NROW(x), defining the cluster cores membership. NA values correspond to low density, unlabeled data, to be classified in the second phase of the procedure by the intarnal call of pdfClassification.

tree

Cluster tree with leaves corresponding to the connected components associated to different sections of the density estimate. The object is of class dendrogram.

noc

Number of clusters.

stages

List with elements corresponding to the data allocation to groups at the different stages of the classification procedure. NA values correspond to unlabeled data.

clusters

Set to NULL if n.stages = 0, that is, if data belonging to the cluster cores only have been allocated. Otherwise it reports the final label groups. This component is obsolete. Use function groups, instead.

Arguments

x: A vector, a matrix or a data frame of numeric data to be partitioned. Since density-based clustering is designed for continuous data only, if discrete data are provided, a warning message is displayed. Alternatively, x can be an object of pdfCluster-class itself, obtained when graphtype is set to "pairs". See Section Details below.
graphtype: Either "unidimensional", "delaunay" or "pairs", it defines the procedure used to build the graph associated with the data. If missing, a "delaunay" graph is built for data having dimension less than 7, otherwise a "pairs" graph is built. See details below. This argument has not to be set when x is of pdfCluster-class.
hmult: A shrink factor to be multiplied by the smoothing parameter h of function kepdf. If missing, it is taken to be 1 when data have dimension greater than 6, 0.75 otherwise.
Q: Optional arguments to be given when graphtype = "delaunay". See delaunayn in package geometry for further details. This argument has not to be set when graphtype = "pairs", when graphtype = "unidimensional" or when x is of pdfCluster-class.
lambda: Tolerance threshold to be used when graphtype = "pairs". An edge is set between two observations if the density function, evaluated along the segment linking them, does not exhibit any valley having a measure exceeding lambda. Its range is \([0,1)\) but a value larger than 0.3 is not recommended; default value is set to 0.10. This argument has not to be set when graphtype = "delaunay" or graphtype = "unidimensional".
grid.pairs: When graphtype = "pairs", this arguments defines the length of the grid of points along the segment linking each pair of observations, on which the density is evaluated. Default is 10. This argument has not to be set when graphtype = "delaunay", when graphtype = "unidimensional" or when x is of pdfCluster-class.
n.grid: Defines the length of the grid on which evaluating the connected components of the density level sets. The default value is set to the minimum between the number of data rows \(n\) and \(\lfloor{(5 + \sqrt(n))4 + 0.5}\rfloor\), an empirical rule of thumb which indicates that the length of the grid grows with the square root of the number of rows data.

...: Further arguments to be passed to kepdf or to pdfClassification.

Methods

signature(x="data.frame")
signature(x="matrix")
signature(x="numeric")
signature(x="pdfCluster")

Warning

It may happen that the variability of the estimated density is so high that not all jumps in the mode function can be detected by the selected grid scanning the density function. In that case, no output is produced and a message is displayed. As this may be associated to the occurrence of some spurious connected components, which appear and disappear within the range between two subsequent values of the grid, a natural solution is to increase the value of n.grid. Alternatively either lambda or hmult may be increased to alleviate the chance of detecting spurious connected components.

Using graphtype= 'delaunay' when the dimensionality \(d\) of data is greater than 6 is highly time-consuming unless the number of rows \(n\) is fairly small, since the number of operations to run the Delaunay triangulation grows exponentially with \(d\). Use graphtype= "pairs", instead, whose computational complexity grows quadratically with the number of observations.

Details

Clusters are associated to the connected components of the level sets of the density underlying the data. Density estimation is performed by kernel methods and the connected regions are approximated by the connected components of a graph built on data. Three alternative procedures to build the graph are adopted:

Unidimensional procedure: When data are univariate an edge is set between two observations when all the data points included in the segment between the two candidate observations belong to the same level set.
Delaunay triangulation: An edge is set between two observations when they are contiguous in the Voronoi diagram; see Azzalini and Torelli (2007).
Pairs evaluation: An edge is set between two observations when the density function, evaluated along the segment joining them, does not exhibit any valley having a relative amplitude greater than a tolerance threshold \(0 \le \lambda < 1\). Being a tolerance threshold, sensible values of \(\lambda\) are, in practice, included in \([0, 0.3]\); see Menardi and Azzalini (2013).

As the level set varies, the number of detected components gives rise to the tree of clusters, where each leave corresponds to a mode of the density function. Observations around these modes form a number of cluster cores, while the lower density observations are allocated according to a classification procedure; see also pdfClassification.

References

Azzalini, A., Menardi, G. (2014). Clustering via nonparametric density estimation: the R package pdfCluster. Journal of Statistical Software, 57(11), 1-26, URL http://www.jstatsoft.org/v57/i11/.

Azzalini A., Torelli N. (2007). Clustering via nonparametric density estimation. Statistics and Computing. 17, 71-80.

Menardi, G., Azzalini, A. (2014). An advancement in clustering via nonparametric density estimation. Statistics and Computing. DOI: 10.1007/s11222-013-9400-x.

Examples

Run this code

##########
#example 1
###########
# not run here for time reasons
#loading data
data(oliveoil)

#preparing data
olive1 <- 1 + oliveoil[, 3:10]
margin <- apply(data.matrix(olive1),1,sum)
olive1 <- olive1/margin
alr <- (-log( olive1[, -4]/olive1[, 4]))
#select the first 5 principal components
x <- princomp(alr, cor=TRUE)$scores[, 1:5]

#clustering
# not run here for time reasons
#cl <- pdfCluster(x, h = h.norm(x), hmult=0.75)
#summary(cl)
#plot(cl)

#comparing groups with original macro-area membership
#groups <- groups(cl)
#table(oliveoil$macro.area, groups)

#cluster cores
#table(groups(cl, stage = 0))

##########
#example 2
###########
# not run here for time reasons
# loading data
#data(wine)
#x <-wine[ ,-1]
#gr <- wine[ ,1]

# when data are high-dimensional, an adaptive kernel estimator is preferable 
# building the Delaunay graph entails a too high computational effort
# use option "pairs" to build the graph 
# it is the default option for dimension >6 


# cl <- pdfCluster(x, graphtype="pairs", bwtype="adaptive")
# summary(cl)
# plot(cl)

#comparison with original groups
#table(groups(cl),gr)

# a better classification is obtained with larger value of lambda
# not necessary to run the whole procedure again
# a pdfCluster method applies on pdfCluster-class objects!

#cl1 <- pdfCluster(cl, lambda=0.25)
#table(gr, groups(cl1))

Run the code above in your browser using DataLab