mergenormals: Clustering by merging Gaussian mixture components

Description

Clusters data by merging Gaussian mixture components. Starting from an initial mclust clustering, it computes all methods introduced in Hennig (2010). See the Details section.

Usage

mergenormals(xdata, mclustsummary=NULL, 
                         clustering, probs, muarray, Sigmaarray, z,
                         method=NULL, cutoff=NULL, by=0.005,
                         numberstop=NULL, renumber=TRUE, M=50, ...)
  # S3 method for mergenorm
summary(object, ...)
  # S3 method for summary.mergenorm
print(x, ...)

Arguments

xdata

data (something that can be coerced into a matrix).

mclustsummary

output object from summary.mclustBIC for xdata. Either mclustsummary or all of clustering, probs, muarray, Sigmaarray and z need to be specified (the latter are obtained from mclustsummary if they are not provided). I am not aware of restrictions on the use of mclustBIC to produce an initial clustering; covariance matrix models can be restricted and a noise component can be included if desired, although I have probably not tested all possibilities.

clustering

vector of integers. Initial assignment of data to mixture components.

probs

vector of component proportions (for all components; should sum up to one).

muarray

matrix of component means (rows).

Sigmaarray

array of component covariance matrices (third dimension refers to component number).

z

matrix of observation- (row-)wise posterior probabilities of belonging to the components (columns).

method

one of "bhat", "ridge.uni", "ridge.ratio", "demp", "dipuni", "diptantrum", "predictive". See details.

cutoff

numeric between 0 and 1. Tuning constant, see details and Hennig (2010). If not specified, the default values given in (9) in Hennig (2010) are used.

by

real between 0 and 1. Interval width for density computation along the ridgeline, used for methods "ridge.uni" and "ridge.ratio". Methods "dipuni" and "diptantrum" require ridgeline computations and use it as well.

numberstop

integer. If specified, cutoff is ignored and components are merged until the number of clusters specified here is reached.

renumber

logical. If TRUE merged clusters are renumbered from 1 to their number. If not, numbers of the original clustering are used (numbers of components that were merged into others then will not appear).

M

integer. Number of times the dataset is divided into two halves. Used if method="predictive".

...

additional optional parameters to pass on to ridgeline.diagnosis or mixpredictive (in mergenormals).

object

object of class mergenorm, output of mergenormals.

x

object of class summary.mergenorm, output of summary.mergenorm.

Value

mergenormals returns an object of class mergenorm, which is a list with components

clustering

integer vector. Final clustering.

clusternumbers

vector of numbers of remaining clusters. These are given in terms of the original clusters even if renumber=TRUE, in which case they may be needed to understand the numbering of some further components, see below.

defunct.components

vector of numbers of components that were "merged away".

valuemerged

vector of values of the merging criterion (see details) at which components were merged.

mergedtonumbers

vector of numbers of clusters to which the original components were merged.

parameters

a list, if mclustsummary was provided. Entry no. i refers to number i in clusternumbers. The list entry i contains the parameters of the original mixture components that make up cluster i, as extracted by extract.mixturepars.

predvalues

vector of prediction strength values for clusternumbers from 1 to the number of components in the original mixture, if method="predictive". See mixpredictive.

orig.decisionmatrix

square matrix with entries giving the original values of the merging criterion (see details) for every pair of original mixture components.

new.decisionmatrix

square matrix as orig.decisionmatrix, but with final entries; numbering of rows and columns corresponds to clusternumbers; all entries corresponding to other rows and columns can be ignored.

probs

final cluster values of probs (see arguments) for merged components, generated by (potentially repeated) execution of mergeparameters out of the original ones. Numbered according to clusternumbers.

muarray

final cluster means, analogous to probs.

Sigmaarray

final cluster covariance matrices, analogous to probs.

z

final matrix of posterior probabilities of observations belonging to the clusters, analogous to probs.

noise

logical. If TRUE, a noise component was fitted in the initial mclust clustering (see help for initialization in mclustBIC). In this case, cluster number 0 indicates noise. Noise is ignored by the merging methods and kept as it was originally.

method

as above.

cutoff

as above.

summary.mergenorm returns a list with components clustering, clusternumbers, defunct.components, valuemerged, mergedtonumbers, predvalues, probs, muarray, Sigmaarray, z, noise, method, cutoff as above, plus onc (original number of components) and mnc (number of clusters after merging).
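
The relation between renumbered cluster labels and clusternumbers can be illustrated with a toy sketch in base R. It does not call mergenormals; the vectors orig_labels, renumbered and recovered are made up for illustration, and the assumption is that the i-th entry of clusternumbers holds the original component number of renumbered cluster i (noise label 0 is not handled here).

```r
# Toy labels after merging, still in the original component numbering:
# components 2 and 5 were merged into others, so they no longer appear.
orig_labels <- c(1, 3, 3, 4, 1, 4, 3)

# Surviving original component numbers (plays the role of clusternumbers).
clusternumbers <- sort(unique(orig_labels))

# Renumber to 1..k, as described for renumber=TRUE.
renumbered <- match(orig_labels, clusternumbers)

# Map the renumbered labels back to the original component numbers.
recovered <- clusternumbers[renumbered]

print(renumbered)
print(recovered)
```

Under this assumption, indexing clusternumbers by the final clustering gives the original component number that each cluster is reported under.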

Details

Mixture components are merged in a hierarchical fashion. The merging criterion is computed for all pairs of current clusters, and the two clusters with the highest criterion value (lowest for method="predictive") are merged. Then criterion values are recomputed for the merged cluster. Merging continues until the criterion value of the next merge is below (or above, for method="predictive") the cutoff value. Details are given in Hennig (2010). The following criteria are offered, specified by the method argument.

"ridge.uni": components are only merged if their mixture is unimodal according to Ray and Lindsay's (2005) ridgeline theory, see ridgeline.diagnosis. This ignores argument cutoff.
"ridge.ratio": ratio between density minimum between components and minimum of density maxima according to Ray and Lindsay's (2005) ridgeline theory, see ridgeline.diagnosis.
"bhat": Bhattacharyya upper bound on misclassification probability between two components, see bhattacharyya.matrix.
"demp": direct estimation of misclassification probability between components, see Hennig (2010).
"dipuni": this uses method="ridge.ratio" to decide which clusters to merge but stops merging according to the p-value of the dip test computed as in Hartigan and Hartigan (1985), see dip.test.
"diptantrum": as "dipuni", but p-value of dip test computed as in Tantrum, Murua and Stuetzle (2003), see dipp.tantrum.
"predictive": this uses method="demp" to decide which clusters to merge but stops merging according to the value of prediction strength (Tibshirani and Walther, 2005) as computed in mixpredictive.
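
The hierarchical merging scheme described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the code used by mergenormals. The helpers bhat_coef, merge_two and merge_components are hypothetical names. The criterion is exp(-d), with d the Bhattacharyya distance between two Gaussians, used here as a stand-in for the "bhat" criterion. Merged parameters are obtained by simple moment matching, which may differ from what mergeparameters does. For brevity all pairwise values are recomputed in every step.

```r
# exp(-d), where d is the Bhattacharyya distance between two Gaussians.
bhat_coef <- function(mu1, Sigma1, mu2, Sigma2) {
  Sigma1 <- as.matrix(Sigma1)
  Sigma2 <- as.matrix(Sigma2)
  Sbar <- (Sigma1 + Sigma2) / 2
  dmu <- mu1 - mu2
  d <- sum(dmu * solve(Sbar, dmu)) / 8 +
    0.5 * log(det(Sbar) / sqrt(det(Sigma1) * det(Sigma2)))
  exp(-d)
}

# Moment-matching merge of two weighted Gaussian components (an assumption).
merge_two <- function(p1, mu1, S1, p2, mu2, S2) {
  p <- p1 + p2
  w1 <- p1 / p
  w2 <- p2 / p
  mu <- w1 * mu1 + w2 * mu2
  S <- w1 * (S1 + tcrossprod(mu1 - mu)) + w2 * (S2 + tcrossprod(mu2 - mu))
  list(prob = p, mu = mu, Sigma = S)
}

# Merge the pair with the highest criterion value until it drops below cutoff.
merge_components <- function(probs, mus, Sigmas, cutoff) {
  members <- as.list(seq_along(probs))  # original components in each cluster
  repeat {
    k <- length(probs)
    if (k < 2) break
    best <- -Inf
    pair <- NULL
    for (i in 1:(k - 1)) for (j in (i + 1):k) {
      v <- bhat_coef(mus[[i]], Sigmas[[i]], mus[[j]], Sigmas[[j]])
      if (v > best) {
        best <- v
        pair <- c(i, j)
      }
    }
    if (best < cutoff) break
    i <- pair[1]
    j <- pair[2]
    m <- merge_two(probs[i], mus[[i]], Sigmas[[i]],
                   probs[j], mus[[j]], Sigmas[[j]])
    probs[i] <- m$prob
    mus[[i]] <- m$mu
    Sigmas[[i]] <- m$Sigma
    members[[i]] <- c(members[[i]], members[[j]])
    probs <- probs[-j]
    mus <- mus[-j]
    Sigmas <- Sigmas[-j]
    members <- members[-j]
  }
  list(probs = probs, mus = mus, Sigmas = Sigmas, members = members)
}

# Toy univariate mixture: two overlapping components and one far away.
res <- merge_components(
  probs = c(0.3, 0.3, 0.4),
  mus = list(0, 0.5, 10),
  Sigmas = list(matrix(1), matrix(1), matrix(1)),
  cutoff = 0.1  # illustrative value only
)
print(res$members)
```

In this toy mixture the two overlapping components are merged and the distant one is left alone, so two clusters remain.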

References

Hartigan, J. A. and Hartigan, P. M. (1985) The Dip Test of Unimodality, Annals of Statistics, 13, 70-84.

Hennig, C. (2010) Methods for merging Gaussian mixture components, Advances in Data Analysis and Classification, 4, 3-34.

Ray, S. and Lindsay, B. G. (2005) The Topography of Multivariate Normal Mixtures, Annals of Statistics, 33, 2042-2065.

Tantrum, J., Murua, A. and Stuetzle, W. (2003) Assessment and Pruning of Hierarchical Model Based Clustering, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C., 197-205.

Tibshirani, R. and Walther, G. (2005) Cluster Validation by Prediction Strength, Journal of Computational and Graphical Statistics, 14, 511-528.

Examples

# NOT RUN {
  require(fpc)
  require(mclust)
  require(MASS)
  options(digits=3)
  data(crabs)
  dc <- crabs[,4:8]
  cm <- mclustBIC(crabs[,4:8],G=9,modelNames="EEE")
  scm <- summary(cm,crabs[,4:8])
  cmnbhat <- mergenormals(crabs[,4:8],scm,method="bhat")
  summary(cmnbhat)
  cmndemp <- mergenormals(crabs[,4:8],scm,method="demp")
  summary(cmndemp)
# Other methods take a bit longer, but try them!
# The values of by and M below are still chosen for reasonably fast execution.
# cmnrr <- mergenormals(crabs[,4:8],scm,method="ridge.ratio",by=0.05)
# cmd <- mergenormals(crabs[,4:8],scm,method="diptantrum",by=0.05)
# cmp <- mergenormals(crabs[,4:8],scm,method="predictive",M=3)
# }