dissimilarity: Dissimilarity Between Partitions or Hierarchies

Description

Compute the dissimilarity between (ensembles) of partitions or hierarchies.

Usage

cl_dissimilarity(x, y = NULL, method = "euclidean")

Arguments

an ensemble of partitions or hierarchies, or something coercible to that (see cl_ensemble).

NULL (default), or as for x.

method

a character string specifying one of the built-in methods for computing dissimilarity, or a function to be taken as a user-defined method. If a character string, its lower-cased version is matched against the lower-cased names of the availabl

Value

If y is NULL, an object of class "cl_dissimilarity" containing the dissimilarities between all pairs of components of x. Otherwise, an object of class "cl_cross_dissimilarity" with the dissimilarities between the components of x and the components of y.

Details

If y is given, its components must be of the same kind as those of x (i.e., components must either all be partitions, or all be hierarchies).

If all components are partitions, the following built-in methods for measuring dissimilarity between two partitions with respective membership matrices $u$ and $v$ (brought to a common number of columns) are available:

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

For hard partitions, both Manhattan and squared Euclidean dissimilarity give twice the transfer distance (Charon et al., 2005), which is the minimum number of objects that must be removed so that the implied partitions (restrictions to the remaining objects) are identical. This is also known as the $R$-metric in Day (1981), i.e., the number of augmentations and removals of single objects needed to transform one partition into the other, and the partition-distance in Gusfield (2002), and equals twice the number of single element moves distance of Boorman and Arabie.

If all components are hierarchies, available built-in methods for measuring agreement between two hierarchies with respective ultrametrics $u$ and $v$ are as follows.

[object Object],[object Object],[object Object],[object Object],[object Object]

If a user-defined agreement method is to be employed, it must be a function taking two clusterings as its arguments.

Symmetric dissimilarity objects of class "cl_dissimilarity" are implemented as symmetric proximity objects with self-proximities identical to zero, and inherit from class "cl_proximity". They can be coerced to dense square matrices using as.matrix. It is possible to use 2-index matrix-style subscripting for such objects; unless this uses identical row and column indices, this results in a (non-symmetric dissimilarity) object of class "cl_cross_dissimilarity".

Symmetric dissimilarity objects also inherit from class "dist" (although they currently do not strictly extend this class), thus making it possible to use them directly for clustering algorithms based on dissimilarity matrices of this class, see the examples.

References

S. A. Boorman and P. Arabie (1972). Structural measures and the method of sorting. In R. N. Shepard, A. K. Romney, & S. B. Nerlove (eds.), Multidimensional scaling: Theory and applications in the behavioral sciences, 1: Theory (pages 225--249). New York: Seminar Press. I. Charon and L. Denoeud and A. Guénoche and O. Hudry (2005). Maximum Transfer Distance Between Partitions. Technical Report 2005D003, Ecole Nationale Supérieure des Télécommunications --- Paris. http://www.enst.fr/_data/files/docs/id_515_1128675112_271.pdf W. E. H. Day (1981). The complexity of computing metric distances between partitions. Mathematical Social Sciences, 1, 269--287. E. Dimitriadou and A. Weingessel and K. Hornik (2002). A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16, 901--912.

A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229--248.

D. Gusfield (2002). Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82, 159--164.

C. Rajski (1961). A metric space of discrete probability distributions, Information and Control, 4, 371--377.

J. Rubin, J (1967). Optimal classification into groups: An approach for solving the taxonomy problem. Journal of Theoretical Biology, 15, 103--144.

Examples

Run this code

## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 30]
diss <- cl_dissimilarity(pens)
summary(c(diss))
cl_dissimilarity(pens[1:5], pens[6:7])
## Equivalently, using subscripting.
diss[1:5, 6:7]
## Can use the dissimilarities for "secondary" clustering
## (e.g. obtaining hierarchies of partitions):
hc <- hclust(diss)
plot(hc)

## Example from Boorman and Arabie (1972).
P1 <- as.cl_partition(c(1, 2, 2, 2, 3, 3, 2, 2))
P2 <- as.cl_partition(c(1, 1, 2, 2, 3, 3, 4, 4))
cl_dissimilarity(P1, P2, "BA/A")
cl_dissimilarity(P1, P2, "BA/C")

Run the code above in your browser using DataLab