
Compute the consensus clustering of an ensemble of partitions or hierarchies.
cl_consensus(x, method = NULL, weights = 1, control = list())
The consensus partition or hierarchy.
an ensemble of partitions or hierarchies, or something
coercible to that (see cl_ensemble
).
a character string specifying one of the built-in
methods for computing consensus clusterings, or a function to be
taken as a user-defined method, or NULL
(default value). If
a character string, its lower-cased version is matched against the
lower-cased names of the available built-in methods using
pmatch
. See Details for available built-in
methods and defaults.
a numeric vector with non-negative case weights.
Recycled to the number of elements in the ensemble given by x
if necessary.
a list of control parameters. See Details.
Consensus clusterings “synthesize” the information in the elements of a cluster ensemble into a single clustering, often by minimizing a criterion function measuring how dissimilar consensus candidates are from the (elements of) the ensemble (the so-called “optimization approach” to consensus clustering).
The most popular criterion functions are of the form cl_dissimilarity
), cl_medoid
). For
If all elements of the ensemble are partitions, the built-in consensus
methods compute consensus partitions by minimizing a criterion of the
form
"SE"
a fixed-point algorithm for obtaining soft
least squares Euclidean consensus partitions (i.e., for minimizing
This iterates between individually matching all partitions to the current approximation to the consensus partition, and computing the next approximation as the membership matrix closest to a suitable weighted average of the memberships of all partitions after permuting their columns for the optimal matchings of class ids.
The following control parameters are available for this method.
k
an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.
maxiter
an integer giving the maximal number of iterations to be performed. Defaults to 100.
nruns
an integer giving the number of runs to be performed. Defaults to 1.
reltol
the relative convergence tolerance.
Defaults to sqrt(.Machine$double.eps)
.
start
a matrix with number of rows equal to the
number of objects of the cluster ensemble, and
verbose
a logical indicating whether to provide
some output on minimization progress.
Defaults to getOption("verbose")
.
In the case of multiple runs, the first optimum found is returned.
This method can also be referred to as "soft/euclidean"
.
"GV1"
the fixed-point algorithm for the “first
model” in Gordon and Vichi (2001) for minimizing
This is similar to "SE"
, but uses GV1 rather than Euclidean
dissimilarity.
Available control parameters are the same as for "SE"
.
"DWH"
an extension of the greedy algorithm in
Dimitriadou, Weingessel and Hornik (2002) for (approximately)
obtaining soft least squares Euclidean consensus partitions.
The reference provides some structure theory relating finding
the consensus partition to an instance of the multiple assignment
problem, which is known to be NP-hard, and suggests a simple
heuristic based on successively matching an individual partition
The following control parameters are available for this method.
k
an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.
order
a permutation of the integers from 1 to the size of the ensemble, specifying the order in which the partitions in the ensemble should be aggregated. Defaults to using a random permutation (unlike the reference, which does not permute at all).
"HE"
a fixed-point algorithm for obtaining hard
least squares Euclidean consensus partitions (i.e., for minimizing
Available control parameters are the same as for "SE"
.
This method can also be referred to as "hard/euclidean"
.
"SM"
a fixed-point algorithm for obtaining soft
median Manhattan consensus partitions (i.e., for minimizing
Available control parameters are the same as for "SE"
.
This method can also be referred to as "soft/manhattan"
.
"HM"
a fixed-point algorithm for obtaining hard
median Manhattan consensus partitions (i.e., for minimizing
Available control parameters are the same as for "SE"
.
This method can also be referred to as "hard/manhattan"
.
"GV3"
a SUMT algorithm for the “third
model” in Gordon and Vichi (2001) for minimizing sumt
for more information on the SUMT
approach.) This optimization problem is equivalent to finding the
membership matrix
Available control parameters are method
, control
,
eps
, q
, and verbose
, which have the same
roles as for sumt
, and the following.
k
an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.
nruns
an integer giving the number of runs to be performed. Defaults to 1.
start
a matrix with number of rows equal to the
size of the cluster ensemble, and
In the case of multiple runs, the first optimum found is returned.
"soft/symdiff"
a SUMT approach for
minimizing
Available control parameters are the same as for "GV3"
.
"hard/symdiff"
an exact solver for minimizing
k
), where
By default, method "SE"
is used for ensembles of partitions.
If all elements of the ensemble are hierarchies, the following built-in methods for computing consensus hierarchies are available.
"euclidean"
an algorithm for minimizing
ls_fit_ultrametric
on
This method can also be referred to as "cophenetic"
.
"manhattan"
a SUMT for minimizing
Available control parameters are the same as for
"euclidean"
.
"majority"
a hierarchy obtained from an extension of
the majority consensus tree of Margush and McMorris (1981), which
minimizes
The fraction p
.
By default, method "euclidean"
is used for ensembles of
hierarchies.
If a user-defined consensus method is to be employed, it must be a
function taking the cluster ensemble, the case weights, and a list of
control parameters as its arguments, with formals named x
,
weights
, and control
, respectively.
Most built-in methods use heuristics for solving hard optimization problems, and cannot be guaranteed to find a global minimum. Standard practice would recommend to use the best solution found in “sufficiently many” replications of the methods.
E. Dimitriadou, A. Weingessel and K. Hornik (2002).
A combination scheme for fuzzy clustering.
International Journal of Pattern Recognition and Artificial
Intelligence, 16, 901--912.
tools:::Rd_expr_doi("10.1142/S0218001402002052").
A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229--248. tools:::Rd_expr_doi("10.1007/BF02294837").
A. D. Gordon (1999). Classification (2nd edition). Boca Raton, FL: Chapman & Hall/CRC.
T. Margush and F. R. McMorris (1981).
Consensus
cl_medoid
,
consensus
## Consensus partition for the Rosenberg-Kim kinship terms partition
## data based on co-membership dissimilarities.
data("Kinship82")
m1 <- cl_consensus(Kinship82, method = "GV3",
control = list(k = 3, verbose = TRUE))
## (Note that one should really use several replicates of this.)
## Value for criterion function to be minimized:
sum(cl_dissimilarity(Kinship82, m1, "comem") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
data("Kinship82_Consensus")
m2 <- Kinship82_Consensus[["JMF"]]
sum(cl_dissimilarity(Kinship82, m2, "comem") ^ 2)
## Seems we get a better solution ...
## How dissimilar are these solutions?
cl_dissimilarity(m1, m2, "comem")
## How "fuzzy" are they?
cl_fuzziness(cl_ensemble(m1, m2))
## Do the "nearest" hard partitions fully agree?
cl_dissimilarity(as.cl_hard_partition(m1),
as.cl_hard_partition(m2))
## Consensus partition for the Gordon and Vichi (2001) macroeconomic
## partition data based on Euclidean dissimilarities.
data("GVME")
set.seed(1)
## First, using k = 2 classes.
m1 <- cl_consensus(GVME, method = "GV1",
control = list(k = 2, verbose = TRUE))
## (Note that one should really use several replicates of this.)
## Value of criterion function to be minimized:
sum(cl_dissimilarity(GVME, m1, "GV1") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
data("GVME_Consensus")
m2 <- GVME_Consensus[["MF1/2"]]
sum(cl_dissimilarity(GVME, m2, "GV1") ^ 2)
## Seems we get a slightly better solution ...
## But note that
cl_dissimilarity(m1, m2, "GV1")
## and that the maximal deviation of the memberships is
max(abs(cl_membership(m1) - cl_membership(m2)))
## so the differences seem to be due to rounding.
## Do the "nearest" hard partitions fully agree?
table(cl_class_ids(m1), cl_class_ids(m2))
## And now for k = 3 classes.
m1 <- cl_consensus(GVME, method = "GV1",
control = list(k = 3, verbose = TRUE))
sum(cl_dissimilarity(GVME, m1, "GV1") ^ 2)
## Compare to the consensus solution given in Gordon & Vichi (2001).
m2 <- GVME_Consensus[["MF1/3"]]
sum(cl_dissimilarity(GVME, m2, "GV1") ^ 2)
## This time we look much better ...
## How dissimilar are these solutions?
cl_dissimilarity(m1, m2, "GV1")
## Do the "nearest" hard partitions fully agree?
table(cl_class_ids(m1), cl_class_ids(m2))
Run the code above in your browser using DataLab