With $L(u) = \sum (x - u)^2$, the problem to be solved is
minimizing $L$ over all $u$ satisfying the ultrametric
constraints (i.e., for all $i, j, k$, $u_{ij} \le \max(u_{ik},
u_{jk})$). This problem is known to be NP-hard (Krivanek and
Moravek, 1986). Following de Soete (1986), we use a heuristic based on
a SUMT (Sequential Unconstrained Minimization Technique) approach,
which in turn simplifies the suggestions in Carroll and Pruzansky
(1980). One
iteratively minimizes $L(u) + \rho_k P(u)$, where $P(u)$ is a
non-negative function penalizing violations of the ultrametric
constraints such that $P(u)$ is zero iff $u$ is an
ultrametric. The $\rho$ values are increased according to the
rule $\rho_{k+1} = q \rho_k$ for some constant $q > 1$, until
convergence is obtained in the sense that the Euclidean distance
between successive solutions $u_k$ and $u_{k+1}$ is small
enough. We then use a final rounding step to ensure that the returned
object exactly satisfies the ultrametric constraints. The starting
value $u_0$ is obtained by random shaking of the given
dissimilarity object.
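The iterative scheme above can be sketched as follows. This is an
illustrative Python sketch, not the package's actual R implementation:
the squared-hinge form of the penalty $P$, the upper-triangle
parameterization, and the optimizer settings are assumptions made for
the example.

```python
import numpy as np
from scipy.optimize import minimize


def penalty(u):
    """Sum of squared violations of u[i,j] <= max(u[i,k], u[j,k]) over
    all triples; zero iff u is an ultrametric.  (One plausible choice
    of P, assumed here for illustration.)"""
    n = u.shape[0]
    p = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                p += max(0.0, u[i, j] - max(u[i, k], u[j, k])) ** 2
    return p


def sumt_ultrametric(x, rho=1.0, q=10.0, tol=1e-4, max_outer=10, seed=0):
    """Repeatedly minimize L(u) + rho_k * P(u) for increasing rho_k
    until successive solutions are close in Euclidean distance."""
    n = x.shape[0]
    iu = np.triu_indices(n, 1)      # parameterize u by its upper triangle
    rng = np.random.default_rng(seed)
    v = x[iu] + 0.1 * rng.standard_normal(iu[0].size)  # shaken start u_0

    def to_full(w):
        u = np.zeros((n, n))
        u[iu] = w
        return u + u.T

    for _ in range(max_outer):
        # L(u) + rho * P(u); rho is frozen into the inner objective
        obj = lambda w, r=rho: np.sum((x[iu] - w) ** 2) + r * penalty(to_full(w))
        w_new = minimize(obj, v, method="CG").x  # unconstrained inner step
        if np.linalg.norm(w_new - v) < tol:
            v = w_new
            break
        v, rho = w_new, q * rho                  # rho_{k+1} = q * rho_k
    return to_full(v)
```

A rounding step to the nearest exact ultrametric (omitted here) would
then be applied to the returned matrix.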
The unconstrained minimizations are carried out using either
optim
or nlm
, employing the
analytic gradients given in Carroll and Pruzansky (1980).
Additional control parameters can be provided via the control
argument.
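The availability of analytic gradients is what makes the inner steps
cheap. Treating the entries of $u$ independently (a simplification for
illustration; the exact expressions depend on the form of $P$ used),

$$\frac{\partial}{\partial u_{ij}} \big( L(u) + \rho_k P(u) \big)
  = -2\,(x_{ij} - u_{ij}) + \rho_k\,\frac{\partial P}{\partial u_{ij}},$$

so no numerical differentiation is needed in the inner minimizations.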
The default optimization using conjugate gradients should work
reasonably well for medium to large size problems. For small
ones, using nlm
is usually faster. Note that the number of
ultrametric constraints is of the order $n^3$, where $n$ is
the number of objects in the dissimilarity object, which suggests using
the SUMT approach rather than constrOptim
.
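The cubic growth is easy to count: each unordered pair $\{i, j\}$ is
constrained against every third object $k$, giving $3\binom{n}{3} =
n(n-1)(n-2)/2$ distinct constraints (a quick illustrative count, not
code from the package).

```python
from math import comb


def n_ultrametric_constraints(n):
    """One constraint u_ij <= max(u_ik, u_jk) per unordered pair {i, j}
    and third object k, i.e. 3 * C(n, 3) = n(n-1)(n-2)/2."""
    return 3 * comb(n, 3)


for n in (10, 100, 1000):
    print(n, n_ultrametric_constraints(n))
```

Already at a few hundred objects the constraint count reaches the
hundreds of millions, which is why handling the constraints via a
penalty is preferable to passing them to a constrained optimizer.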
It should be noted that the SUMT approach is a heuristic which cannot
be guaranteed to find the global minimum. Standard practice is to use
the best solution found in sufficiently many replications of the base
algorithm.
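The replication advice amounts to a simple best-of pattern. In this
sketch, `fit` and `loss` are placeholders standing in for one run of
the base algorithm (seeded per replication) and the criterion $L$:

```python
def best_of(n_runs, fit, loss):
    """Run a stochastic fitting routine n_runs times, one seed per run,
    and keep the solution with the smallest loss."""
    return min((fit(seed) for seed in range(n_runs)), key=loss)
```

For example, `best_of(10, lambda seed: sumt_run(x, seed), L)` would
keep the replication whose fitted ultrametric has the smallest
least-squares criterion (`sumt_run` being whatever base fit is used).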