For simplicity we assume x and y are strings in this section; however, the
comparator is also implemented for more general sequences.
If similarity = TRUE, a Levenshtein similarity is returned instead of a
distance. The similarity is defined as
$$\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},$$
where \(|x|\), \(|y|\) are the number of characters in \(x\) and
\(y\) respectively, \(\mathrm{dist}\) is the Levenshtein distance,
\(w_d\) is the cost of a deletion and \(w_i\) is the cost of an
insertion.
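The similarity definition above can be illustrated with a minimal, self-contained sketch in Python. It is not the package's implementation: the function names `levenshtein` and `similarity` are hypothetical, and the substitution cost `w_s` is an assumption, since this section only names the deletion and insertion costs.

```python
def levenshtein(x, y, w_d=1.0, w_i=1.0, w_s=1.0):
    """Weighted Levenshtein distance between sequences x and y.

    w_d and w_i are the deletion and insertion costs from the text;
    w_s (substitution cost) is an assumption made for this sketch.
    """
    # prev[j] = cost of turning the first i-1 items of x into the first j items of y
    prev = [j * w_i for j in range(len(y) + 1)]
    for i, xi in enumerate(x, start=1):
        # Turning i items of x into an empty sequence takes i deletions
        curr = [i * w_d]
        for j, yj in enumerate(y, start=1):
            substitute = prev[j - 1] + (0.0 if xi == yj else w_s)
            delete = prev[j] + w_d
            insert = curr[j - 1] + w_i
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def similarity(x, y, w_d=1.0, w_i=1.0, w_s=1.0):
    """Levenshtein similarity: (w_d * |x| + w_i * |y| - dist(x, y)) / 2."""
    dist = levenshtein(x, y, w_d, w_i, w_s)
    return (w_d * len(x) + w_i * len(y) - dist) / 2


print(levenshtein("kitten", "sitting"))  # 3.0
print(similarity("kitten", "sitting"))   # 5.0
```

With unit costs, "kitten" and "sitting" are 3 edits apart, so the similarity is (6 + 7 - 3) / 2 = 5. Because the functions only index and compare elements, they also accept lists or tuples, in line with the remark about more general sequences.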
Normalization of the Levenshtein distance/similarity to the unit interval
is also supported by setting normalize = TRUE. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion \(w_i\) and deletion \(w_d\) are equal.
The normalized distance \(\mathrm{dist}_n\) is defined as
$$\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},$$
and the normalized similarity \(\mathrm{sim}_n\) is defined as
$$\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.$$
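As a quick numerical check of the two normalized formulas, the sketch below evaluates them from a distance that is already known. The function names are hypothetical, and the handling of two empty strings (where the denominators are zero) is an assumption of this sketch rather than documented behaviour.

```python
def normalized_distance(dist, len_x, len_y, w_d=1.0, w_i=1.0):
    """dist_n = 2 * dist / (w_d * |x| + w_i * |y| + dist)."""
    total = w_d * len_x + w_i * len_y
    if total + dist == 0:
        # Both sequences empty: treated as distance 0 (assumption)
        return 0.0
    return 2 * dist / (total + dist)


def normalized_similarity(dist, len_x, len_y, w_d=1.0, w_i=1.0):
    """sim_n = sim / (w_d * |x| + w_i * |y| - sim)."""
    total = w_d * len_x + w_i * len_y
    sim = (total - dist) / 2
    if total - sim == 0:
        # Both sequences empty: treated as similarity 1 (assumption)
        return 1.0
    return sim / (total - sim)


# "kitten" vs "sitting" with unit costs: dist = 3, |x| = 6, |y| = 7
d_n = normalized_distance(3, 6, 7)
s_n = normalized_similarity(3, 6, 7)
print(d_n)  # 0.375
print(s_n)  # 0.625
```

Here the normalized distance is 2 * 3 / (13 + 3) = 0.375 and the normalized similarity is 5 / (13 - 5) = 0.625, so the two sum to 1 as the identity requires.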