A token set is an unordered collection of tokens, which may include
duplicates (i.e. a multiset).
Given two token sets \(x\) and \(y\), the Monge-Elkan comparator is
defined as:
$$\mathrm{ME}(x, y) = \frac{1}{|x|} \sum_{i = 1}^{|x|} \max_j \mathrm{sim}(x_i, y_j)$$
where \(x_i\) is the i-th token in \(x\), \(y_j\) is the j-th token in
\(y\), \(|x|\) is the number of tokens in \(x\), and \(\mathrm{sim}\) is
an internal string similarity comparator.
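The definition above can be sketched in a few lines of Python. This is an illustrative sketch only, not this package's implementation: the names `levenshtein`, `lev_sim` and `monge_elkan` are hypothetical, the normalized Levenshtein similarity is just one assumed choice for \(\mathrm{sim}\), and both token sets are assumed to be non-empty.

```python
from typing import Callable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lev_sim(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; an assumed choice for sim."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def monge_elkan(x: Sequence[str], y: Sequence[str],
                sim: Callable[[str, str], float] = lev_sim) -> float:
    """Mean, over tokens of x, of the best similarity to any token of y."""
    if not x or not y:
        raise ValueError("this sketch assumes non-empty token sets")
    return sum(max(sim(xi, yj) for yj in y) for xi in x) / len(x)


# Every token of x has a perfect match in y, but not the other way round.
print(monge_elkan(["paul"], ["paul", "johnson"]))
print(monge_elkan(["paul", "johnson"], ["paul"]))
```

The two calls return different scores because the average runs only over the tokens of the first argument, which is why the comparator is not symmetric in general.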
A generalization of the original Monge-Elkan comparator is implemented here,
which allows for distance comparators in place of similarity comparators,
and/or more general aggregation functions in place of the arithmetic mean.
The generalized Monge-Elkan comparator is defined as:
$$\mathrm{ME}(x, y) = \mathrm{agg}(\mathrm{opt}_j \ \mathrm{inner}(x_i, y_j))$$
where \(\mathrm{inner}\) is an internal distance or similarity
comparator, \(\mathrm{opt}\) is \(\max\) if
\(\mathrm{inner}\) is a similarity comparator or \(\min\) if
it is a distance comparator, and \(\mathrm{agg}\) is an aggregation
function which takes the vector of optimal scores (one for each token
\(x_i\) in \(x\)) and returns a scalar.
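As a rough illustration of the generalized form, the following Python sketch selects \(\mathrm{opt}\) from a flag and accepts any aggregation function. It is not this package's interface: `generalized_monge_elkan`, `is_distance`, `agg` and `exact_distance` are hypothetical names chosen for the example, and the 0/1 exact-match distance is an assumed toy choice for \(\mathrm{inner}\).

```python
from statistics import mean
from typing import Callable, Sequence


def generalized_monge_elkan(
    x: Sequence[str],
    y: Sequence[str],
    inner: Callable[[str, str], float],
    is_distance: bool = False,
    agg: Callable[[Sequence[float]], float] = mean,
) -> float:
    """agg over tokens x_i of opt_j inner(x_i, y_j)."""
    if not x or not y:
        raise ValueError("this sketch assumes non-empty token sets")
    # opt is min for a distance (closest token), max for a similarity.
    opt = min if is_distance else max
    scores = [opt(inner(xi, yj) for yj in y) for xi in x]
    return agg(scores)


def exact_distance(a: str, b: str) -> float:
    """Toy inner distance: 0 for identical tokens, 1 otherwise."""
    return 0.0 if a == b else 1.0


x = ["new", "york", "city"]
y = ["york", "new"]

# Mean aggregation: the fraction of tokens in x with no exact match in y.
print(generalized_monge_elkan(x, y, exact_distance, is_distance=True))

# Max aggregation: the worst-matched token in x determines the score.
print(generalized_monge_elkan(x, y, exact_distance, is_distance=True, agg=max))
```

Passing the arithmetic mean as `agg` together with a similarity for `inner` recovers the original definition.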
By default, the Monge-Elkan comparator is asymmetric in its arguments \(x\)
and \(y\). If symmetrize = TRUE, a symmetric version of the comparator
is obtained as follows:
$$\mathrm{ME}_{sym}(x, y) = \mathrm{opt} \ \{\mathrm{ME}(x, y), \mathrm{ME}(y, x)\}$$
where \(\mathrm{opt}\) is as defined above.
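The symmetrization step can be sketched as follows. Again this is a hedged Python illustration rather than the package's code: `directed_me`, `symmetric_me`, `is_distance` and the exact-match comparators are hypothetical names, the mean is assumed as the aggregation function, and both token sets are assumed to be non-empty.

```python
from statistics import mean
from typing import Callable, Sequence


def directed_me(x: Sequence[str], y: Sequence[str],
                inner: Callable[[str, str], float], opt) -> float:
    """Asymmetric score: mean over tokens of x of opt_j inner(x_i, y_j)."""
    return mean([opt(inner(xi, yj) for yj in y) for xi in x])


def symmetric_me(x: Sequence[str], y: Sequence[str],
                 inner: Callable[[str, str], float],
                 is_distance: bool = False) -> float:
    """opt of the two directed scores, with the same opt as the inner step."""
    opt = min if is_distance else max
    return opt(directed_me(x, y, inner, opt), directed_me(y, x, inner, opt))


def exact_sim(a: str, b: str) -> float:
    """Toy similarity: 1 for identical tokens, 0 otherwise."""
    return 1.0 if a == b else 0.0


def exact_dist(a: str, b: str) -> float:
    """Toy distance: 0 for identical tokens, 1 otherwise."""
    return 0.0 if a == b else 1.0


x, y = ["a", "b"], ["a"]

# The directed scores differ (0.5 versus 1.0 for the similarity) ...
print(directed_me(x, y, exact_sim, max), directed_me(y, x, exact_sim, max))

# ... while the symmetrized score does not depend on argument order.
print(symmetric_me(x, y, exact_sim), symmetric_me(y, x, exact_sim))
print(symmetric_me(x, y, exact_dist, is_distance=True))
```

Because \(\mathrm{opt}\) picks the more favourable of the two directions, the symmetric score for this pair is the best possible value (1 for the similarity, 0 for the distance) even though only one set is fully covered by the other.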