MongeElkan: Monge-Elkan Token Comparator

Description

Compares a pair of token sets $x$ and $y$ by computing similarity scores between all pairs of tokens using an internal string comparator, then taking the mean of the maximum scores for each token in $x$.

Usage

MongeElkan(
  inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE),
  agg_function = base::mean,
  symmetrize = FALSE
)

Value

A MongeElkan instance is returned, which is an S4 class inheriting from StringComparator.

Arguments

inner_comparator: internal string comparator of class StringComparator. Defaults to Levenshtein similarity.
agg_function: aggregation function to use when aggregating internal similarities/distances between tokens. Defaults to mean, however hmean may be a better choice when the comparator returns normalized similarity scores.
symmetrize: logical indicating whether to use a symmetrized version of the Monge-Elkan comparator. Defaults to FALSE.

Details

A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets $x$ and $y$, the Monge-Elkan comparator is defined as: $$\mathrm{ME}(x, y) = \frac{1}{|x|} \sum_{i = 1}^{|x|} \max_j \mathrm{sim}(x_i, y_j)$$ where $x_i$ is the i-th token in $x$, $|x|$ is the number of tokens in $x$ and $\mathrm{sim}$ is an internal string similarity comparator.

A generalization of the original Monge-Elkan comparator is implemented here, which allows for distance comparators in place of similarity comparators, and/or more general aggregation functions in place of the arithmetic mean. The generalized Monge-Elkan comparator is defined as: $$\mathrm{ME}(x, y) = \mathrm{agg}(\mathrm{opt}_j \ \mathrm{inner}(x_i, y_j))$$ where $\mathrm{inner}$ is an internal distance or similarity comparator, $\mathrm{opt}$ is $\max$ if $\mathrm{inner}$ is a similarity comparator or $\min$ if it is a distance comparator, and $\mathrm{agg}$ is an aggregation function which takes a vector of scores for each token in $x$ and returns a scalar.

By default, the Monge-Elkan comparator is asymmetric in its arguments $x$ and $y$. If symmetrize = TRUE, a symmetric version of the comparator is obtained as follows $$\mathrm{ME}_{sym}(x, y) = \mathrm{opt} \ \{\mathrm{ME}(x, y), \mathrm{ME}(y, x)\}$$ where $\mathrm{opt}$ is defined above.

References

Monge, A. E., & Elkan, C. (1996), "The Field Matching Problem: Algorithms and Applications", In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 267-270.

Jimenez, S., Becerra, C., Gelbukh, A., & Gonzalez, F. (2009), "Generalized Monge-Elkan Method for Approximate Text String Comparison", In Computational Linguistics and Intelligent Text Processing, pp. 559-570.

Examples

Run this code

## Compare names with heterogenous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
MongeElkan()(x, y)

## The symmetrized variant is arguably more appropriate for this example
MongeElkan(symmetrize = TRUE)(x, y) 

## Using a different internal comparator changes the result
MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)

Run the code above in your browser using DataLab