Learn R Programming

comparator (version 0.1.3)

MongeElkan: Monge-Elkan Token Comparator

Description

Compares a pair of token sets \(x\) and \(y\) by computing similarity scores between all pairs of tokens using an internal string comparator, then taking the mean of the maximum scores for each token in \(x\).

Usage

MongeElkan(
  inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE),
  agg_function = base::mean,
  symmetrize = FALSE
)

Value

A MongeElkan instance is returned, which is an S4 class inheriting from StringComparator.

Arguments

inner_comparator

internal string comparator of class StringComparator. Defaults to Levenshtein similarity.

agg_function

aggregation function to use when aggregating internal similarities/distances between tokens. Defaults to mean, however hmean may be a better choice when the comparator returns normalized similarity scores.

symmetrize

logical indicating whether to use a symmetrized version of the Monge-Elkan comparator. Defaults to FALSE.

Details

A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets \(x\) and \(y\), the Monge-Elkan comparator is defined as: $$\mathrm{ME}(x, y) = \frac{1}{|x|} \sum_{i = 1}^{|x|} \max_j \mathrm{sim}(x_i, y_j)$$ where \(x_i\) is the i-th token in \(x\), \(|x|\) is the number of tokens in \(x\) and \(\mathrm{sim}\) is an internal string similarity comparator.

A generalization of the original Monge-Elkan comparator is implemented here, which allows for distance comparators in place of similarity comparators, and/or more general aggregation functions in place of the arithmetic mean. The generalized Monge-Elkan comparator is defined as: $$\mathrm{ME}(x, y) = \mathrm{agg}(\mathrm{opt}_j \ \mathrm{inner}(x_i, y_j))$$ where \(\mathrm{inner}\) is an internal distance or similarity comparator, \(\mathrm{opt}\) is \(\max\) if \(\mathrm{inner}\) is a similarity comparator or \(\min\) if it is a distance comparator, and \(\mathrm{agg}\) is an aggregation function which takes a vector of scores for each token in \(x\) and returns a scalar.

By default, the Monge-Elkan comparator is asymmetric in its arguments \(x\) and \(y\). If symmetrize = TRUE, a symmetric version of the comparator is obtained as follows $$\mathrm{ME}_{sym}(x, y) = \mathrm{opt} \ \{\mathrm{ME}(x, y), \mathrm{ME}(y, x)\}$$ where \(\mathrm{opt}\) is defined above.

References

Monge, A. E., & Elkan, C. (1996), "The Field Matching Problem: Algorithms and Applications", In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 267-270.

Jimenez, S., Becerra, C., Gelbukh, A., & Gonzalez, F. (2009), "Generalized Monge-Elkan Method for Approximate Text String Comparison", In Computational Linguistics and Intelligent Text Processing, pp. 559-570.

Examples

Run this code
## Compare names with heterogenous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
MongeElkan()(x, y)

## The symmetrized variant is arguably more appropriate for this example
MongeElkan(symmetrize = TRUE)(x, y) 

## Using a different internal comparator changes the result
MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)

Run the code above in your browser using DataLab