FuzzyTokenSet: Fuzzy Token Set Comparator

Description

Compares a pair of token sets \(x\) and \(y\) by computing the optimal cost of transforming \(x\) into \(y\) using single-token operations (insertions, deletions and substitutions). The cost of each single-token operation is determined at the character level using an internal string comparator.

Usage

FuzzyTokenSet(
  inner_comparator = Levenshtein(normalize = TRUE),
  agg_function = base::mean,
  deletion = 1,
  insertion = 1,
  substitution = 1
)

Arguments

inner_comparator: inner string distance comparator of class StringComparator. Defaults to normalized Levenshtein distance.
agg_function: function used to aggregate the costs of the optimal operations. Defaults to base::mean.
deletion: non-negative weight associated with deletion of a token. Defaults to 1.
insertion: non-negative weight associated with insertion of a token. Defaults to 1.
substitution: non-negative weight associated with substitution of a token. Defaults to 1.

Details

A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets \(x\) and \(y\), this comparator computes the optimal cost of transforming \(x\) into \(y\) using the following single-token operations:

deleting a token \(a\) from \(x\) at cost \(w_d \times \mathrm{inner}(a, "")\)
inserting a token \(b\) from \(y\) into \(x\) at cost \(w_i \times \mathrm{inner}("", b)\)
substituting a token \(a\) in \(x\) for a token \(b\) in \(y\) at cost \(w_s \times \mathrm{inner}(a, b)\)

where \(\mathrm{inner}\) is an internal string comparator and \(w_d, w_i, w_s\) are non-negative weights, referred to as deletion, insertion and substitution in the parameter list. By default, the mean cost of the optimal set of operations is returned. Other methods of aggregating the costs are supported by specifying a non-default agg_function.
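As a hedged illustration of the agg_function argument, the snippet below compares the default mean with base::sum and base::max on the same pair of token sets. It assumes the comparator package is installed and is skipped otherwise. The variable names are chosen for illustration only.

```r
# Assumes the comparator package is installed; the block is skipped otherwise.
if (requireNamespace("comparator", quietly = TRUE)) {
  library(comparator)
  x <- strsplit("JOSE ELIAS TEJADA BASQUES", "\\s+")
  y <- strsplit("JOSE BASQUES", "\\s+")
  # Default aggregation: mean cost of the optimal operations
  mean_cost <- FuzzyTokenSet()(x, y)
  # Total cost of the optimal operations
  total_cost <- FuzzyTokenSet(agg_function = base::sum)(x, y)
  # Cost of the single most expensive operation
  worst_cost <- FuzzyTokenSet(agg_function = base::max)(x, y)
  print(c(mean = mean_cost, total = total_cost, worst = worst_cost))
}
```

Using base::sum makes the result grow with the number of tokens, while base::mean keeps it on the scale of a single operation.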

If the internal string comparator is a distance function, then the optimal set of operations minimizes the cost. Otherwise (for example, when it is a similarity function), the optimal set of operations maximizes the cost. The optimization problem is solved exactly using a linear sum assignment solver.
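The assignment formulation can be sketched in base R for small inputs. This is not the package's implementation: it is a brute-force illustration that assumes the inner comparator is the unnormalized Levenshtein distance from utils::adist, that "optimal" means minimal total cost, and that the shorter token set is padded with empty strings so that unmatched tokens become deletions or insertions. The function name token_set_cost and its arguments are hypothetical.

```r
# Illustrative brute-force sketch (hypothetical helper, not the package code).
# x, y: non-empty character vectors of tokens; w_d, w_i, w_s: operation weights.
token_set_cost <- function(x, y, w_d = 1, w_i = 1, w_s = 1, agg = mean) {
  n <- max(length(x), length(y))
  # Pad the shorter set with empty strings
  xp <- c(x, rep("", n - length(x)))
  yp <- c(y, rep("", n - length(y)))
  # Weight matrix: substitution by default, deletion for padded columns,
  # insertion for padded rows
  w <- matrix(w_s, n, n)
  if (length(y) < n) w[, (length(y) + 1):n] <- w_d
  if (length(x) < n) w[(length(x) + 1):n, ] <- w_i
  # Assumption: raw Levenshtein distance stands in for the inner comparator
  cost <- w * utils::adist(xp, yp)
  # Enumerate all permutations (feasible only for a handful of tokens)
  perms <- function(v) {
    if (length(v) <= 1) return(list(v))
    out <- list()
    for (i in seq_along(v)) {
      for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
    }
    out
  }
  best_total <- Inf
  best_costs <- NULL
  for (p in perms(seq_len(n))) {
    costs <- cost[cbind(seq_len(n), p)]  # cost of pairing row i with column p[i]
    if (sum(costs) < best_total) {
      best_total <- sum(costs)
      best_costs <- costs
    }
  }
  agg(best_costs)
}

fullname <- c("JOSE", "ELIAS", "TEJADA", "BASQUES")
name <- c("JOSE", "BASQUES")
# Dropping tokens is cheaper than adding them when w_d < w_i
token_set_cost(fullname, name, w_d = 0.5) < token_set_cost(name, fullname, w_d = 0.5)
```

A linear sum assignment solver finds the same optimum in polynomial time, whereas enumerating permutations grows factorially with the number of tokens.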

Examples

library(comparator)

## Compare names with heterogeneous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
FuzzyTokenSet()(x, y)
# Reduce the cost associated with missing words
FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y)

## Compare full name with abbreviated name, reducing the penalty 
## for dropping parts of the name
fullname <- "JOSE ELIAS TEJADA BASQUES"
name <- "JOSE BASQUES"
# Tokenize strings on white space
fullname <- strsplit(fullname, '\\s+')
name <- strsplit(name, '\\s+')
comparator <- FuzzyTokenSet(deletion = 0.5)
comparator(fullname, name) < comparator(name, fullname) # TRUE