InVocabulary: In-Vocabulary Comparator

Description

Compares a pair of strings \(x\) and \(y\) using a reference vocabulary. Different scores are returned depending on whether both/one/neither of \(x\) and \(y\) are in the reference vocabulary.

Usage

InVocabulary(
  vocab,
  both_in_distinct = 0.7,
  both_in_same = 1,
  one_in = 1,
  none_in = 1,
  ignore_case = FALSE
)

Value

An InVocabulary instance is returned, which is an S4 class inheriting from StringComparator.

Arguments

vocab: a vector containing in-vocabulary (known) strings. Any strings not in this vector are out-of-vocabulary (unknown).
both_in_distinct: score to return if the pair of values being compared are both in vocab and distinct. Defaults to 0.7, which is appropriate when multiplying by similarity scores. If multiplying by distance scores, a value greater than 1 is likely to be more appropriate.
both_in_same: score to return if the pair of values being compared are both in vocab and identical. Defaults to 1.0, which would leave another score unchanged when multiplied by this one.
one_in: score to return if only one of the pair of values being compared is in vocab. Defaults to 1.0, which would leave another score unchanged when multiplied by this one.
none_in: score to return if neither of the pair of values being compared is in vocab. Defaults to 1.0, which would leave another score unchanged when multiplied by this one.
ignore_case: a logical. If TRUE, case is ignored when comparing the strings.
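
The scoring rule described by the arguments above can be sketched in base R. This is only an illustration of the four cases, not the package's implementation: the helper name in_vocab_score is hypothetical, and applying case folding to the vocabulary as well as to the inputs is an assumption.

```r
# Hypothetical helper: a base-R sketch of the scoring rule, not the package's code.
in_vocab_score <- function(x, y, vocab,
                           both_in_distinct = 0.7, both_in_same = 1,
                           one_in = 1, none_in = 1, ignore_case = FALSE) {
  if (ignore_case) {
    # Assumption: case folding applies to the vocabulary as well as the inputs.
    x <- tolower(x)
    y <- tolower(y)
    vocab <- tolower(vocab)
  }
  # Number of known strings in each pair: 0, 1 or 2.
  n_in <- (x %in% vocab) + (y %in% vocab)
  ifelse(n_in == 2,
         ifelse(x == y, both_in_same, both_in_distinct),
         ifelse(n_in == 1, one_in, none_in))
}

known_names <- c("Roberto", "Umberto", "Alberto", "Emberto", "Norberto", "Humberto")
# Pairs: one known name, two distinct known names, two identical known names.
scores <- in_vocab_score("Emberto", c("Enberto", "Umberto", "Emberto"), known_names)
print(scores)
```

With the default settings only the middle pair, two distinct known names, receives a factor below 1, so it is the only pair whose similarity would be reduced when multiplied by another comparator's score.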

Details

This comparator is not intended to produce useful scores on its own. Rather, it is intended to produce multiplicative factors which can be applied to other similarity/distance scores. It is particularly useful for comparing names when a reference list (vocabulary) of known names is available. For example, it can be configured to down-weight the similarity scores of distinct (known) names like "Roberto" and "Umberto" which are semantically very different, but deceptively similar in terms of edit distance. The normalized Levenshtein similarity for these two names is 75%, but their similarity can be reduced to 53% if multiplied by the score from this comparator using the default settings.
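
The figures quoted above can be checked by hand. The normalization used here is an assumption: the form \(2d / (|x| + |y| + d)\) is chosen because it reproduces the quoted 75%, whereas dividing the edit distance by the longer string length would give roughly 71%. With \(x\) = "Roberto", \(y\) = "Umberto" and Levenshtein distance \(d = 2\) (two substitutions):

\[
\text{sim}(x, y) = 1 - \frac{2d}{|x| + |y| + d} = 1 - \frac{4}{7 + 7 + 2} = 0.75,
\qquad
0.75 \times 0.7 = 0.525 \approx 53\%.
\]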

Examples

library(comparator)

## Compare names with possible typos using a reference of known names
known_names <- c("Roberto", "Umberto", "Alberto", "Emberto", "Norberto", "Humberto")
m1 <- InVocabulary(known_names)
m2 <- Levenshtein(similarity = TRUE, normalize = TRUE)
x <- "Emberto"
y <- c("Enberto", "Umberto")
# "Emberto" and "Umberto" are likely to refer to distinct people (since 
# they are known distinct names) so their Levenshtein similarity is 
# downweighted to 0.61. "Emberto" and "Enberto" may refer to the same 
# person (likely typo), so their Levenshtein similarity of 0.87 is not 
# downweighted.
similarities <- m1(x, y) * m2(x, y)
