Compares a pair of strings/sequences x and y based on the number of greedily-aligned characters/sequence elements and the number of transpositions. It was developed for comparing names at the U.S. Census Bureau.
Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)
A Jaro instance is returned, which is an S4 class inheriting from StringComparator.
similarity: a logical. If TRUE (default), similarity scores are returned; otherwise distances are returned (see the definition under Details).
ignore_case: a logical. If TRUE, case is ignored when comparing strings.
use_bytes: a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.
For simplicity we assume x and y are strings in this section; however, the comparator is also implemented for more general sequences.
When similarity = TRUE (default), the Jaro similarity is computed as
$$\mathrm{sim}(x, y) = \frac{1}{3}\left(\frac{m}{|x|} + \frac{m}{|y|} + \frac{m - \lfloor \frac{t}{2} \rfloor}{m}\right)$$
where \(m\) is the number of "matching" characters (defined below),
\(t\) is the number of "transpositions", and \(|x|,|y|\) are the
lengths of the strings \(x\) and \(y\). The similarity takes on values
in the range \([0, 1]\), where 1 corresponds to a perfect match.
The number of "matching" characters \(m\) is computed using a greedy alignment algorithm. The algorithm iterates over the characters in \(x\), attempting to align the \(i\)-th character \(x_i\) with the first matching character in \(y\). When looking for matching characters in \(y\), the algorithm only considers previously un-matched characters within a window \([\max(0, i - w), \min(|y|, i + w)]\) where \(w = \left\lfloor \frac{\max(|x|, |y|)}{2} \right\rfloor - 1\). The alignment process yields a subsequence of matching characters from \(x\) and \(y\). The number of "transpositions" \(t\) is defined to be the number of positions in the subsequence of \(x\) which are misaligned with the corresponding position in \(y\).
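The greedy alignment described above can be sketched in base R. This is an illustrative sketch only, not the package's implementation: `jaro_sketch` is a hypothetical helper, it handles single strings only, it uses 1-based indices (so the window is clamped to \([1, |y|]\)), and it assumes a similarity of 0 when there are no matching characters, a case the formula leaves undefined.

```r
# Illustrative sketch of the Jaro similarity as defined in this section.
# jaro_sketch is a hypothetical helper, not part of the comparator package.
jaro_sketch <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  nx <- length(xs)
  ny <- length(ys)
  if (nx == 0 || ny == 0) return(0)  # assumption: no characters, no matches

  # Half-width of the search window; clamped at 0 for very short strings.
  w <- max(floor(max(nx, ny) / 2) - 1, 0)

  x_used <- logical(nx)  # characters of x that found a partner
  y_used <- logical(ny)  # characters of y already aligned

  for (i in seq_len(nx)) {
    lo <- max(1, i - w)
    hi <- min(ny, i + w)
    if (lo > hi) next
    for (j in lo:hi) {
      # Take the first previously unmatched, equal character in the window.
      if (!y_used[j] && xs[i] == ys[j]) {
        x_used[i] <- TRUE
        y_used[j] <- TRUE
        break
      }
    }
  }

  m <- sum(x_used)
  if (m == 0) return(0)  # assumption: similarity is 0 without any matches

  # Transpositions: positions where the two matched subsequences disagree.
  t <- sum(xs[x_used] != ys[y_used])

  (m / nx + m / ny + (m - floor(t / 2)) / m) / 3
}

jaro_sketch("Martha", "Mathra")
```

Because the alignment is greedy, each character of \(x\) takes the first available partner in its window, so the result can depend on the order in which characters are visited.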
When similarity = FALSE, the Jaro distance is computed as
$$\mathrm{dist}(x,y) = 1 - \mathrm{sim}(x,y).$$
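As a worked illustration of the two formulas, take \(x = \) "Martha" and \(y = \) "Mathra". Applying the definitions in this section by hand, all six characters are matched within the window \(w = 2\), so \(m = 6\), and the matched subsequences "Martha" and "Mathra" disagree in three positions, so \(t = 3\). These values come from applying the definitions above rather than from the package's output.

$$\mathrm{sim}(x, y) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - \lfloor \frac{3}{2} \rfloor}{6}\right) = \frac{17}{18} \approx 0.944, \qquad \mathrm{dist}(x, y) = 1 - \frac{17}{18} = \frac{1}{18} \approx 0.056.$$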
Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.
The JaroWinkler comparator modifies the Jaro comparator by boosting the similarity score for strings/sequences that have matching prefixes.
## Compare names
Jaro()("Martha", "Mathra")
Jaro()("Eileen", "Phyllis")