Hamming: Hamming String/Sequence Comparator

Description

The Hamming distance between two strings/sequences of equal length is the number of positions where the corresponding characters/sequence elements differ. It can be viewed as a type of edit distance where the only permitted operation is substitution of characters/sequence elements.

Usage

Hamming(
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Value

A Hamming instance is returned, which is an object of an S4 class inheriting from StringComparator.

Arguments

normalize: a logical. If TRUE, distances/similarities are normalized to the unit interval. Defaults to FALSE.
similarity: a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.
ignore_case: a logical. If TRUE, case is ignored when comparing strings. Defaults to FALSE.
use_bytes: a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. Defaults to FALSE.
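
The effect of ignore_case and use_bytes can be illustrated with base R alone. The snippet below is only a sketch of what the two options mean, not the package's implementation; the helper chars() and all variable names are hypothetical.

```r
# Hypothetical helper: split a string into single characters.
chars <- function(s) strsplit(s, "")[[1]]

# Case handling: count differing positions with and without case folding.
x <- "ZIP-A1"
y <- "zip-a2"
d_case_sensitive <- sum(chars(x) != chars(y))                    # 5
d_case_folded    <- sum(chars(tolower(x)) != chars(tolower(y)))  # 1

# Characters vs. bytes: a non-ASCII character occupies several bytes in UTF-8.
u <- "\u00e9"                       # e with acute accent, stored as UTF-8
n_chars <- nchar(u, type = "chars") # 1 character
n_bytes <- length(charToRaw(u))     # 2 bytes
```

Because u has two bytes while "e" has one, a byte-by-byte comparison of the two would see sequences of unequal length, whereas a character-by-character comparison would see two single-character strings that differ in one position.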

Details

When the input strings/sequences $x$ and $y$ are of different lengths ($|x| \neq |y|$), the Hamming distance is defined to be $\infty$.

A Hamming similarity is returned if similarity = TRUE. When $|x| = |y|$ the similarity is defined as follows: $$\mathrm{sim}(x, y) = |x| - \mathrm{dist}(x, y),$$ where $\mathrm{sim}$ is the Hamming similarity and $\mathrm{dist}$ is the Hamming distance. When $|x| \neq |y|$ the similarity is defined to be 0.

Setting normalize = TRUE normalizes the Hamming distance/similarity to the unit interval: the raw distance/similarity is divided by the common length $|x| = |y|$ of the strings/sequences. If $|x| \neq |y|$ the normalized distance is defined to be 1, while the normalized similarity is defined to be 0.
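
The definitions in this section can be traced in a few lines of base R. This is a minimal sketch under the assumption of character-by-character, case-sensitive comparison; hamming_sketch() is a hypothetical helper written for illustration and is not the package's implementation. It does not treat the case of two empty strings, which the definitions above leave open.

```r
# Hypothetical base-R helper mirroring the definitions in Details.
hamming_sketch <- function(x, y, normalize = FALSE, similarity = FALSE) {
  # Split each string into single characters
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]

  if (length(xs) != length(ys)) {
    # Unequal lengths: similarity is 0; distance is Inf, or 1 when normalized
    if (similarity) return(0)
    return(if (normalize) 1 else Inf)
  }

  n <- length(xs)
  dist <- sum(xs != ys)                         # number of differing positions
  score <- if (similarity) n - dist else dist   # sim = |x| - dist
  if (normalize) score / n else score           # divide by the common length
}

d_raw   <- hamming_sketch("90001", "90209")                                      # 2
s_norm  <- hamming_sketch("90001", "90209", similarity = TRUE, normalize = TRUE) # 0.6
d_uneq  <- hamming_sketch("90001", "9020")                                       # Inf
dn_uneq <- hamming_sketch("90001", "9020", normalize = TRUE)                     # 1
```

For "90001" and "90209" the third and fifth positions differ, so the distance is 2, the similarity is 5 - 2 = 3, and the normalized similarity is 3 / 5 = 0.6.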

Examples

## Compare US ZIP codes
x <- "90001"
y <- "90209"
m1 <- Hamming()                                     # unnormalized distance
m2 <- Hamming(similarity = TRUE, normalize = TRUE)  # normalized similarity
m1(x, y)  # 2: the third and fifth digits differ
m2(x, y)  # 0.6: (5 - 2) / 5