comparator (version 0.1.3)

Hamming: Hamming String/Sequence Comparator

Description

The Hamming distance between two strings/sequences of equal length is the number of positions where the corresponding characters/sequence elements differ. It can be viewed as a type of edit distance where the only permitted operation is substitution of characters/sequence elements.

Usage

Hamming(
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Value

A Hamming instance is returned, which is an S4 class inheriting from StringComparator.

Arguments

normalize

a logical. If TRUE, distances/similarities are normalized to the unit interval. Defaults to FALSE.

similarity

a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.

ignore_case

a logical. If TRUE, case is ignored when comparing strings. Defaults to FALSE.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. Defaults to FALSE.

Details

When the input strings/sequences \(x\) and \(y\) are of different lengths (\(|x| \neq |y|\)), the Hamming distance is defined to be \(\infty\).

A Hamming similarity is returned if similarity = TRUE. When \(|x| = |y|\) the similarity is defined as follows: $$\mathrm{sim}(x, y) = |x| - \mathrm{dist}(x, y),$$ where \(\mathrm{sim}\) is the Hamming similarity and \(\mathrm{dist}\) is the Hamming distance. When \(|x| \neq |y|\) the similarity is defined to be 0.

Normalization of the Hamming distance/similarity to the unit interval is also supported by setting normalize = TRUE. When \(|x| = |y|\), the raw distance/similarity is divided by the common length \(|x|\) of the strings/sequences. When \(|x| \neq |y|\), the normalized distance is defined to be 1 and the normalized similarity is defined to be 0.
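
The definitions above can be sketched in base R. This is an illustration only, not the package's implementation: hamming_sketch is a hypothetical helper that handles a single pair of strings, compares them character by character, and ignores the ignore_case and use_bytes options. Zero-length inputs are not covered by the definitions above and are not handled here.

```r
# Hypothetical helper illustrating the definitions in Details.
# Not part of the comparator package.
hamming_sketch <- function(x, y, normalize = FALSE, similarity = FALSE) {
  # Split each string into its individual characters
  xs <- strsplit(x, "", fixed = TRUE)[[1]]
  ys <- strsplit(y, "", fixed = TRUE)[[1]]

  if (length(xs) != length(ys)) {
    # Unequal lengths: similarity is 0 (raw or normalized);
    # distance is Inf, or 1 when normalized
    if (similarity) return(0)
    return(if (normalize) 1 else Inf)
  }

  n <- length(xs)          # common length |x| = |y|
  d <- sum(xs != ys)       # number of differing positions
  score <- if (similarity) n - d else d
  if (normalize) score <- score / n
  score
}

hamming_sketch("90001", "90209")
hamming_sketch("90001", "90209", normalize = TRUE, similarity = TRUE)
hamming_sketch("abc", "ab")
```

Keeping the unequal-length branch separate from the arithmetic makes it explicit that those values are fixed by definition rather than computed.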

See Also

Other edit-based comparators include LCS, Levenshtein, OSA and DamerauLevenshtein.

Examples

## Compare US ZIP codes
x <- "90001"
y <- "90209"
m1 <- Hamming()                                     # unnormalized distance
m2 <- Hamming(similarity = TRUE, normalize = TRUE)  # normalized similarity
m1(x, y)
m2(x, y)
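
As a hand check of this example against the definitions in Details (worked by hand, not captured package output): the two ZIP codes have equal length 5 and differ at positions 3 and 5, so

$$\mathrm{dist}(x, y) = 2, \qquad \frac{\mathrm{sim}(x, y)}{|x|} = \frac{|x| - \mathrm{dist}(x, y)}{|x|} = \frac{5 - 2}{5} = 0.6.$$

Under those definitions, m1(x, y) should therefore give 2 and m2(x, y) should give 0.6.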