Learn R Programming

comparator (version 0.1.3)

OSA: Optimal String Alignment (OSA) String/Sequence Comparator

Description

The Optimal String Alignment (OSA) distance between two strings/sequences \(x\) and \(y\) is the minimum cost of operations (insertions, deletions, substitutions or transpositions) required to transform \(x\) into \(y\), subject to the constraint that no substring/subsequence is edited more than once.

Usage

OSA(
  deletion = 1,
  insertion = 1,
  substitution = 1,
  transposition = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Value

An OSA instance is returned, which is an S4 class inheriting from StringComparator.

Arguments

deletion

positive cost associated with deletion of a character or sequence element. Defaults to unit cost.

insertion

positive cost associated insertion of a character or sequence element. Defaults to unit cost.

substitution

positive cost associated with substitution of a character or sequence element. Defaults to unit cost.

transposition

positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost.

normalize

a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE.

similarity

a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.

ignore_case

a logical. If TRUE, case is ignored when comparing strings.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.

Details

For simplicity we assume x and y are strings in this section, however the comparator is also implemented for more general sequences.

An OSA similarity is returned if similarity = TRUE, which is defined as $$\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},$$ where \(|x|\), \(|y|\) are the number of characters in \(x\) and \(y\) respectively, \(dist\) is the OSA distance, \(w_d\) is the cost of a deletion and \(w_i\) is the cost of an insertion.

Normalization of the OSA distance/similarity to the unit interval is also supported by setting normalize = TRUE. The normalization approach follows Yujian and Bo (2007), and ensures that the distance remains a metric when the costs of insertion \(w_i\) and deletion \(w_d\) are equal. The normalized distance \(\mathrm{dist}_n\) is defined as $$\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},$$ and the normalized similarity \(\mathrm{sim}_n\) is defined as $$\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.$$

References

Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.

Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29: 1091–1095.

See Also

Other edit-based comparators include Hamming, LCS, Levenshtein and DamerauLevenshtein.

Examples

Run this code
## Compare strings with a transposition error
x <- "plauge"; y <- "plague"
OSA()(x, y) != Levenshtein()(x, y)

## Unlike Damerau-Levenshtein, OSA does not allow a substring to be 
## edited more than once
x <- "ABC"; y <- "CA"
OSA()(x, y) != DamerauLevenshtein()(x, y)

## Compare car names using normalized OSA similarity
data(mtcars)
cars <- rownames(mtcars)
pairwise(OSA(similarity = TRUE, normalize=TRUE), cars)

Run the code above in your browser using DataLab