stringdist

  • Approximate matching and string distance calculations for R.
  • All distance and matching operations are system- and encoding-independent.
  • Built for speed, using OpenMP for parallel computing.
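
As a rough illustration of the points above, the sketch below calls the stringdist function (listed further down) with its nthread argument and on a non-ASCII string. The input vectors are made up for the example, and whether more than one thread is actually used depends on how the package was built on your system.

```r
library(stringdist)

# Illustrative inputs (not from the package documentation).
a <- c("kitten", "sitting", "saturday")
b <- c("mitten", "fitting", "sunday")

# nthread sets the number of threads to use; the results
# should be the same regardless of the value chosen.
d1 <- stringdist(a, b, method = "lv", nthread = 1)
d2 <- stringdist(a, b, method = "lv", nthread = 2)
identical(d1, d2)

# By default characters are compared, not bytes, so a single
# accented letter counts as one substitution.
d_char <- stringdist("caf\u00e9", "cafe", method = "lv")
d_char
```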

The package offers the following main functions:

  • stringdist computes pairwise distances between two input character vectors (shorter one is recycled)
  • stringdistmatrix computes the distance matrix for one or two vectors
  • stringsim computes a string similarity between 0 and 1, based on stringdist
  • amatch is a fuzzy matching equivalent of R's native match function
  • ain is a fuzzy matching equivalent of R's native %in% operator
  • seq_dist, seq_distmatrix, seq_amatch and seq_ain compute distances between, and perform matching of, integer sequences.
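
The main functions listed above can be tried out as follows. This is a minimal sketch: the input strings and the maxDist value are arbitrary choices for illustration, and the Levenshtein method ("lv") is selected explicitly where the expected values in the comments depend on it.

```r
library(stringdist)

# Pairwise distances; the shorter argument is recycled.
d <- stringdist("kitten", c("sitting", "mitten"), method = "lv")
d  # 3 1

# Distance matrix: rows follow the first vector, columns the second.
m <- stringdistmatrix(c("foo", "bar"), c("foo", "baz", "boo"), method = "lv")
m

# Similarity between 0 and 1; identical strings score 1.
s <- stringsim("hello", c("hello", "hallo"), method = "lv")
s  # 1.0 0.8

# Fuzzy analogue of match(): index of the closest table entry
# whose distance does not exceed maxDist.
i <- amatch("leia", c("uhura", "leela"), maxDist = 3)
i  # 2

# Fuzzy analogue of %in%.
found <- ain("leia", c("uhura", "leela"), maxDist = 3)
found  # TRUE

# The seq_* functions take integer sequences instead of strings.
sq <- seq_dist(c(1L, 2L, 3L), c(1L, 2L, 4L), method = "lv")
sq  # 1
```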

These functions are built upon C code that re-implements some common (weighted) string distance functions. Distance functions include:

  • Hamming distance
  • Levenshtein distance (weighted)
  • Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment)
  • Full Damerau-Levenshtein distance
  • Longest Common Substring distance
  • Q-gram distance
  • Cosine distance for q-gram count vectors (= 1-cosine similarity)
  • Jaccard distance for q-gram count vectors (= 1-Jaccard similarity)
  • Jaro, and Jaro-Winkler distance
  • Soundex-based string distance
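
To make the differences between these distances concrete, the sketch below applies several of them to small hand-picked inputs through the method argument of stringdist. The example strings are illustrative assumptions; the values in the comments were worked out by hand for these inputs.

```r
library(stringdist)

# Edit-based distances on a transposed pair.
d_hamming <- stringdist("ab", "ba", method = "hamming")  # 2: both positions differ
d_lv      <- stringdist("ab", "ba", method = "lv")       # 2: two substitutions
d_osa     <- stringdist("ab", "ba", method = "osa")      # 1: one transposition
d_lcs     <- stringdist("ab", "ba", method = "lcs")      # 2: one deletion, one insertion

# Restricted (osa) versus full Damerau-Levenshtein (dl):
# only dl may edit a substring again after transposing it.
d_osa2 <- stringdist("ca", "abc", method = "osa")  # 3
d_dl2  <- stringdist("ca", "abc", method = "dl")   # 2

# q-gram based distances with q = 2:
# "abcd" has 2-grams ab, bc, cd; "abce" has ab, bc, ce.
d_qgram   <- stringdist("abcd", "abce", method = "qgram", q = 2)    # 2
d_jaccard <- stringdist("abcd", "abce", method = "jaccard", q = 2)  # 0.5
d_cosine  <- stringdist("abcd", "abce", method = "cosine", q = 2)   # 1/3

# Jaro distance (p = 0, the default) and Jaro-Winkler (p > 0).
d_jaro <- stringdist("martha", "marhta", method = "jw")
d_jw   <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Soundex-based distance: 0 when the soundex codes agree, 1 otherwise.
d_soundex <- stringdist("Robert", "Rupert", method = "soundex")  # 0
```

Note that the Jaro-Winkler variant gives a smaller distance than plain Jaro here because the two strings share a common prefix.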

Also, there are some utility functions:

  • qgrams() tabulates the qgrams in one or more character vectors.
  • seq_qgrams() tabulates the qgrams (sometimes called ngrams) in one or more integer vectors.
  • phonetic() computes phonetic codes of strings (currently only soundex)
  • printable_ascii() is a utility function that detects non-printable ascii or non-ascii characters.
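
A short sketch of the utility functions follows. The inputs are invented for illustration, and the comments describe the results only loosely, since the exact layout of the returned tables is best checked in the help pages.

```r
library(stringdist)

# Tabulate the 2-grams of a string: he, el, ll, lo, each occurring once.
tab <- qgrams("hello", q = 2)
tab

# The same idea for an integer sequence; here the 2-gram (2, 3) occurs twice.
seq_tab <- seq_qgrams(c(1L, 2L, 3L, 2L, 3L), q = 2)
seq_tab

# Soundex codes; both names map to the same code.
codes <- phonetic(c("Robert", "Rupert"))
codes  # "R163" "R163"

# TRUE for strings made up only of printable ASCII characters.
ok <- printable_ascii(c("abc", "caf\u00e9"))
ok  # TRUE FALSE
```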

C API

Some of stringdist's underlying C functions can be called directly from C code in other packages. The description of the API can be found by either typing ?stringdist_api in the R console or opening the vignette directly as follows:

vignette("stringdist_C-Cpp_api", package="stringdist")

Examples of packages that link to stringdist can be found here and here.

Resources

  • A paper on stringdist has been published in the R Journal.
  • Slides of a talk given at the useR! 2014 conference.

Install

install.packages('stringdist')

Monthly Downloads

45,027

Version

0.9.14

License

GPL-3

Last Published

December 10th, 2024

Functions in stringdist (0.9.14)

  • stringdist-package: A package for string distance calculation and approximate string matching.
  • stringdist-metrics: String metrics in stringdist
  • stringdist-parallelization: Multithreading and parallelization in stringdist
  • stringdist: Compute distance metrics between strings
  • stringsim: Compute similarity scores between strings
  • stringdist-encoding: String metrics in stringdist
  • afind: Stringdist-based fuzzy text search
  • stringdist_api: Calling stringdist from C or C++
  • seq_sim: Compute similarity scores between sequences of integers
  • seq_qgrams: Get a table of qgram counts for integer sequences
  • printable_ascii: Detect the presence of non-printable or non-ascii characters
  • qgrams: Get a table of qgram counts from one or more character vectors.
  • amatch: Approximate string matching
  • phonetic: Phonetic algorithms
  • seq_dist: Compute distance metrics between integer sequences
  • seq_amatch: Approximate matching for integer sequences.