adist(x, y = NULL, costs = NULL, counts = FALSE, fixed = TRUE,
partial = !fixed, ignore.case = FALSE, useBytes = FALSE)
NULL
(default) indicating
taking x
as y
.NULL
(default) indicating using unit cost for all three
possible transformations."counts"
attribute of the return
value.TRUE
(default), the x
elements are used as string literals. Otherwise, they are taken as
regular expressions and partial = TRUE
is implied
(corresponding to the approximate string distance used by
agrep
with fixed = FALSE
.x
elements must exactly match the complete y
elements, or only
substrings of these. The latter corresponds to the approximate
string distance used by agrep
(by default).TRUE
, case is ignored for
computing the distances.TRUE
distance computations are
done byte-by-byte rather than character-by-character.x
and y
, with rows and columns corresponding to x
and y
, respectively. If counts
is TRUE
, the transformation counts are
returned as the "counts"
attribute of this matrix, as a
3-dimensional array with dimensions corresponding to the elements of
x
, the elements of y
, and the type of transformation
(insertions, deletions and substitutions), respectively.
Additionally, if partial = FALSE
, the transformation sequences
are returned as the "trafos"
attribute of the return value, as
character strings with elements M, I, D and
S indicating a match, insertion, deletion and substitution,
respectively. If partial = TRUE
, the offsets (positions of
the first and last element) of the matched substrings are returned as
the "offsets"
attribute of the return value (with both offsets
\(-1\) in case of no match).partial = FALSE
, currently using
a dynamic programming algorithm (see, e.g.,
https://en.wikipedia.org/wiki/Levenshtein_distance) with space
and time complexity \(O(mn)\), where \(m\) and \(n\) are the
lengths of s and t, respectively. Additionally computing
the transformation sequence and counts is \(O(\max(m, n))\). The generalized Levenshtein distance can also be used for approximate
(fuzzy) string matching, in which case one finds the substring of
t with minimal distance to the pattern s (which could be
taken as a regular expression, in which case the principle of using
the leftmost and longest match applies), see, e.g.,
https://en.wikipedia.org/wiki/Approximate_string_matching. This
distance is computed for partial = TRUE
using tre by
Ville Laurikari (http://laurikari.net/tre/) and
corresponds to the distance used by agrep
. In this
case, the given cost values are coerced to integer. Note that the costs for insertions and deletions can be different, in
which case the distance between s and t can be different
from the distance between t and s.agrep
for approximate string matching (fuzzy matching)
using the generalized Levenshtein distance.## Cf. https://en.wikipedia.org/wiki/Levenshtein_distance
adist("kitten", "sitting")
## To see the transformation counts for the Levenshtein distance:
drop(attr(adist("kitten", "sitting", counts = TRUE), "counts"))
## To see the transformation sequences:
attr(adist(c("kitten", "sitting"), counts = TRUE), "trafos")
## Cf. the examples for agrep:
adist("lasy", "1 lazy 2")
## For a "partial approximate match" (as used for agrep):
adist("lasy", "1 lazy 2", partial = TRUE)
Run the code above in your browser using DataLab