adist(x, y = NULL, costs = NULL, counts = FALSE, fixed = TRUE, partial = !fixed, ignore.case = FALSE, useBytes = FALSE)
NULL
(default) indicating
taking x
as y
.NULL
(default) indicating using unit cost for all three
possible transformations."counts"
attribute of the return
value.TRUE
(default), the x
elements are used as string literals. Otherwise, they are taken as
regular expressions and partial = TRUE
is implied
(corresponding to the approximate string distance used by
agrep
with fixed = FALSE
.x
elements must exactly match the complete y
elements, or only
substrings of these. The latter corresponds to the approximate
string distance used by agrep
(by default).TRUE
, case is ignored for
computing the distances.TRUE
distance computations are
done byte-by-byte rather than character-by-character.x
and y
, with rows and columns corresponding to x
and y
, respectively.If counts
is TRUE
, the transformation counts are
returned as the "counts"
attribute of this matrix, as a
3-dimensional array with dimensions corresponding to the elements of
x
, the elements of y
, and the type of transformation
(insertions, deletions and substitutions), respectively.
Additionally, if partial = FALSE
, the transformation sequences
are returned as the "trafos"
attribute of the return value, as
character strings with elements M, I, D and
S indicating a match, insertion, deletion and substitution,
respectively. If partial = TRUE
, the offsets (positions of
the first and last element) of the matched substrings are returned as
the "offsets"
attribute of the return value (with both offsets
$-1$ in case of no match).
partial = FALSE
, currently using
a dynamic programming algorithm (see, e.g.,
https://en.wikipedia.org/wiki/Levenshtein_distance) with space
and time complexity $O(mn)$, where $m$ and $n$ are the
lengths of s and t, respectively. Additionally computing
the transformation sequence and counts is $O(\max(m, n))$. The generalized Levenshtein distance can also be used for approximate
(fuzzy) string matching, in which case one finds the substring of
t with minimal distance to the pattern s (which could be
taken as a regular expression, in which case the principle of using
the leftmost and longest match applies), see, e.g.,
https://en.wikipedia.org/wiki/Approximate_string_matching. This
distance is computed for partial = TRUE
using tre by
Ville Laurikari (http://laurikari.net/tre/) and
corresponds to the distance used by agrep
. In this
case, the given cost values are coerced to integer.
Note that the costs for insertions and deletions can be different, in which case the distance between s and t can be different from the distance between t and s.
agrep
for approximate string matching (fuzzy matching)
using the generalized Levenshtein distance.
## Cf. https://en.wikipedia.org/wiki/Levenshtein_distance
adist("kitten", "sitting")
## To see the transformation counts for the Levenshtein distance:
drop(attr(adist("kitten", "sitting", counts = TRUE), "counts"))
## To see the transformation sequences:
attr(adist(c("kitten", "sitting"), counts = TRUE), "trafos")
## Cf. the examples for agrep:
adist("lasy", "1 lazy 2")
## For a "partial approximate match" (as used for agrep):
adist("lasy", "1 lazy 2", partial = TRUE)
Run the code above in your browser using DataLab