
Approximate string matching equivalents of R's native match and %in%.
amatch(x, table, nomatch = NA_integer_, matchNA = TRUE, method = c("osa",
"lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0.1,
q = 1, p = 0, bt = 0, nthread = getOption("sd_num_thread"))

ain(x, table, ...)

x: Elements to be approximately matched. Will be coerced to character unless it is a list consisting of integer vectors.
table: Lookup table for matching. Will be coerced to character unless it is a list consisting of integer vectors.
nomatch: The value to be returned when no match is found. This is coerced to integer.
matchNA: Should NA values be matched? The default mimics the behaviour of base match, meaning that NA matches NA (see also the note on NA handling below).
method: Matching algorithm to use. See stringdist-metrics.
useBytes: Perform byte-wise comparison. See stringdist-encoding.
weight: For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.
maxDist: Elements in x will not be matched with elements of table if their distance is larger than maxDist. Note that the maximum distance between strings depends on the method: it should always be specified.
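As a rough illustration of why maxDist should be chosen per method, the sketch below compares the same strings under the default OSA distance (a count of edit operations) and the Jaro-Winkler distance (which lies between 0 and 1). The strings, the variable names and the threshold of 1 are arbitrary choices made for this illustration.

```r
library(stringdist)

x      <- "leia"
lookup <- c("uhura", "leela")   # hypothetical lookup table

# The same pairs measured on two very different scales
d_osa <- stringdist(x, lookup, method = "osa")  # edit-operation counts
d_jw  <- stringdist(x, lookup, method = "jw")   # values between 0 and 1

# maxDist = 1 is strict for OSA but accepts every candidate under 'jw',
# since a Jaro-Winkler distance never exceeds 1.
m_osa <- amatch(x, lookup, method = "osa", maxDist = 1)
m_jw  <- amatch(x, lookup, method = "jw",  maxDist = 1)
```

Under these assumptions a threshold that rejects both candidates for 'osa' still lets 'jw' return its closest candidate, so a maxDist tuned for one method does not carry over to another.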
q: q-gram size. Only used when method is 'qgram', 'jaccard', or 'cosine'.
p: Winkler's penalty parameter for the Jaro-Winkler distance. Only used with method='jw'.
bt: Winkler's boost threshold. Winkler's penalty factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.
nthread: Number of threads used by the underlying C code. A sensible default is chosen, see stringdist-parallelization.
...: Parameters to pass to amatch (except nomatch).
amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned. ain returns a logical vector of length length(x) indicating whether an element of x approximately matches an element in table.
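The return values described above can be checked with a small sketch. The inputs are made up for illustration: "ab" is at OSA distance 1 from both "ac" and "ad", so the tie is expected to resolve to the first position.

```r
library(stringdist)

lookup <- c("ac", "ad")   # hypothetical lookup table with two equally close entries

# Both entries are one substitution away from "ab"; the first one is returned
tie <- amatch("ab", lookup, maxDist = 1)

# ain yields one logical value per element of x
found <- ain(c("ab", "zzzz"), lookup, maxDist = 1)
```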
R's native match function matches NA with NA. This may feel inconsistent with R's usual NA handling, since for example NA==NA yields NA rather than TRUE. In most cases, one may reason about the behaviour under NA along the lines of "if one of the arguments is NA, the result shall be NA", simply because not all information necessary to execute the function is available. One uses special functions such as is.na, is.null etc. to handle special values.
The amatch function mimics the behaviour of match by default: NA is matched with NA and with nothing else. Note that this is inconsistent with the behaviour of stringdist, since stringdist yields NA when at least one of the arguments is NA. The same inconsistency exists between match and adist. In amatch this behaviour can be controlled by setting matchNA=FALSE. In that case, the nomatch value is returned for every element of x that is NA, regardless of whether NA is present in table. In match the behaviour can be controlled by setting the incomparables argument.
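The NA behaviour described in this note can be sketched as follows. The lookup vector and variable names are made up for illustration, and maxDist = 1 is an arbitrary choice.

```r
library(stringdist)

lookup <- c("a", NA)   # hypothetical lookup table containing an NA

# Base R: NA is matched with NA
base_default <- match(NA_character_, lookup)

# amatch mimics match by default
am_default <- amatch(NA_character_, lookup, maxDist = 1)

# With matchNA = FALSE the nomatch value is returned instead
am_strict <- amatch(NA_character_, lookup, maxDist = 1, matchNA = FALSE)

# stringdist propagates NA, as most of R does
d <- stringdist(NA_character_, "a")

# match can be told to treat NA as unmatchable
base_strict <- match(NA_character_, lookup, incomparables = NA)
```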
ain is currently defined as
ain <- function(x, table, ...) amatch(x, table, nomatch = 0, ...) > 0
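Assuming the installed version of stringdist still defines ain this way, the relationship can be checked directly. The vectors below are arbitrary illustration data.

```r
library(stringdist)

x      <- c("leia", "uhura")
lookup <- c("ripley", "leela", "scully", "trinity")

# ain should agree with testing amatch's result against nomatch = 0
via_ain    <- ain(x, lookup, maxDist = 2)
via_amatch <- amatch(x, lookup, nomatch = 0, maxDist = 2) > 0

same <- identical(via_ain, via_amatch)
```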
# NOT RUN {
# let's see which sci-fi heroes are stringdistantly nearest
amatch("leia",c("uhura","leela"),maxDist=5)
# we can restrict the search
amatch("leia",c("uhura","leela"),maxDist=1)
# we can match each value in the find vector against values in the lookup table:
amatch(c("leia","uhura"),c("ripley","leela","scully","trinity"),maxDist=2)
# setting nomatch returns a different value when no match is found
amatch("leia",c("uhura","leela"),maxDist=1,nomatch=0)
# this is always true if maxDist is Inf
ain("leia",c("uhura","leela"),maxDist=Inf)
# Let's look within a neighbourhood of at most 2 typos (by default, the OSA algorithm is used)
ain("leia",c("uhura","leela"), maxDist=2)
# }