afind slides a window of fixed width over a string x and computes the distance between each window and the sought-after pattern. The location, content, and distance corresponding to the window with the best match are returned.
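The sliding-window idea described above can be sketched in base R. This is an illustration only, not the package's implementation: sliding_best_match is a hypothetical helper, and it scores windows with utils::adist (generalized Levenshtein distance) rather than any of the stringdist methods.

```r
# Hypothetical helper, not part of stringdist: slide a window of fixed
# width over one string and score every window against the pattern.
sliding_best_match <- function(x, pattern, window = nchar(pattern)) {
  n <- nchar(x)
  # If x is no longer than the window, there is a single window.
  starts <- if (n <= window) 1L else seq_len(n - window + 1L)
  windows <- substring(x, starts, starts + window - 1L)
  # Assumption: Levenshtein distance from base R stands in for the metric.
  d <- as.vector(utils::adist(windows, pattern))
  best <- which.min(d)  # first window with the smallest distance
  list(location = starts[best], distance = d[best], match = windows[best])
}

res <- sliding_best_match("I want to be a fisherman", "fish")
res$location
res$match
```

The result mirrors the three pieces of information named in the description (location, distance, match), but for a single string and a single pattern only.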
afind(
x,
pattern,
window = NULL,
value = TRUE,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine",
"jaccard", "jw", "soundex"),
useBytes = FALSE,
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)

grab(x, pattern, maxDist = Inf, value = FALSE, ...)
grabl(x, pattern, maxDist = Inf, ...)
extract(x, pattern, maxDist = Inf, ...)
x: Strings to search in.
pattern: Strings to find (not a regular expression). For grab, grabl, and extract this must be a single string.
window: Width of the moving window.
value: Toggle returning a matrix with the matched strings.
method: Matching algorithm to use. See stringdist-metrics.
useBytes: Perform byte-wise comparison. See stringdist-encoding.
weight: For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters of b, and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.
q: q-gram size. Used only when method is 'qgram', 'jaccard', or 'cosine'.
p: Winkler's 'prefix' parameter for the Jaro-Winkler distance, with \(0\leq p\leq0.25\). Used only when method is 'jw'.
bt: Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.
nthread: Number of threads used by the underlying C code. A sensible default is chosen; see stringdist-parallelization.
maxDist: Only windows with distance <= maxDist are considered a match.
...: Passed to afind.
For afind: a list of three matrices, each with length(x) rows and length(pattern) columns. In each matrix, element \((i,j)\) corresponds to x[i] and pattern[j]. The names and descriptions of the matrices are as follows.
location. [integer], location of the start of the best matching window. When useBytes=FALSE, this corresponds to the location of a UTF code point in x, possibly after conversion from its original encoding.
distance. [numeric], the string distance between pattern and the best matching window.
match. [character], the first, best matching window.
For grab, an integer vector indicating in which elements of x a match was found with a distance <= maxDist, or the matched values when value=TRUE (comparable to grep).
For grabl, a logical vector indicating in which elements of x a match was found with a distance <= maxDist (comparable to grepl).
For extract, a character matrix with length(x) rows and length(pattern) columns. If a match was found, element \((i,j)\) contains the match; otherwise it is set to NA.
The "running_cosine" method gains efficiency by exploiting the fact that two consecutive windows have a large overlap in their q-gram profiles. It gives the same result as the "cosine" distance, but is much faster.
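The overlap between consecutive windows can be illustrated with a small base R sketch. It is not the package's C implementation; qgrams, running_profiles, and direct_profile are hypothetical helpers, and the sketch assumes width >= q and nchar(x) >= width. Moving the window one position to the right removes exactly one q-gram from the profile and adds exactly one, so the profile can be updated instead of recounted.

```r
# Hypothetical helpers for illustration; not stringdist internals.
qgrams <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1L)
  substring(s, starts, starts + q - 1L)
}

# Update the q-gram counts incrementally from one window to the next.
running_profiles <- function(x, width, q) {
  grams <- qgrams(x, q)                 # all q-grams of x, in order
  per_window <- width - q + 1L          # number of q-grams in one window
  n_windows <- nchar(x) - width + 1L
  keys <- sort(unique(grams))
  counts <- stats::setNames(integer(length(keys)), keys)
  for (g in grams[seq_len(per_window)]) counts[g] <- counts[g] + 1L
  out <- vector("list", n_windows)
  out[[1]] <- counts
  for (k in seq_len(n_windows)[-1]) {
    leaving <- grams[k - 1L]                # q-gram dropped on the left
    entering <- grams[k + per_window - 1L]  # q-gram gained on the right
    counts[leaving] <- counts[leaving] - 1L
    counts[entering] <- counts[entering] + 1L
    out[[k]] <- counts
  }
  out
}

# Recount the profile of window k from scratch, for comparison.
direct_profile <- function(x, k, width, q, keys) {
  tab <- table(qgrams(substr(x, k, k + width - 1L), q))
  counts <- stats::setNames(integer(length(keys)), keys)
  counts[names(tab)] <- as.integer(tab)
  counts
}

x <- "one of the harvesters of the sea"
run <- running_profiles(x, width = 9L, q = 3L)
same <- vapply(seq_along(run), function(k) {
  identical(run[[k]], direct_profile(x, k, 9L, 3L, names(run[[k]])))
}, logical(1))
all(same)
```

Each update touches two counts regardless of the window width, whereas recounting costs work proportional to the window width; that difference is the source of the speed-up the text refers to.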
Matching is case-sensitive. Both x and pattern are converted to UTF-8 prior to search, unless useBytes=TRUE, in which case the distances are measured bytewise.
Code is parallelized over the x variable: each value of x is scanned for every element in pattern using a separate thread (when nthread is larger than 1).
The functions grab and grabl are approximate string matching functions that somewhat resemble base R's grep and grepl. They are implemented as convenience wrappers of afind.
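How a grepl-like and a grep-like function can be layered on a best-window distance is sketched below in base R. The names best_window_distance, grabl_sketch, and grab_sketch are hypothetical, the window width is assumed to equal nchar(pattern), and utils::adist (Levenshtein) stands in for the package's distance methods. Only the index-returning behaviour is covered, not value=TRUE.

```r
# Hypothetical helper: smallest distance between the pattern and any
# window of width nchar(pattern), for every element of x.
best_window_distance <- function(x, pattern) {
  w <- nchar(pattern)
  vapply(x, function(s) {
    starts <- seq_len(max(nchar(s) - w + 1L, 1L))
    min(utils::adist(substring(s, starts, starts + w - 1L), pattern))
  }, numeric(1), USE.NAMES = FALSE)
}

# grepl-like wrapper: TRUE where the best window is within maxDist.
grabl_sketch <- function(x, pattern, maxDist = Inf) {
  stopifnot(length(pattern) == 1L)
  best_window_distance(x, pattern) <= maxDist
}

# grep-like wrapper: indices of the elements that match.
grab_sketch <- function(x, pattern, maxDist = Inf) {
  which(grabl_sketch(x, pattern, maxDist))
}

texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")
hits <- grabl_sketch(texts, "grew", maxDist = 1)
idx <- grab_sketch(texts, "grew", maxDist = 1)
```

Both wrappers reduce to one comparison against maxDist, which is why they add little on top of the window search itself.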
Other matching: amatch()
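library(stringdist)

texts = c("When I grow up, I want to be"
  , "one of the harvesters of the sea"
  , "I think before my days are gone"
  , "I want to be a fisherman")
patterns = c("fish", "gone", "to be")

# best matching window in every text, for every pattern
afind(texts, patterns, method="running_cosine", q=3)

# which texts contain a window within distance 1 of "grew"?
grabl(texts, "grew", maxDist=1)

# extract the best matching window, NA where the distance exceeds 3
extract(texts, "harvested", maxDist=3)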