Usage
diff_align(text1 = NULL, text2 = NULL, tokenizer = NULL, ignore = NULL, clean = NULL, distance = c("lv", "osa", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0, q = 1, p = 0, nthread = getOption("sd_num_thread"), verbose = TRUE, ...)
Arguments
tokenizer
defaults to NULL, which triggers linewise tokenization;
accepts a function that turns a text into a token data frame;
a token data frame has at least three columns:
from (first character of token),
to (last character of token), and
token (the token)
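A custom tokenizer might look like the following minimal sketch. The name word_tokenizer is hypothetical, and the sketch assumes that the tokenizer receives a single character string and that from and to are 1-based character positions; neither assumption is confirmed by this page.

```r
# Hypothetical word tokenizer (not part of the package): splits a single
# string on whitespace and returns a token data frame with the three
# required columns from, to, and token.
word_tokenizer <- function(text) {
  # locate runs of non-whitespace characters
  m <- gregexpr("[^[:space:]]+", text)[[1]]
  if (m[1] == -1) {
    # no tokens found: return an empty data frame of the same form
    return(data.frame(from = integer(0), to = integer(0),
                      token = character(0), stringsAsFactors = FALSE))
  }
  from <- as.integer(m)                       # assumed 1-based start position
  to   <- from + attr(m, "match.length") - 1L # assumed 1-based end position
  data.frame(from = from, to = to,
             token = substring(text, from, to),
             stringsAsFactors = FALSE)
}

tokens <- word_tokenizer("the quick  brown fox")
```

Such a function would then be supplied as tokenizer = word_tokenizer.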
ignore
defaults to NULL, which means that nothing is ignored;
accepts a function that takes a token data frame (see above) and returns a
possibly subsetted data frame of the same form
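As an illustration, an ignore function could drop tokens that carry no visible content. This is a sketch only; ignore_blank and the example data frame are hypothetical and not taken from the package.

```r
# Hypothetical ignore function (not part of the package): removes tokens
# that are empty or consist of whitespace only, keeping all columns.
ignore_blank <- function(token_df) {
  keep <- grepl("[^[:space:]]", token_df$token)
  token_df[keep, , drop = FALSE]
}

# made-up token data frame for demonstration
token_df <- data.frame(from = c(1L, 5L, 6L),
                       to = c(4L, 5L, 9L),
                       token = c("line", " ", "two!"),
                       stringsAsFactors = FALSE)

kept <- ignore_blank(token_df)
```

The result has the same columns as the input, only with fewer rows, which is the form the argument description asks for.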
clean
defaults to NULL, which means that nothing is cleaned; accepts a
function that takes a vector of tokens and returns a vector of the same
length, potentially cleaned up
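A clean function could, for example, normalize case and surrounding whitespace before tokens are compared. The helper name clean_tokens is hypothetical; the sketch relies only on base R.

```r
# Hypothetical clean function (not part of the package): trims leading and
# trailing whitespace and lower-cases each token. The output has the same
# length as the input.
clean_tokens <- function(tokens) {
  tolower(trimws(tokens))
}

cleaned <- clean_tokens(c("  Hello", "WORLD ", "foo"))
```

Because both trimws() and tolower() are vectorized, the returned vector always has one element per input token.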
weight
For method='osa' or 'dl', the penalty for deletion, insertion,
substitution and transposition, in that order. When method='lv', the
penalty for transposition is ignored. When method='jw', the weights
associated with characters of a, characters from b and the transposition
weight, in that order. Weights must be positive and not exceed 1. weight
is ignored completely when method='hamming', 'qgram', 'cosine',
'jaccard', 'lcs', or 'soundex'.
maxDist
[DEPRECATED AND WILL BE REMOVED|2016] Currently kept for
backward compatibility. It does not offer any speed gain. (In fact, it
currently slows things down when set to anything different from Inf.)
q
Size of the q-gram; must be nonnegative. Only applies to
method='qgram', 'jaccard' or 'cosine'.
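To make the role of q concrete, the sketch below counts the overlapping q-grams of a string in base R. It only illustrates what a q-gram profile is; qgram_profile is a hypothetical helper and is not how the underlying distance function is implemented.

```r
# Hypothetical helper (not part of the package): tabulates all overlapping
# substrings of length q in a single string x.
# Simplification: q < 1 or a string shorter than q yields an empty table.
qgram_profile <- function(x, q = 1) {
  n <- nchar(x)
  if (q < 1 || n < q) return(table(character(0)))
  starts <- seq_len(n - q + 1)
  table(substring(x, starts, starts + q - 1))
}

profile <- qgram_profile("banana", q = 2)
```

With q = 2 the string "banana" is split into the 2-grams ba, an, na, an, na, so larger values of q compare longer character sequences and are more sensitive to character order.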
p
Penalty factor for Jaro-Winkler distance. The valid range for
p is 0 <= p <= 0.25. If p=0 (default), the
Jaro-distance is returned. Applies only to method='jw'.
verbose
whether the function should report on its progress via messages
...
further arguments passed through to the distance function