seq_dist: Compute distance metrics between integer sequences

Description

seq_dist computes pairwise string distances between elements of a and b, where the argument with less elements is recycled. seq_distmatrix computes the distance matrix with rows according to a and columns according to b.

Usage

seq_dist(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)
seq_distmatrix(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  useNames = c("names", "none"),
  nthread = getOption("sd_num_thread")
)

Arguments

(list of) integer or numeric vector(s). Will be converted with as.integer (target)

(list of) integer or numeric vector(s). Will be converted with as.integer (source). Optional for seq_distmatrix.

method

Distance metric. See stringdist-metrics

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'Jaccard', or 'lcs'

Size of the \(q\)-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.

Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p=0 (default), the Jaro-distance is returned. Applies only to method='jw'.

Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt Applies only to method='jw' and p>0.

nthread

Maximum number of threads to use. By default, a sensible number of threads is chosen, see stringdist-parallelization.

useNames

label the output matrix with names(a) and names(b)?

Value

seq_dist returns a numeric vector with pairwise distances between a and b of length max(length(a),length(b).

For seq_distmatrix there are two options. If b is missing, the dist object corresponding to the length(a) X length(a) distance matrix is returned. If b is specified, the length(a) X length(b) distance matrix is returned.

If any element of a or b is NA_integer_, the distance with any matched integer vector will result in NA. Missing values in the sequences themselves are treated as a number and not treated specially (Also see the examples).

Notes

Input vectors are converted with as.integer. This causes truncation for numeric vectors (e.g. pi will be treated as 3L).

Examples

Run this code

# NOT RUN {
# Distances between lists of integer vectors. Note the postfix 'L' to force 
# integer storage. The shorter argument is recycled over (\code{a})
a <- list(c(102L, 107L))                        # fu
b <- list(c(102L,111L,111L),c(102L,111L,111L))  # foo, fo
seq_dist(a,b)

# translate strings to a list of integer sequences 
a <- lapply(c("foo","bar","baz"),utf8ToInt)
seq_distmatrix(a)

# Note how missing values are treated. NA's as part of the sequence are treated 
# as an integer (the representation of NA_integer_).
a <- list(NA_integer_,c(102L, 107L))
b <- list(c(102L,111L,111L),c(102L,111L,NA_integer_))  
seq_dist(a,b)

# }
# NOT RUN {
# Distance between sentences based on word order. Note: words must match exactly or they
# are treated as completely different.
#
# For this example you need to have the 'hashr' package installed.
a <- "Mary had a little lamb"
a.words <- strsplit(a,"[[:blank:]]+")
a.int <- hashr::hash(a.words)
b <- c("a little lamb had Mary",
           "had Mary a little lamb")
b.int <- hashr::hash(strsplit(b,"[[:blank:]]+"))
seq_dist(a.int,b.int)
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab