Each method for calculating distance is expressed as a function \(d(x, y)\), where \(x\) and \(y\) are a pair of columns (if by.col = TRUE) or rows in the matrix, \(n\) is the number of comparable rows (if by.col = TRUE) or columns between them, and \(i\) indexes any one of those rows (if by.col = TRUE) or columns.
The different methods are:
"hamming"
The relative distance between characters. This is equal to the Gower distance for non-numeric comparisons (e.g. character tokens; Gower 1966).
\(d(x,y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{n}\)
"manhattan"
The "raw" distance between characters:
\(d(x,y) = \sum_{i=1}^{n} |x_i - y_i|\)
"comparable"
The number of comparable characters (i.e. the number of tokens that can be compared):
\(d(x,y) = \sum_{i=1}^{n} \frac{x_i - y_i}{x_i - y_i}\)
"euclidean"
The euclidean distance between characters:
\(d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\)
"maximum"
The maximum distance between characters:
\(d(x,y) = \max_{i} |x_i - y_i|\)
"mord"
The maximum observable distance between characters (Lloyd 2016):
\(d(x,y) = \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \frac{x_i - y_i}{x_i - y_i}}\)
"none"
Returns the matrix with any converted and/or translated tokens.
"binary"
Returns the matrix with the binary characters.
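The formulas above can be read as plain vector operations. The following is a minimal base R sketch, not the package's implementation: pairwise_distances is a hypothetical helper, the input vectors are made-up numeric tokens, and only five of the methods are reproduced. It takes the formulas literally and ignores any token translation.

```r
# Hypothetical helper (not part of the package): applies the formulas
# literally to two numeric vectors of character tokens.
pairwise_distances <- function(x, y) {
  # Keep only the positions where both vectors have a value
  keep <- !is.na(x) & !is.na(y)
  x <- x[keep]
  y <- y[keep]
  n <- length(x)              # number of comparable positions
  abs_diff <- abs(x - y)
  c(
    hamming    = sum(abs_diff) / n,
    manhattan  = sum(abs_diff),
    comparable = n,
    euclidean  = sqrt(sum((x - y)^2)),
    maximum    = max(abs_diff)
  )
}

# Made-up example data
x <- c(0, 1, 2, 1)
y <- c(0, 2, 0, 1)
distances <- pairwise_distances(x, y)
print(distances)
```

Here the absolute differences are 0, 1, 2 and 0, so the "manhattan" value is 3 and the "hamming" value is 3/4.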
When using translate = TRUE, the characters are translated following the xyz notation, where the first token is translated to 1, the second to 2, etc. For example, the character 0, 2, 1, 0 is translated to 1, 2, 3, 1. In other words, when translate = TRUE, the character tokens are not interpreted as numeric values. When using translate = TRUE, scaled metrics (i.e. "hamming" and "gower") are divided by \(n-1\) rather than \(n\) because the first character is always equal to 1.
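The translation rule can be illustrated in base R by numbering tokens in order of first appearance. This is only a sketch of the idea: translate_tokens is a hypothetical name, and the package may implement the translation differently.

```r
# Hypothetical helper: relabel tokens by order of first appearance,
# leaving missing values (NA) untouched.
translate_tokens <- function(character) {
  seen <- unique(character[!is.na(character)])
  match(character, seen)
}

translated <- translate_tokens(c(0, 2, 1, 0))
print(translated)
# 1 2 3 1
```

Because the labels depend only on the order in which tokens appear, the first token is always 1, whatever its original value.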
special.behaviours allows you to define a special rule for each of the special.tokens. The functions can take the arguments character, all_states, with character being the character that contains the special token and all_states being all the states of that character (which are automatically detected by the function). By default, missing and inapplicable data return NA, and polymorphisms and uncertainties return all present states.
missing = function(x,y) NA
inapplicable = function(x,y) NA
polymorphism = function(x,y) strsplit(x, split = "\\&")[[1]]
uncertainty = function(x,y) strsplit(x, split = "\\/")[[1]]
Functions in the list must be named after the special token they handle (e.g. missing), take only x, y as inputs, and return a single value (which is automatically coerced to integer). For example, the special behaviour for the special token "?" can be coded as special.behaviours = list(missing = function(x, y) return(y)) to make all comparisons containing the special token "?" return any character state y.
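To see what such functions return, they can be called directly on made-up inputs. In this sketch the list mirrors the defaults quoted above; the tokens "0&1" and "1/2" and the assumption that y holds all the states of the character are illustrative only.

```r
# The default rules, collected in a named list
special_behaviours <- list(
  missing      = function(x, y) NA,
  inapplicable = function(x, y) NA,
  polymorphism = function(x, y) strsplit(x, split = "\\&")[[1]],
  uncertainty  = function(x, y) strsplit(x, split = "\\/")[[1]]
)

# Assumption: y is the set of all states observed for the character
all_states <- c("0", "1", "2")

# Polymorphisms and uncertainties are split into the states present
polymorphic_states <- special_behaviours$polymorphism("0&1", all_states)
uncertain_states   <- special_behaviours$uncertainty("1/2", all_states)

# Custom rule: a missing token "?" can stand for any state y
special_behaviours$missing <- function(x, y) return(y)
missing_states <- special_behaviours$missing("?", all_states)
```

With these inputs, polymorphic_states holds "0" and "1", uncertain_states holds "1" and "2", and missing_states holds every state in all_states.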
IMPORTANT: Note that for any distance method, NA values are skipped in the distance calculations (e.g. distance(A = {1, NA, 2}, B = {1, 2, 3}) is treated as distance(A = {1, 2}, B = {1, 3})).
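The NA rule amounts to dropping every position where either vector is missing before computing the distance. The base R sketch below reproduces the example with a Manhattan-style sum. It illustrates the rule and does not call the package function.

```r
A <- c(1, NA, 2)
B <- c(1, 2, 3)

# Positions where both A and B have a value
comparable <- !is.na(A) & !is.na(B)
A_kept <- A[comparable]   # 1 2
B_kept <- B[comparable]   # 1 3

# Sum of absolute differences on the comparable positions only
manhattan_skipped <- sum(abs(A_kept - B_kept))

# Same computation on the already reduced vectors
manhattan_reduced <- sum(abs(c(1, 2) - c(1, 3)))
```

Both sums equal 1, and the number of comparable positions drops from 3 to 2, which matters for scaled metrics.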
IMPORTANT: Note that the number of symbols (tokens) per character is limited by your machine's word-size (32 or 64 bits). If you have more than 64 tokens per character, you might want to use continuous data.