Each method for calculating distance is expressed as a function \(d(x, y)\), where \(x\) and \(y\) are a pair of columns (if by.col = TRUE) or rows in the matrix, \(n\) is the number of comparable rows (if by.col = TRUE) or columns between them, and \(i\) indexes any one of those comparable rows (if by.col = TRUE) or columns.
The different methods are:
"hamming" The relative distance between characters. This is equal to the Gower distance for non-numeric comparisons (e.g. character tokens; Gower 1966).
\(d(x,y) = \frac{\sum_{i=1}^{n} \lvert x_i - y_i \rvert}{n}\)
"manhattan" The "raw" distance between characters:
\(d(x,y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert\)
"comparable" The number of comparable characters (i.e. the number of tokens that can be compared):
\(d(x,y) = \sum_{i=1}^{n} \frac{x_i - y_i}{x_i - y_i}\)
"euclidean" The euclidean distance between characters:
\(d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\)
"maximum" The maximum distance between characters:
\(d(x,y) = \max_{i} \lvert x_i - y_i \rvert\)
"mord" The maximum observable distance between characters (Lloyd 2016):
\(d(x,y) = \frac{\sum_{i=1}^{n} \lvert x_i - y_i \rvert}{\sum_{i=1}^{n} \frac{x_i - y_i}{x_i - y_i}}\)
"none" Returns the matrix with any converted and/or translated tokens.
"binary" Returns the matrix with the binary characters.
When using translate = TRUE, the characters are translated following the xyz notation, where the first token is translated to 1, the second to 2, etc. For example, the character 0, 2, 1, 0 is translated to 1, 2, 3, 1. In other words, when translate = TRUE, the character tokens are not interpreted as numeric values. When using translate = TRUE, scaled metrics (i.e. "hamming" and "gower") are divided by \(n-1\) rather than \(n\) because the first character is always equal to 1.
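The translation rule can be illustrated with a minimal base R sketch. The helper name translate_tokens is hypothetical and is not part of the documented function; it only reproduces the "first token seen becomes 1, second becomes 2" rule described above.

```r
# Hypothetical helper: relabel tokens by their order of first appearance.
translate_tokens <- function(character) {
  # unique() keeps tokens in order of first appearance;
  # match() returns the position of each token in that ordered set.
  match(character, unique(character))
}

translate_tokens(c(0, 2, 1, 0))
# 1 2 3 1
```

Because the new labels depend only on the order of appearance, the original numeric values of the tokens play no role once they are translated.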
special.behaviours allows you to define a special rule for each of the special.tokens. The functions should take the arguments character, all_states, with character being the character that contains the special token and all_states being all the states of that character (which are automatically detected by the function). By default, missing data and inapplicable tokens return NA, and polymorphisms and uncertainties return all present states.
missing = function(x,y) NA
inapplicable = function(x,y) NA
polymorphism = function(x,y) strsplit(x, split = "\\&")[[1]]
uncertainty = function(x,y) strsplit(x, split = "\\/")[[1]]
Functions in the list must be named after the special token they handle (e.g. missing), take only x, y as inputs, and return a single value (which is coerced to integer automatically). For example, the special behaviour for the special token "?" can be coded as special.behaviours = list(missing = function(x, y) return(y)) to make all comparisons containing the special token "?" return any character state y.
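As a rough illustration, such a list can be built and tried out in base R before being supplied as the special.behaviours argument. The object name my_behaviours and the example inputs are assumptions made for this sketch; the two functions are the ones quoted above.

```r
# Illustrative list of special behaviours (the name my_behaviours is hypothetical).
my_behaviours <- list(
  # "?" returns the states passed as the second argument
  missing      = function(x, y) return(y),
  # "&" splits the token into the states it contains (the default shown above)
  polymorphism = function(x, y) strsplit(x, split = "\\&")[[1]]
)

# Calling the functions directly to see what they return
my_behaviours$missing("?", c(0, 1))
my_behaviours$polymorphism("0&1", c(0, 1, 2))
```

Calling the functions by hand like this is only a way to check what each rule returns for a given token.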
IMPORTANT: Note that for any distance method, NA values are skipped in the distance calculations (e.g. distance(A = {1, NA, 2}, B = {1, 2, 3}) is treated as distance(A = {1, 2}, B = {1, 3})).
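The NA rule and several of the formulas listed above can be sketched together in base R. This is a literal reading of the formulas for numeric tokens, not the function's actual implementation; the helper name sketch_distance is hypothetical.

```r
# Hypothetical helper: applies the listed formulas to two numeric vectors,
# skipping any position where either value is NA.
sketch_distance <- function(x, y, method = "hamming") {
  # A pair is comparable only when both values are present
  comparable <- !is.na(x) & !is.na(y)
  diffs <- abs(x[comparable] - y[comparable])
  n <- sum(comparable)
  switch(method,
    hamming    = sum(diffs) / n,     # relative distance
    manhattan  = sum(diffs),         # "raw" distance
    comparable = n,                  # number of comparable pairs
    euclidean  = sqrt(sum(diffs^2)),
    maximum    = max(diffs),
    stop("unknown method: ", method)
  )
}

A <- c(1, NA, 2)
B <- c(1, 2, 3)
# The second position is dropped, so this compares {1, 2} with {1, 3}
sketch_distance(A, B, "manhattan")
# 1
sketch_distance(A, B, "comparable")
# 2
```

With the NA position removed, only two pairs remain, which is why the "comparable" count is 2 and the relative ("hamming") distance under this reading is 1/2.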
IMPORTANT: Note that the number of symbols (tokens) per character is limited by your machine's word-size (32 or 64 bits). If you have more than 64 tokens per character, you might want to use continuous data.