char.diff: Character differences

Description

Calculates the character difference from a discrete matrix

Usage

char.diff(
  matrix,
  method = "hamming",
  translate = TRUE,
  special.tokens,
  special.behaviours,
  order = FALSE,
  by.col = TRUE,
  correction
)

Value

A character difference value or a matrix of class char.diff

Arguments

matrix: A discrete matrix or a list containing discrete characters. The differences is calculated between the columns (usually characters). Use t(matrix) or by.col = FALSE to calculate the differences between the rows.
method: The method to measure difference: "hamming" (default; Hamming 1950), "manhattan", "comparable", "euclidean", "maximum", "mord" (Lloyd 2016), "none" or "binary".
translate: logical, whether to translate the characters following the xyz notation (TRUE - default; see details - Felsenstein 2004) or not (FALSE). Translation works for up to 26 tokens per character.
special.tokens: optional, a named vector of special tokens to be passed to grep (make sure to protect the character with "\\"). By default special.tokens <- c(missing = "\\?", inapplicable = "\\-", polymorphism = "\\&", uncertainty = "\\/"). Note that NA values are not compared and that the symbol "@" is reserved and cannot be used.
special.behaviours: optional, a list of one or more functions for a special behaviour for special.tokens. See details.
order: logical, whether the character should be treated as order (TRUE) or not (FALSE - default). This argument can be a logical vector equivalent to the number of rows or columns in matrix (depending on by.col) to specify ordering for each character.
by.col: logical, whether to measure the distance by columns (TRUE - default) or by rows (FALSE).
correction: optional, an eventual function to apply to the matrix after calculating the distance.

Author

Thomas Guillerme

Details

Each method for calculating distance is expressed as a function of \(d(x, y)\) where \(x\) and \(y\) are a pair of columns (if by.col = TRUE) or rows in the matrix and n is the number of comparable rows (if by.col = TRUE) or columns between them and i is any specific pair of rows (if by.col = TRUE) or columns. The different methods are:

"hamming" The relative distance between characters. This is equal to the Gower distance for non-numeric comparisons (e.g. character tokens; Gower 1966). \(d(x,y) = \sum[i,n](abs(x[i] - y[i])/n\)
"manhattan" The "raw" distance between characters: \(d(x,y) = \sum[i,n](abs(x[i] - y[i])\)
"comparable" The number of comparable characters (i.e. the number of tokens that can be compared): \(d(x,y) = \sum[i,n]((x[i] - y[i])/(x[i] - y[i]))\)
"euclidean" The euclidean distance between characters: \(d(x,y) = \sqrt(\sum[i,n]((x[i] - y[i])^2))\)
"maximum" The maximum distance between characters: \(d(x,y) = max(abs(x[i] - y[i]))\)
"mord" The maximum observable distance between characters (Lloyd 2016): \(d(x,y) = \sum[i,n](abs(x[i] - y[i])/\sum[i,n]((x[i] - y[i])/(x[i] - y[i])\)
"none" Returns the matrix with eventual converted and/or translated tokens.
"binary" Returns the matrix with the binary characters.

When using translate = TRUE, the characters are translated following the xyz notation where the first token is translated to 1, the second to 2, etc. For example, the character 0, 2, 1, 0 is translated to 1, 2, 3, 1. In other words when translate = TRUE, the character tokens are not interpreted as numeric values. When using translate = TRUE, scaled metrics (i.e "hamming" and "gower") are divide by \(n-1\) rather than \(n\) due to the first character always being equal to 1.

special.behaviours allows to generate a special rule for the special.tokens. The functions should can take the arguments character, all_states with character being the character that contains the special token and all_states for the character (which is automatically detected by the function). By default, missing data returns and inapplicable returns NA, and polymorphisms and uncertainties return all present states.

missing = function(x,y) NA
inapplicable = function(x,y) NA
polymorphism = function(x,y) strsplit(x, split = "\\&")[[1]]
uncertainty = function(x,y) strsplit(x, split = "\\/")[[1]]

Functions in the list must be named following the special token of concern (e.g. missing), have only x, y as inputs and a single output a single value (that gets coerced to integer automatically). For example, the special behaviour for the special token "?" can be coded as: special.behaviours = list(missing = function(x, y) return(y) to make all comparisons containing the special token containing "?" return any character state y.

IMPORTANT: Note that for any distance method, NA values are skipped in the distance calculations (e.g. distance(A = {1, NA, 2}, B = {1, 2, 3}) is treated as distance(A = {1, 2}, B = {1, 3})).

IMPORTANT: Note that the number of symbols (tokens) per character is limited by your machine's word-size (32 or 64 bits). If you have more than 64 tokens per character, you might want to use continuous data.

References

Felsenstein, J. 2004. Inferring phylogenies vol. 2. Sinauer Associates Sunderland. Gower, J.C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325-338. Hamming, R.W. 1950. Error detecting and error correcting codes. The Bell System Technical Journal. DOI: 10.1002/j.1538-7305.1950.tb00463.x. Lloyd, G.T. 2016. Estimating morphological diversity and tempo with discrete character-taxon matrices: implementation, challenges, progress, and future directions. Biological Journal of the Linnean Society. DOI: 10.1111/bij.12746.

Examples

Run this code

## Comparing two binary characters
char.diff(list(c(0, 1, 0, 1), c(0, 1, 1, 1)))

## Pairwise comparisons in a morphological matrix
morpho_matrix <- matrix(sample(c(0,1), 100, replace = TRUE), 10)
char.diff(morpho_matrix)

## Adding special tokens to the matrix
morpho_matrix[sample(1:100, 10)] <- c("?", "0&1", "-")
char.diff(morpho_matrix)

## Modifying special behaviours for tokens with "&" to be treated as NA
char.diff(morpho_matrix,
          special.behaviours = list(polymorphism = function(x,y) return(NA)))

## Adding a special character with a special behaviour (count "%" as "100")
morpho_matrix[sample(1:100, 5)] <- "%"
char.diff(morpho_matrix,
          special.tokens = c("paragraph" = "\\%"),
          special.behaviours = list(paragraph = function(x,y) as.integer(100)))

## Comparing characters with/without translation
char.diff(list(c(0, 1, 0, 1), c(1, 0, 1, 0)), method = "manhattan")
# no character difference
char.diff(list(c(0, 1, 0, 1), c(1, 0, 1, 0)), method = "manhattan",
          translate = FALSE)
# all four character states are different

Run the code above in your browser using DataLab