calculate_tree_length: Calculates the parsimony length of a set of phylogenetic tree(s)

Description

Given a tree, or set of trees, and a cladistic matrix, returns their parsimony length in number of steps.

Usage

calculate_tree_length(
  trees,
  cladistic_matrix,
  inapplicables_as_missing = FALSE,
  polymorphism_behaviour,
  uncertainty_behaviour,
  polymorphism_geometry,
  polymorphism_distance,
  state_ages,
  dollo_penalty
)

Value

A list with multiple components, including:

input_trees: The tree(s) used as input.
input_matrix: The raw (unmodified) cladistic_matrix input.
input_options: The various input options used. Output for use by downstream functions, such as ancestral state estimation and stochastic character mapping.
costmatrices: The costmatrices (one for each character) used. These are typically generated automatically by the function, but are output here for later use in ancestral state estimation and stochastic character mapping functions.
character_matrix: The single character matrix object used. Essentially the input_matrix modified by the input_options.
character_lengths: A matrix of characters (rows) and trees (columns) with values indicating the costs. The column sums of this matrix are the tree_lengths values. This output can also be used for homoplasy metrics.
character_weights: A vector of the character weights used.
tree_lengths: The primary output - the length for each input tree in total cost.
node_values: The values (lengths for each state) for each node across trees and characters. This is used by reconstruct_ancestral_states for ancestral state reconstruction.

Arguments

trees: A tree (phylo object) or set of trees (multiPhylo object).
cladistic_matrix: A character-taxon matrix in the format imported by read_nexus_matrix. The characters should be discrete, with rownames (taxon labels) matching the tip labels of trees.
inapplicables_as_missing: Logical that decides whether or not to treat inapplicables as missing (TRUE) or not (FALSE, the default and recommended option).
polymorphism_behaviour: One of "missing", "uncertainty", "polymorphism", or "random". See details.
uncertainty_behaviour: One of "missing", "uncertainty", "polymorphism", or "random". See details.
polymorphism_geometry: Argument passed to make_costmatrix.
polymorphism_distance: Argument passed to make_costmatrix.
state_ages: Argument passed to make_costmatrix.
dollo_penalty: Argument passed to make_costmatrix.

Author

Graeme T. Lloyd graemetlloyd@gmail.com

Details

Under the maximum parsimony criterion, a phylogenetic hypothesis is considered optimal if it requires the fewest evolutionary "steps" or - to generalise to non-discrete values - the minimum total cost. In order to evaluate this criterion we must therefore be able to calculate a tree's "length" (total cost assuming the lowest cost for every character used). Given one or more phylogenetic hypotheses and a cladistic matrix, this function calculates the minimum length for each tree.

Input data format

This function operates on a phylogenetic tree, or trees (in ape format), and a cladistic matrix (in cladisticMatrix format). However, the algorithm used is based on the generalised costmatrix approach of Swofford and Maddison (1992) and hence costmatrices need to be defined for each character (this is done internally by calling make_costmatrix), and some of the options are merely passed to this function.

Algorithm

Technically the Swofford and Maddison (1992) algorithm is designed for ancestral state reconstruction. However, its first pass of the tree assigns a length to each possible state at each node, so the minimum of these values at the root is also the tree length for that character. Hence, by skipping the later steps, it can be used as a tree length algorithm by simply summing these minimum values across characters. The Swofford and Maddison algorithm is chosen over the Wagner or Fitch algorithms (for ordered and unordered characters, respectively) because it generalises to the broadest range of character types, including asymmetric characters (Camin-Sokal, Dollo, stratigraphic) and custom character types (specified using costmatrices or character state trees), as well as to any resolution of tree (i.e., including multifurcating trees - important for establishing maximum costs for homoplasy indices). The only restriction here is that the tree must be rooted such that time's arrow is explicitly present. This is essential because the root defines the lengths across the whole tree, because directionality must be explicit for asymmetric characters, and because some downstream approaches (such as ACCTRAN and DELTRAN) require it. The two obvious drawbacks to this algorithm are that it can be slower and that it is not appropriate for unrooted trees.
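
The first pass described above can be sketched in a few lines of base R. This is only an illustration of the general idea, not the code used inside calculate_tree_length: the helper subtree_costs, the parent-child edge matrix and the 1-based state indices are all assumptions made for this example.

```r
# Hypothetical helper (not part of Claddis): returns, for one character, the
# minimum cost of the subtree below `node` for each state the node could take.
# edges:      two-column matrix of (parent, child) node numbers; tips are 1..n_tips
# tip_states: observed state of each tip, as a 1-based index into costmatrix
# costmatrix: costmatrix[i, j] is the cost of changing from state i to state j
subtree_costs <- function(node, edges, tip_states, costmatrix) {
  n_states <- nrow(costmatrix)
  if (node <= length(tip_states)) {
    # Tip: the observed state is free, every other state is impossible
    costs <- rep(Inf, n_states)
    costs[tip_states[node]] <- 0
    return(costs)
  }
  costs <- rep(0, n_states)
  for (child in edges[edges[, 1] == node, 2]) {
    child_costs <- subtree_costs(child, edges, tip_states, costmatrix)
    # For each state at this node, add the cheapest transition into the child
    costs <- costs + apply(costmatrix, 1, function(from_row) min(from_row + child_costs))
  }
  costs
}

# Rooted tree ((1,2),(3,4)); node 5 is the root, nodes 6 and 7 are internal
edges <- rbind(c(5, 6), c(5, 7), c(6, 1), c(6, 2), c(7, 3), c(7, 4))
tip_states <- c(1, 1, 2, 2)

# Unordered binary character: any change costs one step
unordered <- matrix(c(0, 1, 1, 0), nrow = 2)

# Per-state costs at the root; the smallest is the length for this character
root_costs <- subtree_costs(5, edges, tip_states, unordered)
character_length <- min(root_costs)
```

Summing character_length over every character (after weighting) would give the tree length. Because the loop visits however many children a node has, the same sketch handles multifurcations, and an asymmetric costmatrix can be supplied without other changes.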

Costmatrices and costmatrix options

Costmatrices are described in detail in the make_costmatrix manual, as are the options that are passed from this function to that one. Thus, the user is directed there for a more in-depth discussion of options.

Inapplicable and missing characters

In practice these two character types are treated the same for length calculations - in effect these are "free" characters that do not constrain the tree length calculation in the same way that a coded character would (because a coded character's transition cost must be accounted for; Swofford and Maddison 1992). Note that there are reasons to take differences into account in phylogenetic inference itself (see papers by Brazeau et al. 2019 and Goloboff et al. in press). The option to treat them differently here is therefore only important in terms of downstream analyses, such as ancestral state reconstruction (see reconstruct_ancestral_states for details).

Polymorphisms and uncertainties

Polymorphisms (coded with ampersands between states) and uncertainties (coded with slashes between states) can be interpreted in different ways, including those that affect estimates of tree length. Hence four options are provided to the user here:

Missing (polymorphism_behaviour = "missing" or uncertainty_behaviour = "missing"). Here polymorphisms or uncertainties are simply replaced by the missing character (NA). This removes them from the calculation process completely (likely leading to undercounts), and hence is not generally recommended.
Uncertainty (polymorphism_behaviour = "uncertainty" or uncertainty_behaviour = "uncertainty"). This is the intended use of uncertain codings (e.g., 0/1) and constrains the tree length calculation to having to explain the least costly transition of those in the uncertainty. This is recommended for uncertain characters (although note that it biases the result towards the shortest possible length), but not truly polymorphic characters (as one or more state acquisitions are being missed, see Nixon and Davis 1991 and make_costmatrix for discussion of this). This is also - to the best of my knowledge - the approach used by most parsimony software, such as PAUP* (Swofford 2003) and TNT (Goloboff et al. 2008; Goloboff and Catalano 2016).
Polymorphism (polymorphism_behaviour = "polymorphism" or uncertainty_behaviour = "polymorphism"). If polymorphisms are real then some means of accounting for the changes that produce them seems appropriate, albeit difficult (see Nixon and Davis 1991 and Swofford and Maddison 1992 for discussions). If this option is applied it triggers the downstream options in make_costmatrix (by internally setting include_polymorphisms = TRUE), and the user should look there for more information. This is tentatively recommended for true polymorphisms (but note that it complicates interpretation), but not uncertainties.
Random (polymorphism_behaviour = "random" or uncertainty_behaviour = "random"). Another means of dealing with multiple-state characters is simply to sample a single state at random for each one, for example as Watanabe (2016) did with their PERDA algorithm. This simplifies the process, but also logically requires running the function multiple times to quantify uncertainty. This is not recommended for true polymorphisms (as interpretation is confounded), but may be appropriate for obtaining a tree length that is less biased downwards than under "uncertainty".

These choices can also affect ancestral state estimation (see reconstruct_ancestral_states).
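
One way to picture three of these options is through the starting cost assigned to each state at a tip, before any tree traversal. The sketch below is a loose illustration under that assumption, not the package's internal representation: the helper tip_costs is hypothetical, and the "polymorphism" option is left out because it depends on the costmatrix built by make_costmatrix.

```r
# Hypothetical helper (not part of Claddis): starting costs for one tip coding.
# coding:    a single coding such as "0", "0/1" (uncertainty) or "0&1" (polymorphism)
# states:    character vector of all states the character can take
# behaviour: "missing", "uncertainty" or "random"
tip_costs <- function(coding, states, behaviour) {
  observed <- strsplit(coding, "[&/]")[[1]]
  if (behaviour == "missing") {
    # Treated as NA: every state is free, so the tip never constrains the length
    return(setNames(rep(0, length(states)), states))
  }
  if (behaviour == "uncertainty") {
    # Any listed state is free; the cheapest one is chosen during the traversal
    return(setNames(ifelse(states %in% observed, 0, Inf), states))
  }
  if (behaviour == "random") {
    # A single listed state is sampled, so repeated runs can differ
    picked <- sample(observed, 1)
    return(setNames(ifelse(states == picked, 0, Inf), states))
  }
  stop("behaviour must be \"missing\", \"uncertainty\" or \"random\"")
}

states <- c("0", "1", "2")
missing_costs <- tip_costs("0/1", states, "missing")
uncertainty_costs <- tip_costs("0/1", states, "uncertainty")
random_costs <- tip_costs("0/1", states, "random")
```

Under this picture "missing" leaves all three states free, "uncertainty" rules out only state 2, and "random" rules out all but one sampled state, which is why the first two can only lower the length relative to a fully coded tip.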

Polytomies

Polytomies are explicitly allowed by the function, but will always be treated as "hard" (i.e., literal multifurcations). Note that typically these will lead to higher tree lengths than fully bifurcating trees and indeed that the maximum cost is typically calculated from the star tree (single multifurcation).
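
For the simplest case, an unordered character with unit costs, the star tree cost has a closed form: the root takes the most common state and every tip with a different state costs one step. The sketch below assumes exactly that case; star_tree_cost is a hypothetical helper, not a Claddis function, and other character types would need the full costmatrix calculation.

```r
# Hypothetical helper (not part of Claddis): cost of an unordered, unit-cost
# character on a star tree (a single multifurcation joining every tip).
# tip_states: vector of observed states, with NA for missing codings
star_tree_cost <- function(tip_states) {
  observed <- tip_states[!is.na(tip_states)]
  if (length(observed) == 0) return(0)
  # Every tip that differs from the most common state needs one change
  length(observed) - max(table(observed))
}

# Four coded tips (two of each state) and one missing coding
maximum_cost <- star_tree_cost(c(0, 0, 1, 1, NA))
```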

Further constraints

In future the function will allow restrictions to be placed on the state at particular internal nodes. This can have multiple applications, including (for example) treating some taxa as ancestral such that their states are directly tied to specific nodes, e.g., in stratocladistics (Fisher 1994; Marcot and Fox 2008).

Character weights

Tree lengths output already include corrections for character weights as supplied in the cladistic_matrix input. So, for example, if a binary character costs two on the tree, but is weighted five, then it will contribute a total cost of 10 to the result.
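
The arithmetic can be checked with a small worked example. The numbers and object names below are invented for illustration, and the sketch assumes that each character's unweighted cost is simply multiplied by its weight before summing over characters.

```r
# Invented unweighted costs: characters in rows, trees in columns
unweighted_costs <- matrix(
  c(2, 3, 1, 1),
  nrow = 2,
  dimnames = list(c("character_1", "character_2"), c("tree_1", "tree_2"))
)

# Invented weights, one per character (character_1 is weighted five)
weights <- c(5, 1)

# Multiplying by a vector of length nrow scales each row by its weight
weighted_costs <- unweighted_costs * weights

# Column sums give one total length per tree
tree_totals <- colSums(weighted_costs)
```

Here character_1 costs two steps on tree_1 and so contributes 2 x 5 = 10 to that tree's total.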

References

Brazeau, M. D., Guillerme, T. and Smith, M. R., 2019. An algorithm for morphological phylogenetic analysis with inapplicable data. Systematic Biology, 68, 619-631.

Fisher, D. C., 1994. Stratocladistics: morphological and temporal patterns and their relation to phylogenetic process. In L. Grande and O. Rieppel (eds.), Interpreting the Hierarchy of Nature. Academic Press, San Diego. pp133–171.

Goloboff, P. A. and Catalano, S. A., 2016. TNT version 1.5, including a full implementation of phylogenetic morphometrics. Cladistics, 32, 221-238.

Goloboff, P., Farris, J. and Nixon, K., 2008. TNT, a free program for phylogenetic analysis. Cladistics, 24, 774-786.

Goloboff, P. A., De Laet, J., Rios-Tamayo, D. and Szumik, C. A., in press. A reconsideration of inapplicable characters, and an approximation with step‐matrix recoding. Cladistics.

Marcot, J. D. and Fox, D. L., 2008. StrataPhy: a new computer program for stratocladistic analysis. Palaeontologia Electronica, 11, 5A.

Nixon, K. C. and Davis, J. I., 1991. Polymorphic taxa, missing values and cladistic analysis. Cladistics, 7, 233-241.

Swofford, D. L., 2003. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.

Swofford, D. L. and Maddison, W. P., 1992. Parsimony, character-state reconstructions, and evolutionary inferences. In R. L. Mayden (ed.), Systematics, Historical Ecology, and North American Freshwater Fishes. Stanford University Press, Stanford. pp187-223.

Watanabe, A., 2016. The impact of poor sampling of polymorphism on cladistic analysis. Cladistics, 32, 317-334.

Examples

# Use Gauthier 1986 as example matrix:
cladistic_matrix <- Claddis::gauthier_1986

# Use one of the MPTs from a TNT analysis as the tree:
tree <- ape::read.tree(
  text = paste(
    "(Outgroup,(Ornithischia,(Sauropodomorpha,(Ceratosauria,Procompsognathus,",
    "Liliensternus,(Carnosauria,(Ornithmimidae,Saurornitholestes,Hulsanpes,(Coelurus,",
    "Elmisauridae,(Compsognathus,(Ornitholestes,Microvenator,Caenagnathidae,",
    "(Deinonychosauria,Avialae))))))))));",
    sep = ""
  )
)

# Calculate tree length (and only use tree lengths from output):
calculate_tree_length(
  trees = tree,
  cladistic_matrix = cladistic_matrix,
  inapplicables_as_missing = TRUE,
  polymorphism_behaviour = "uncertainty",
  uncertainty_behaviour = "uncertainty",
  polymorphism_geometry = "simplex",
  polymorphism_distance = "euclidean",
  state_ages = c(200, 100),
  dollo_penalty = 999
)$tree_lengths