SplitInformation() addresses bipartition splits, which correspond to
edges in an unrooted phylogeny; MultiSplitInformation() supports splits
that subdivide taxa into multiple partitions, which may correspond to
multi-state characters in a phylogenetic matrix.
A simple way to characterise trees is to count the number of edges.
(Edges are almost, but not quite, equivalent to nodes.)
Counting edges (or nodes) provides a quick measure of a tree's resolution,
and underpins the Robinson-Foulds tree distance measure.
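
To make the link between edge counting and the Robinson-Foulds distance concrete, here is a minimal Python sketch. It is not the 'TreeDist' implementation: it assumes each tree is supplied as a hand-written list of its non-trivial splits, and the helper normalise() and the two example trees are hypothetical. The distance is taken as the number of splits found in exactly one of the two trees.

```python
# Illustrative sketch only: trees are given directly as their non-trivial
# splits, each written as the leaves on one side of an internal edge.
LEAVES = frozenset("ABCDEF")


def normalise(side):
    """Describe a split by the side that excludes the reference leaf.

    AB|CDEF and CDEF|AB are the same split; always storing the side that
    lacks the alphabetically first leaf gives each split a single label.
    """
    side = frozenset(side)
    reference = min(LEAVES)
    return LEAVES - side if reference in side else side


# Two fully resolved six-leaf trees (hypothetical examples).
tree1 = {normalise(s) for s in ("AB", "ABC", "EF")}
tree2 = {normalise(s) for s in ("AB", "ABD", "EF")}

# Resolution: number of internal edges (a binary tree on n leaves has n - 3).
resolution1 = len(tree1)
resolution2 = len(tree2)

# Robinson-Foulds distance: splits present in exactly one of the two trees.
rf_distance = len(tree1 ^ tree2)

print(resolution1, resolution2, rf_distance)
```

Here the two trees share AB|CDEF and ABCD|EF and differ only in their third split, and every split counts equally towards the distance regardless of how evenly it divides the leaves.
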
Not all edges, however, are created equal.
An edge splits the leaves of a tree into two subdivisions. The more equal
these subdivisions are in size, the more instructive this edge is.
Intuitively, the division of mammals from reptiles is a profound revelation
that underpins much of zoology; recognising that two species of bat are more
closely related to each other than to any other mammal or reptile is still
instructive, but somewhat less fundamental.
Formally, the phylogenetic (Shannon) information content of a split S,
h(S), is the negative logarithm of the probability P(S) that a uniformly
selected random tree will contain the split: h(S) = -log P(S).
Base 2 logarithms are typically employed to yield an information content in
bits.
As an example, the split AB|CDEF occurs in 15 of the 105 six-leaf trees;
h(AB|CDEF) = -log2 P(AB|CDEF) = -log2(15/105) ~ 2.81 bits. The split ABC|DEF
subdivides the leaves more evenly, and is thus more instructive:
it occurs in just nine of the 105 six-leaf trees, and
h(ABC|DEF) = -log2(9/105) ~ 3.54 bits.
As the number of leaves increases, a single even split may contain more
information than multiple uneven splits -- see the examples section below.
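
The arithmetic above can be reproduced with a short Python sketch. It is an illustration rather than the package's own code, and the function names are hypothetical. It relies on two standard counting results that the text does not state: there are (2n - 5)!! unrooted binary trees on n labelled leaves, and (2a - 3)!! x (2b - 3)!! of them contain a given split that divides the leaves a | b.

```python
from math import log2


def double_factorial(k):
    """Return k!! = k * (k - 2) * ... * 1 for odd k; (-1)!! and 1!! are 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted(n_leaves):
    """Number of unrooted binary trees on n_leaves labelled leaves."""
    return double_factorial(2 * n_leaves - 5)


def trees_matching_split(a, b):
    """Number of unrooted binary trees that contain a given a | b split."""
    return double_factorial(2 * a - 3) * double_factorial(2 * b - 3)


def split_information(a, b):
    """Phylogenetic information content, in bits, of an a | b split."""
    probability = trees_matching_split(a, b) / n_unrooted(a + b)
    return -log2(probability)


# Six-leaf examples from the text.
print(trees_matching_split(2, 4), n_unrooted(6))  # 15 105
print(round(split_information(2, 4), 2))          # 2.81
print(round(split_information(3, 3), 2))          # 3.54

# With 20 leaves, compare one even split against three separate 2 | 18
# splits, each scored on its own and then summed.
even_split = split_information(10, 10)
three_uneven = 3 * split_information(2, 18)
print(even_split > three_uneven)                  # True
```

With six leaves, two uneven splits still sum to more than one even split; under the same formula, by 20 leaves a single 10 | 10 split outweighs three 2 | 18 splits.
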
Summing the information content of all splits within a tree, perhaps using
the 'TreeDist' function SplitwiseInfo(), arguably gives a more instructive
picture of its resolution than simply counting the number of splits that are
present -- though with the caveat that splits within a tree are not
independent of one another, so some information may be double counted.
(This same charge applies to simply counting nodes, too.)
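
The double-counting caveat can be seen in a small Python sketch. This is an illustration under stated assumptions, not a reimplementation of SplitwiseInfo(): each tree is described only by the sizes of its non-trivial splits, the function and variable names are hypothetical, and the split probability uses the standard count of (2a - 3)!! x (2b - 3)!! matching trees out of (2n - 5)!! unrooted binary trees.

```python
from math import log2


def double_factorial(k):
    """Return k!! for odd k, treating (-1)!! and 1!! as 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def split_information(a, b):
    """Information content, in bits, of a split dividing leaves a | b."""
    matching = double_factorial(2 * a - 3) * double_factorial(2 * b - 3)
    total = double_factorial(2 * (a + b) - 5)
    return -log2(matching / total)


def summed_split_information(split_sizes):
    """Sum the information content of each split, treating them as independent."""
    return sum(split_information(a, b) for a, b in split_sizes)


# The two shapes of fully resolved, unrooted six-leaf tree.
caterpillar = [(2, 4), (3, 3), (2, 4)]     # e.g. AB|CDEF, ABC|DEF, ABCD|EF
three_cherries = [(2, 4), (2, 4), (2, 4)]  # e.g. AB|CDEF, CD|ABEF, EF|ABCD

caterpillar_total = summed_split_information(caterpillar)
cherries_total = summed_split_information(three_cherries)

# Bits needed to single out one tree among all 105 six-leaf trees.
tree_identity_bits = log2(double_factorial(2 * 6 - 5))

print(round(caterpillar_total, 2), round(cherries_total, 2))
print(round(tree_identity_bits, 2))
```

Both trees have three splits, yet the caterpillar tree scores higher because it contains the even split. Both sums also exceed the number of bits needed to identify a single six-leaf tree, which is one way to see that splits within a tree are not independent.
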
Alternatives would be to count the number of quartets that are resolved,
perhaps using the 'Quartet' function QuartetStates(), or to use a different
take on the information contained within a split, the clustering
information: see the 'TreeDist' function ClusteringInfo() for details.