Given a phylogenetic tree and a set of states, permutes all possible tip state combinations.
permute_tipstates(tree, states, all_states_present = TRUE)
A matrix where each row is a unique suite of states and each column a lebelled tip of the input tree.
A phylogenetic tree in ape format.
A vector of character states.
A logical indicating whether or not all states should appear at least once. (Defaults to TRUE
.)
Graeme T. Lloyd graemetlloyd@gmail.com
When calculating g_max or otherwise determining the limits for a given character type it can be useful to generate (permute) all possible combinations of tip states for a given tree. For example, let's imagine we have a binary character (states 0 and 1) and four tips (labelled A-D). We could simply assign each possible state to each possible label to get the following sixteen permutations:
______________________
| # | A | B | C | D |
----------------------
| 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 |
| 6 | 1 | 1 | 0 | 0 |
| 7 | 1 | 0 | 1 | 0 |
| 8 | 1 | 0 | 0 | 1 |
| 9 | 0 | 1 | 1 | 0 |
| 10 | 0 | 1 | 0 | 1 |
| 11 | 0 | 0 | 1 | 1 |
| 12 | 1 | 1 | 1 | 0 |
| 13 | 1 | 1 | 0 | 1 |
| 14 | 1 | 0 | 1 | 1 |
| 15 | 0 | 1 | 1 | 1 |
| 16 | 1 | 1 | 1 | 1 |
----------------------
We can know there are sixteen permutations before we even check as logically for any N states we simply multiply T times (for T tips). I.e., the answer is given by N^T. As here N is 2 and T is 4 this is 2^4 = 2 x 2 x 2 x 2 = 16.
Technically this achieves our goal - we have generated all possible tip states, but there are at least two reasons this approach is suboptimal. Firstly, permutations 1 and 16 are invariant (all tip states are the same) and so are unlikely to be of interest. Or, to generalise this, we might not want to consider permutations where not all possible states are sampled - i.e., we may wish to stipulate that every state appear at least once. Secondly, this approach does not consider the structure of the phylogenetic tree of interest. It might not be immediately obvious why this matters, so let's consider another example, Sticking with our binary character and four tips let's consider the tree:
A B C D
\ \ \ /
\ \ /
\ \ /
\ /
\ /
And the following permutations (from above):
______________________
| # | A | B | C | D |
----------------------
| 4 | 0 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 |
| 7 | 1 | 0 | 1 | 0 |
| 8 | 1 | 0 | 0 | 1 |
| 9 | 0 | 1 | 1 | 0 |
| 10 | 0 | 1 | 0 | 1 |
| 12 | 1 | 1 | 1 | 0 |
| 13 | 1 | 1 | 0 | 1 |
----------------------
Here are the eight permutations where C and D are assigned different states. However, in practical terms there is no need to generate half of these permutations as the order CD/DC doesn't matter (due to the principal of free rotation). In other words, the "identity" (i.e., the label of a tip) is not as important as the structure of the tree.
Thus, there are two ways in which the simple ordered permutation method inefficiently generates permutations that are redundant with respect to a given tree. This is important as scaling the simple assignment of states to labels (N^T) can rapidly lead to very large numbers of permutations:
___________________________________________________________________
| Number of tips |
------------------------------------------------------------------------------
| N States | 5 | 10 | 20 | 50 | 100 |
------------------------------------------------------------------------------
| 2 | 32 | 1,024 | 1,048,576 | 1.13 x 10^15 | 1.27 x 10^30 |
| 3 | 243 | 59,049 | 3,486,784,401 | 7.18 x 10^23 | 5.15 x 10^47 |
| 4 | 1,024 | 1,048,576 | 1.10 x 10^12 | 1.27 x 10^30 | 1.61 x 10^60 |
| 5 | 3,125 | 9,765,625 | 9.54 x 10^13 | 8.88 x 10^34 | 7.89 x 10^69 |
------------------------------------------------------------------------------
Thus for many empirically realistic examples the computational expense is too great and a more complex, less redundant, permutation algorithm is desirable that directly considers both tree structure and the desire for full variance sampling.
As this is not a simple algorithm let's begin by consider a simple and limiting case - the star tree:
A B C D
| | | |
| | | |
-------------
Here there is no "normal" tree structure, but in practical terms this means there is no reason to consider which label a tip state is assigned to. If we again consider our binary character from above this means by pruning invariant and redundant partitions we get just three permutations:
______________________
| # | A | B | C | D |
----------------------
| 2 | 1 | 0 | 0 | 0 |
| 6 | 1 | 1 | 0 | 0 |
| 12 | 1 | 1 | 1 | 0 |
----------------------
Put simply, these are the cases where there is one, two, or three of state 1, and three, two or one of state 0. Thus we have dramatically reduced the number of permutations, here by over 80
__________________________
| Number of tips |
-------------------------------------
| N States | 5 | 10 | 20 | 50 | 100 |
-------------------------------------
| 2 | 4 | 9 | 19 | 49 | 99 |
-------------------------------------
However, things become a little more complex when we consider a third state. It also makes more sense to start tabulating these permutations by the number of times a state appears instead of arbitrarily assigning states to a particular label. So our binary example would look like this:
_________
| 0 | 1 |
---------
| 3 | 1 |
| 2 | 2 |
| 1 | 3 |
---------
If we now add a third possible state (2) we can permute all the possible scenarios (excluding those where a state doesn't appear at all) as:
_____________
| 0 | 1 | 2 |
-------------
| 2 | 1 | 1 |
| 1 | 2 | 1 |
| 1 | 1 | 2 |
-------------
Thus for this example there are still only three options. This is because stipulating that every state appears at least once means there is only one "free" assignment that can be allocated as either state 0, state 1 or state 2 - accounting for the three permutations. In other words, the assignments here had to be 2, 1, and 1, the only distinction is the "order" of these assignments. Let's consider a fifth tip but stick with a star tree. Now the assignments can be 3, 1, and 1 or 2, 2, and 1:
_____________
| 0 | 1 | 2 |
-------------
| 3 | 1 | 1 |
| 1 | 3 | 1 |
| 1 | 1 | 3 |
| 2 | 2 | 1 |
| 2 | 1 | 2 |
| 1 | 2 | 2 |
-------------
Thus by increasing the number of tips by one in this instance we have also doubled the permutations. Note, though, that we do not need an additional column to represent this, just additional rows. Adding a sixth tip we get:
_____________
| 0 | 1 | 2 |
-------------
| 4 | 1 | 1 |
| 1 | 4 | 1 |
| 1 | 1 | 4 |
| 3 | 2 | 1 |
| 3 | 1 | 2 |
| 2 | 3 | 1 |
| 1 | 3 | 2 |
| 1 | 2 | 3 |
| 2 | 1 | 3 |
| 2 | 2 | 2 |
-------------
Now we have three possible assignments (4-1-1, 3-2-1 and 2-2-2), but find the ways to permute these are no longer equal (3, 6 and 1, respectively). This makes it harder to derive an equation that will allow us to scale the problem to any value of N or T. We can repeat our table from above though to get a general sense of scale. Remember that this is for the star tree with the additional stipulation that every state appears at least once:
__________________________________________
| Number of tips |
-----------------------------------------------------
| N States | 5 | 10 | 20 | 50 | 100 |
-----------------------------------------------------
| 2 | 4 | 9 | 19 | 49 | 99 |
| 3 | 6 | 36 | 171 | 1,176 | 4,851 |
| 4 | 4 | 84 | 969 | 18,424 | 156,849 |
| 5 | 1 | 126 | 3,876 | 211,876 | 3,764,376 |
-----------------------------------------------------
We see, then, that this scales much more reasonably than our initial approach, although this is the "best case" scenario of no tree structure. We can also relax things a little by allowing states to not always appear:
______________________________________________
| Number of tips |
---------------------------------------------------------
| N States | 5 | 10 | 20 | 50 | 100 |
---------------------------------------------------------
| 2 | 6 | 11 | 21 | 51 | 101 |
| 3 | 21 | 66 | 231 | 1,326 | 5,151 |
| 4 | 56 | 286 | 1,771 | 23,426 | 176,851 |
| 5 | 126 | 1,001 | 10,626 | 316,251 | 4,598,126 |
---------------------------------------------------------
These are larger, but not intolerably so.
In practice things can become more complex when the full variety of possible treeshapes are considered and in fact the present algorithm is not without the possibility of redundancies. Specifically, because trees can contain repeating "motifs" that share a common ancestor fewer permutations would still cover the full range of possibilities. For example, consider the tree:
A B C D E F
\ \ / \ \ /
\ / \ /
\ / \ /
\ /
\ /
\ /
\ /
\ /
\ /
Here the motifs:
A B C D E F
\ \ / \ \ /
\ / and \ /
\ / \ /
Are the same as such so would be the permutations:
_________________________
| A | B | C | D | E | F |
-------------------------
| 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 |
-------------------------
Currently the function does not account for these, nor is a closed form equation known that would generate the true smallest number of unique permtations.
Instead, it simply permutes each "berry" (a node where all descendants are terminals, an extension of the notion pf phylogenetic "cherries") as though it is the star tree, and permuting all other tips as simple labellings.
# Permute four tip states (A, C, G, T) on a six-taxon tree:
permute_tipstates(
tree = ape::read.tree(text = "((A,B),(C,(D,(E,F))));"),
states = c("A", "C", "G", "T"),
all_states_present = TRUE
)
# Permute three tip states (0, 1, 2) on a ten-taxon star tree:
permute_tipstates(
tree = ape::read.tree(text = "(A,B,C,D,E,F,G,H,I);"),
states = c("0", "1", "2"),
all_states_present = TRUE
)
Run the code above in your browser using DataLab