This function creates a tree by successively splitting
the dataset into smaller and smaller subsets (recursive
partitioning). This is a divisive, or "top-down" approach to tree-building,
as opposed to agglomerative "bottom-up" methods such as neighbor joining
and UPGMA. It is particularly useful for large large datasets with many sequences
(n > 10,000) since the need to compute a large n * n
distance matrix is circumvented.
Instead, a matrix of k-mer counts is computed, and split recursively row-wise
using a k-means clustering algorithm (k = 2). This effectively reduces
the time and memory complexity from quadratic to linear, while generally
maintaining comparable accuracy.
If a more accurate tree is required, users can increase the value
of nstart
passed to kmeans
via the ...
argument.
While this can increase computation time, it can improve tree accuracy
considerably.
DNA and amino acid sequences can be passed to the function either as
a list of non-aligned sequences or a matrix of aligned sequences,
preferably in the "DNAbin" or "AAbin" raw-byte format
(Paradis et al 2004, 2012; see the ape
package
documentation for more information on these S3 classes).
Character sequences are supported; however ambiguity codes may
not be recognized or treated appropriately, since raw ambiguity
codes are counted according to their underlying residue frequencies
(e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT"
and 0.5 to that of "ACGGT").
To minimize computation time when counting longer k-mers (k > 3),
amino acid sequences in the raw "AAbin" format are automatically
compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004).
Note that amino acid sequences will not be compressed if they
are supplied as a list of character vectors rather than an "AAbin"
object, in which case the k-mer length should be reduced
(k < 4) to avoid excessive memory use and computation time.