cluster: Divisive k-means clustering.

Description

This function recursively splits a sequence set into smaller and smaller subsets, returning a "dendrogram" object.

Usage

cluster(x, k = 5, residues = NULL, gap = "-", ...)

Arguments

x

a list or matrix of sequences, possibly an object of class "DNAbin" or "AAbin".

k

integer giving the k-mer size. Defaults to 5.

residues

either NULL (default; emitted residues are automatically detected from the sequences), a case-sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Specifying the residue alphabet is therefore recommended unless the sequence list is a "DNAbin" or "AAbin" object.

gap

the character used to represent gaps in the alignment matrix (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.

...

further arguments to be passed to kmeans (not including centers).

Value

Returns an object of class "dendrogram".

Details

This function creates a tree by successively splitting the dataset into smaller and smaller subsets (recursive partitioning). This is a divisive, or "top-down", approach to tree-building, as opposed to agglomerative "bottom-up" methods such as neighbor joining and UPGMA. It is particularly useful for large datasets with many sequences (n > 10,000), since the need to compute a large n * n distance matrix is circumvented. Instead, a matrix of k-mer counts is computed and split recursively row-wise using k-means clustering with two centers (this number of clusters is unrelated to the k-mer size argument k). This effectively reduces the time and memory complexity from quadratic to linear in the number of sequences, while generally maintaining comparable accuracy.
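The recursive partitioning described above can be sketched in base R. This is an illustration only, not the package's implementation: count_kmers() and split_recursive() are hypothetical helper names, the sequences are made up, and the result is a plain nested list rather than a "dendrogram" object.

```r
## Illustrative sketch only (base R); NOT the kmer package's code.
## count_kmers() and split_recursive() are hypothetical helpers.

## Count overlapping k-mers in one character string over a fixed alphabet.
count_kmers <- function(s, k, alphabet = c("A", "C", "G", "T")) {
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_kmers <- apply(grid, 1, paste, collapse = "")
  n <- nchar(s) - k + 1
  found <- if (n > 0) substring(s, seq_len(n), seq_len(n) + k - 1) else character(0)
  as.vector(table(factor(found, levels = all_kmers)))
}

## Split the rows of a count matrix in two with kmeans(), then recurse.
## Returns a nested list whose leaves are row indices.
split_recursive <- function(counts, ids = seq_len(nrow(counts)), ...) {
  if (length(ids) < 2) return(ids)
  ## two rows: split directly (the default kmeans algorithm needs more rows than centers)
  if (length(ids) == 2) return(list(ids[1], ids[2]))
  sub <- counts[ids, , drop = FALSE]
  ## identical rows cannot be separated; keep them together as one leaf
  if (nrow(unique(sub)) < 2) return(ids)
  km <- kmeans(sub, centers = 2, ...)
  list(split_recursive(counts, ids[km$cluster == 1], ...),
       split_recursive(counts, ids[km$cluster == 2], ...))
}

## Toy input: two AC-rich and two GT-rich sequences (made-up data).
seqs <- c(a = "ACACACACACAC", b = "ACACACACACAT",
          c = "GTGTGTGTGTGT", d = "GTGTGTGTGTGA")
counts <- t(sapply(seqs, count_kmers, k = 2))  # one row per sequence, 4^2 columns
set.seed(1)
tree <- split_recursive(counts, nstart = 5)
str(tree)
```

The count matrix has n rows and one column per possible k-mer, so its size grows linearly with the number of sequences, whereas a pairwise distance matrix grows with n * n.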

If a more accurate tree is required, users can increase the value of nstart passed to kmeans via the ... argument. While this can increase computation time, it can improve tree accuracy considerably.
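The effect of nstart can be seen on kmeans directly. The sketch below uses simulated two-dimensional data (an assumption made purely for illustration, not k-mer counts) and compares the total within-cluster sum of squares for one random start against the best of twenty.

```r
## Illustrative sketch with simulated data; not taken from the kmer package.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

## One random start
set.seed(42)
fit1 <- kmeans(x, centers = 2, nstart = 1)

## Twenty random starts; kmeans() keeps the solution with the lowest
## total within-cluster sum of squares
set.seed(42)
fit20 <- kmeans(x, centers = 2, nstart = 20)

c(nstart1 = fit1$tot.withinss, nstart20 = fit20$tot.withinss)
```

With cluster, the same argument is forwarded through ..., as in the cluster(woodmouse, nstart = 5) call in the Examples section. Each additional start repeats the k-means fit at every split, so run time grows roughly in proportion to nstart.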

DNA and amino acid sequences can be passed to the function either as a list of non-aligned sequences or a matrix of aligned sequences, preferably in the "DNAbin" or "AAbin" raw-byte format (Paradis et al. 2004; Paradis 2012; see the ape package documentation for more information on these S3 classes). Character sequences are supported; however, ambiguity codes may not be recognized or treated appropriately. In the raw-byte formats, by contrast, ambiguity codes are counted according to their underlying residue frequencies (e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT" and 0.5 to that of "ACGGT").
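The fractional counting of ambiguity codes can be illustrated with a short base R sketch. It assumes that each ambiguous k-mer is split equally among the unambiguous k-mers it could represent, which matches the "ACRGT" example above; expand_kmer() and the iupac lookup are hypothetical and not part of the package.

```r
## Illustrative sketch only; expand_kmer() is a hypothetical helper.
## Standard IUPAC nucleotide codes and the residues they stand for.
iupac <- list(A = "A", C = "C", G = "G", T = "T",
              R = c("A", "G"), Y = c("C", "T"),
              S = c("C", "G"), W = c("A", "T"),
              K = c("G", "T"), M = c("A", "C"),
              B = c("C", "G", "T"), D = c("A", "G", "T"),
              H = c("A", "C", "T"), V = c("A", "C", "G"),
              N = c("A", "C", "G", "T"))

## Return the unambiguous k-mers a k-mer could represent, each with an
## equal share of one count (assumption: equal weighting).
expand_kmer <- function(kmer) {
  chars <- strsplit(kmer, "")[[1]]
  combos <- expand.grid(unname(iupac[chars]), stringsAsFactors = FALSE)
  resolved <- apply(combos, 1, paste, collapse = "")
  setNames(rep(1 / length(resolved), length(resolved)), resolved)
}

w <- expand_kmer("ACRGT")
w  # "ACAGT" and "ACGGT" each receive a weight of 0.5
```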

To minimize computation time when counting longer k-mers (k > 3), amino acid sequences in the raw "AAbin" format are automatically compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004). Note that amino acid sequences will not be compressed if they are supplied as a list of character vectors rather than an "AAbin" object, in which case the k-mer length should be reduced (k < 4) to avoid excessive memory use and computation time.
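To show why compression matters, the sketch below maps amino acids onto the six Dayhoff classes and compares the number of possible 5-mers before and after. The single-letter class labels A to F and the compress() helper are assumptions made for illustration; the package's internal encoding may differ.

```r
## Illustrative sketch only; compress() is a hypothetical helper and the
## class labels "A".."F" are arbitrary.
## Dayhoff-6 classes: AGPST, C, DENQ, FWY, HKR, ILMV
dayhoff6 <- c(A = "A", G = "A", P = "A", S = "A", T = "A",
              C = "B",
              D = "C", E = "C", N = "C", Q = "C",
              F = "D", W = "D", Y = "D",
              H = "E", K = "E", R = "E",
              I = "F", L = "F", M = "F", V = "F")

## Recode a protein string into the 6-letter alphabet.
## Characters outside the 20 standard residues are not handled here.
compress <- function(s) {
  chars <- strsplit(s, "")[[1]]
  paste(dayhoff6[chars], collapse = "")
}

compressed <- compress("MKVLAAGIW")
compressed

## Number of columns a 5-mer count matrix would need
c(full = 20^5, dayhoff6 = 6^5)
```

At k = 5 the full amino acid alphabet gives 3,200,000 possible k-mers per sequence, against 7,776 for the compressed alphabet, which is why shorter k-mers are advised for uncompressed character input.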

References

Edgar RC (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32, 380-385.

Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289-290.

Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.

Examples

# NOT RUN {
## Cluster the woodmouse dataset (ape package)
library(ape)
data(woodmouse)
## trim gappy ends to subset global alignment
woodmouse <- woodmouse[, apply(woodmouse, 2, function(v) !any(v == 0xf0))]
## build tree divisively
suppressWarnings(RNGversion("3.5.0"))
set.seed(999)
woodmouse.tree <- cluster(woodmouse, nstart = 5)
## plot tree
op <- par(no.readonly = TRUE)
par(mar = c(5, 2, 4, 8) + 0.1)
plot(woodmouse.tree, main = "Woodmouse phylogeny", horiz = TRUE)
par(op)
# }