IdClusters(myDistMatrix = NULL, method = "UPGMA", cutoff = -Inf, showPlot = FALSE, asDendrogram = FALSE, myXStringSet = NULL, model = MODELS, processors = 1, verbose = TRUE)
NULL
if method
is "inexact"
.
"complete"
, "single"
, "UPGMA"
, "WPGMA"
, "NJ"
, "ML"
, or "inexact"
. (See details section below.)
asDendrogram=TRUE
or showPlot=TRUE
then only one cutoff may be specified. (See details section below.)
method='inexact'
.
TRUE
then the object returned is of class dendrogram
. Not applicable if method='inexact'
.
method
is "ML"
, the DNAStringSet
or RNAStringSet
used in the creation of myDistMatrix
. If method
is "inexact"
, the DNAStringSet
, RNAStringSet
, or AAStringSet
to cluster. Not applicable for other methods.
MODELS
of DNA evolution. Only applicable if method
is "ML"
.
NULL
to automatically detect and use all available processors.
asDendrogram=FALSE
(the default), then a data.frame is returned with a column for each cutoff specified. This data.frame has dimensions $N*M$, where each one of $N$ sequences is assigned to a cluster at the $M$-level of cutoff. The row.names of the data.frame correspond to the dimnames of myDistMatrix
.
If asDendrogram=TRUE
, returns an object of class dendrogram
that can be used for further manipulation and plotting. Leaves of the dendrogram are randomly colored by cluster number.
IdClusters
groups the input sequences into clusters using a set dissimilarities representing the distance between $N$ sequences. Initially a phylogenetic tree is formed using the specified method
. Then each leaf (sequence) of the tree is assigned to a cluster based on its edge lengths to the other sequences. The available clustering methods are described as follows:Ultrametric methods: The method complete
assigns clusters using complete-linkage so that sequences in the same cluster are no more than cutoff
percent apart. The method single
assigns clusters using single-linkage so that sequences in the same cluster are within cutoff
of at least one other sequence in the same cluster. UPGMA
(the default) or WPGMA
assign clusters using average-linkage which is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to connect distant relatives that do not appear to be closely related. UPGMA
produces an unweighted tree, where each leaf contributes equally to the average edge lengths, whereas WPGMA
produces a weighted result.
Additive methods: NJ
uses the Neighbor-Joining method proposed by Saitou and Nei that does not assume lineages evolve at the same rate (the molecular clock hypothesis). The NJ
method is typically the most phylogenetically accurate of the above distance-based methods. ML
creates a neighbor-joining tree and then iteratively maximizes the likelihood of the tree given the aligned sequences (myXStringSet
). This is accomplished through a combination of optimizing edge lengths with Brent's method and improving tree topology with nearest-neighbor interchanges (NNIs). When method="ML"
, one or more MODELS
of DNA evolution must be specified. Model parameters are iteratively optimized to maximize likelihood, except base frequencies which are empirically determined. If multiple model
s are given, the best model
is automatically chosen based on BIC calculated from the likelihood and the sample size (defined as the number of variable sites in the DNA sequence).
Sequence-only method: inexact
uses a heuristic algorithm to directly assign sequences to clusters without a distance matrix. First the sequences are ordered by length and the longest sequence becomes the first cluster seed. If the second sequence is less than cutoff
percent distance then it is added to the cluster, otherwise it becomes a new cluster representative. The remaining sequences are matched to cluster representatives based on their k-mer distribution and then aligned to find the closest sequence. This approach is repeated until all sequences belong to a cluster. In the vast majority of cases, this process results in clusters with members separated by less than cutoff
distance, where distance is defined as the percent dissimilarity between the overlapping region of a ``glocal'' alignment.
Multiple cutoffs may be provided if they are in increasing or decreasing order. If cutoff
s are provided in descending order then clustering at each new value of cutoff
is continued within the prior cutoff
's clusters. In this way clusters at lower values of cutoff
are completely contained within their umbrella clusters at higher values of cutoff
. This is useful for defining taxonomy, where lower level groups (e.g., genera) are expected not to straddle multiple higher level groups (e.g., families). If multiple cutoffs are provided in ascending order then clustering at each level of cutoff
is independent of the prior level. This may result in fewer high-level clusters for NJ
and ML
methods, but will have no impact on ultrametric methods. Providing cutoff
s in descending order makes inexact
clustering faster, but has negligible impact on the other method
s.
Ghodsi, M., Liu, B., & Pop, M. (2011) DNACLUST. BMC Bioinformatics, 12(1), 271. doi:10.1186/1471-2105-12-271.
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406-425.
DistanceMatrix
, Add2DB
, MODELS
# using the matrix from the original paper by Saitou and Nei
m <- matrix(0,8,8)
m[2:8,1] <- c(7, 8, 11, 13, 16, 13, 17)
m[3:8,2] <- c(5, 8, 10, 13, 10, 14)
m[4:8,3] <- c(5, 7, 10, 7, 11)
m[5:8,4] <- c(8, 11, 8, 12)
m[6:8,5] <- c(5, 6, 10)
m[7:8,6] <- c(9, 13)
m[8,7] <- c(8)
# returns an object of class "dendrogram"
myClusters <- IdClusters(m, cutoff=10, method="NJ", showPlot=TRUE, asDendrogram=TRUE)
# example of specifying multiple cutoffs
IdClusters(m, cutoff=c(2,6,10,20)) # returns a data frame
# example of 'inexact' clustering
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
IdClusters(myXStringSet=dna, method="inexact", cutoff=0.05)
Run the code above in your browser using DataLab