optCluster performs statistical and/or biological validation of
clustering results and determines the optimal clustering algorithm and
number of clusters through rank aggregation. The function returns an
object of class "optCluster".
optCluster(obj, nClust, clMethods = c("clara", "diana", "hierarchical",
"kmeans", "model", "pam", "som", "sota"), countData = FALSE,
validation = c("internal", "stability"), hierMethod = "average",
annotation = NULL, clVerbose = FALSE, rankMethod = "CE",
distance = "Spearman", importance = NULL, rankVerbose = FALSE, ...)
obj
- The dataset to be evaluated as either a data frame, a numeric matrix, or an ExpressionSet object. Items to be clustered must be the rows of the data. In the case of data frames, all columns must be numeric.
nClust
- A numeric vector providing the range of clusters to be evaluated (e.g. to evaluate the number of clusters ranging from 2 to 4, input 2:4). A single number can also be provided.
clMethods
- A character vector providing the names of the clustering algorithms to be used. The available algorithms are: "agnes", "clara", "diana", "fanny", "hierarchical", "kmeans", "model", "pam", "som", "sota", "em.nbinom", "da.nbinom", "sa.nbinom", "em.poisson", "da.poisson", "sa.poisson". Any number of selected methods is allowed. The option "all" may also be used, but with some caution; see Clustering Algorithms in the `Details' section for more information.
countData
- A logical argument indicating whether the data is count based or not. Can also be used in conjunction with the "all" option for the 'clMethods' argument. If TRUE and 'clMethods' = "all", all of the clustering algorithms for count data are selected: "em.nbinom", "da.nbinom", "sa.nbinom", "em.poisson", "da.poisson", "sa.poisson". If FALSE and 'clMethods' = "all", all of the relevant clustering algorithms used with continuous data are selected: "agnes", "clara", "diana", "fanny", "hierarchical", "kmeans", "model", "pam", "som", "sota".
validation
- A character vector providing the names of the types of validation measures to be used. The options of "internal", "stability", "biological", and "all" are available. Any number or combination of choices is allowed.
hierMethod
- A character string providing the agglomeration method to be used by the hierarchical clustering options (hclust and agnes). Available choices are "average", "complete", "single", and "ward".
annotation
- Used in biological validation. Either a character string providing the name of the Bioconductor annotation package for mapping genes to GO categories, or the names of each functional class and the observations that belong to them in either a list or logical matrix format.
clVerbose
- If TRUE, the progress of cluster validation will be produced as output.
rankMethod
- A character string providing the method to be used for rank aggregation. The two options are the cross-entropy Monte Carlo algorithm ("CE") or Genetic algorithm ("GA"). Selection of only one method is allowed.
distance
- A character string providing the type of distance to be used for measuring the similarity between ordered lists in rank aggregation. The two available methods are the weighted Spearman footrule distance ("Spearman") or the weighted Kendall's tau distance ("Kendall"). Selection of only one distance is allowed.
importance
- Vector of weights indicating the importance of each validation measure list. Default of NULL represents equal weights to each validation measure. See Weighted Rank Aggregation in the `Details' section for more information.
rankVerbose
- If TRUE, current rank aggregation results are displayed at each iteration.
...
- Additional arguments that can be passed to internal functions of clValid or RankAggreg:
Additional clValid arguments:
metric
- Metric used to determine the distance matrix in validation measures. Possible choices are "euclidean" (default), "correlation", and "manhattan".
neighbSize
- Integer giving neighborhood size used in "connectivity" validation measure.
GOcategory
- For biological validation, a character string specifying which GO category to use. Options are "BP", "MF", "CC", or "all" (default).
goTermFreq
- For BSI validation, the threshold frequency of GO terms to be used for functional annotation.
dropEvidence
- For biological validation, either NULL or a character vector of GO evidence codes to omit.
Additional RankAggreg arguments:
maxIter
- The maximum number of iterations allowed. Default = 1000
k
- Size of top-k list in aggregation.
convIN
- Stopping criterion for the CE and GA algorithms. The algorithm converges once the "best" solution does not change after convIN iterations. Default: 7 for CE and 30 for GA.
N
- Number of samples generated by MCMC in the CE algorithm. Default = 10*k^2
rho
- For CE algorithm, (rho*N) is the quantile of the candidate list sorted by function values.
weight
- For CE algorithm, the learning factor used in the probability update feature. Default = 0.25
popSize
- For GA algorithm, the population size in each generation. Default = 100
CP
- For GA algorithm, the crossover probability. Default = 0.4
MP
- For GA algorithm, the mutation probability. Default = 0.01
optCluster returns an object of class "optCluster". The class description
is provided in the help file.
This function has been created as an extension of the clValid function. In addition to the validation
measures and clustering algorithms available in the clValid function, six clustering algorithms
for count data are included in the optCluster function. This function also determines a
unique solution for the optimal clustering algorithm and number of clusters through rank aggregation of
validation measure lists. A brief description of the available clustering algorithms, validation measures,
and rank aggregation algorithms is provided below. For more details, please refer to the references.
A total of sixteen clustering algorithms are available for cluster analysis.
Ten clustering algorithms for continuous data are available through the internal function clValid:
"agnes", "clara", "diana", "fanny", "hierarchical", "kmeans", "model", "pam", "som", and "sota". NOTE: Some
algorithms (especially Fanny) may have difficulty finding certain numbers of clusters. If warnings or errors are
produced, the offending algorithm(s) should be removed from the clMethods argument.
Six clustering algorithms for count data are available through the MBCluster.Seq package: "em.nbinom", "da.nbinom", "sa.nbinom", "em.poisson", "da.poisson", and "sa.poisson". The expectation maximization (EM) algorithm, and two of its variations, the deterministic annealing (DA) algorithm and the simulated annealing (SA) algorithm, have been proposed for model-based clustering of RNA-Seq count data. These three methods can be based on a mixture of either Poisson distributions or negative binomial distributions. The clustering options for count data reflect both the algorithm and the distribution being used. For example, "da.nbinom" represents the deterministic annealing algorithm based on the negative binomial distribution.
The MBCluster.Seq package uses an adjustment by a normalization factor for these algorithms,
with the default being log(Q3), where Q3 is the 75th percentile. A different
normalization factor can be passed through the optCluster function by using the argument
'Normalizer'.
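As a rough illustration of the default factor described above, log(Q3) can be computed in base R. The matrix `counts` is made-up toy data, and computing the factor column by column (one value per sample) is an assumption that should be checked against the MBCluster.Seq documentation.

```r
# Toy count matrix: 6 genes (rows) x 3 samples (columns); values are illustrative only
counts <- matrix(c(10, 0, 25, 40, 5, 12,
                   20, 3, 30, 55, 8, 15,
                    8, 1, 22, 35, 4, 10),
                 nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("sample", 1:3)))

# 75th percentile of each column, then its log (assumed per-sample normalizer)
q3 <- apply(counts, 2, quantile, probs = 0.75)
normalizer <- log(q3)
print(normalizer)

# A different factor, for example the log library size, could then be supplied
# through the 'Normalizer' argument (not run here; a real dataset is needed):
# optCluster(arabid, 2:4, clMethods = "em.poisson", countData = TRUE,
#            Normalizer = log(colSums(arabid)))
```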
Four stability validation measures are provided: average proportion of non-overlap (APN), average distance (AD), average distance between means (ADM), and figure of merit (FOM). These measures compare the clustering partitions established with the full data to the clustering partitions established while removing each column, one at a time. For each measure, an average is taken over all of the removed columns, which should be minimized.
The APN determines the average proportion of observations placed in different clusters for both cases. The APN measure can range from 0 to 1.
The AD calculates the average distance between the observations assigned to the same cluster for both cases. The AD measure can range between zero and infinity.
The ADM computes the average distance between the centers of clusters for observations put into the same cluster for both cases. ADM values can range between zero and infinity.
The FOM measures the average intra-cluster variance for the observations in the removed column, using clustering partitions from the remaining columns. The FOM values can range between zero and infinity.
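The leave-one-column-out idea behind these measures can be sketched for the APN in base R. This is a minimal illustration, not the clValid implementation: the helper names `cluster_rows` and `apn_sketch` are hypothetical, average-linkage hierarchical clustering is used only because it is deterministic, and the data are simulated.

```r
set.seed(1)
x <- matrix(rnorm(20 * 5), nrow = 20)   # toy data: 20 observations, 5 columns

# Hypothetical helper: cluster the rows of a matrix into k groups
cluster_rows <- function(m, k) cutree(hclust(dist(m), method = "average"), k)

# Hypothetical helper: average proportion of non-overlap
apn_sketch <- function(x, k) {
  full <- cluster_rows(x, k)                       # partition from the full data
  per_column <- sapply(seq_len(ncol(x)), function(l) {
    reduced <- cluster_rows(x[, -l, drop = FALSE], k)  # partition without column l
    overlap <- sapply(seq_len(nrow(x)), function(i) {
      same_full <- full == full[i]                 # cluster of i, full data
      same_reduced <- reduced == reduced[i]        # cluster of i, reduced data
      sum(same_full & same_reduced) / sum(same_full)
    })
    mean(1 - overlap)                              # non-overlap for this column
  })
  mean(per_column)                                 # average over removed columns
}

apn_sketch(x, 3)
```

Values close to 0 indicate that removing a column barely changes the cluster memberships, which is why the measure is minimized.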
The three internal validation measures included are: connectivity, Dunn index, and silhouette width.
Connectivity measures the extent to which neighboring observations are clustered together. With a value ranging between zero and infinity, this validation measure should be minimized.
The Dunn index is the ratio of the minimum distance between observations in different clusters to the maximum cluster diameter. With a value between zero and infinity, this measurement should be maximized.
Silhouette width is defined as the average of each observation's silhouette value. The silhouette value is a measurement of the degree of confidence in an observation's clustering assignment. Values near 1 mean that the observation is clustered well, while values near -1 mean the observation is poorly clustered.
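As an illustration of the internal measures, the Dunn index definition above can be written in a few lines of base R. The function name `dunn_sketch` is hypothetical and the data are a toy example; in practice the dunn function in package clValid would be used.

```r
# Hypothetical helper: Dunn index from a distance object and a cluster vector
dunn_sketch <- function(d, cl) {
  dm <- as.matrix(d)
  same <- outer(cl, cl, "==")            # TRUE where both points share a cluster
  off_diag <- row(dm) != col(dm)         # exclude self-distances
  min_between <- min(dm[!same])          # smallest distance across clusters
  max_diameter <- max(dm[same & off_diag], 0)  # largest within-cluster distance
  min_between / max_diameter
}

# Two well separated groups on a line
pts <- c(0, 1, 2, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
dunn_sketch(dist(pts), cl)   # 8 / 2 = 4
```

Larger values mean the clusters are compact relative to the gaps between them, matching the advice that the index should be maximized.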
The biological homogeneity index (BHI) and the biological stability index (BSI) are the two biological validation measures. They were originally proposed to provide guidance in choosing a clustering technique for microarray data, but can also be used for any other molecular expression data. Both measures have a range of [0,1] and should be maximized.
The BHI evaluates how biologically similar defined clusters are by calculating the average proportion of paired genes that are statistically clustered together and have the same functional class.
The BSI examines the consistency of clustering similar biologically functioning genes together. Observations are removed from the dataset one column at a time and the statistical cluster assignments of genes with the same functional class are compared to the cluster assignments based on the full dataset.
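The BHI description above can be sketched in base R using an annotation supplied in list format. This is only an illustration under stated assumptions: the helper `bhi_sketch`, the gene names, and the functional classes are all hypothetical, and unannotated genes are simply left out of each cluster's pair count. The BHI function in package clValid is the reference implementation.

```r
# Hypothetical helper: average proportion of same-cluster gene pairs that
# share at least one functional class
bhi_sketch <- function(cl, annotation) {
  genes <- names(cl)
  # Logical matrix (genes x classes) built from the list annotation
  member <- sapply(annotation, function(g) genes %in% g)
  annotated <- rowSums(member) > 0
  per_cluster <- sapply(unique(cl), function(k) {
    idx <- which(cl == k & annotated)
    n <- length(idx)
    if (n < 2) return(0)                      # no pairs to compare
    m <- member[idx, , drop = FALSE] * 1      # numeric 0/1 copy
    shared <- tcrossprod(m) > 0               # TRUE if a pair shares a class
    diag(shared) <- FALSE                     # ignore self-pairs
    sum(shared) / (n * (n - 1))
  })
  mean(per_cluster)
}

# Toy cluster assignment and made-up functional classes
cl <- c(g1 = 1, g2 = 1, g3 = 1, g4 = 2, g5 = 2, g6 = 2)
annotation <- list(classA = c("g1", "g2"), classB = c("g3", "g4", "g5"))
bhi_sketch(cl, annotation)
```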
The cross-entropy Monte Carlo algorithm and the Genetic algorithm are the two approaches available for rank aggregation and come from the RankAggreg package. Both rank aggregation algorithms can use either the weighted Spearman footrule distance or the weighted Kendall's tau to measure the "distance" between any two ordered lists.
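To give a feel for what "distance" between two ordered lists means, the sketch below computes the plain, unweighted Spearman footrule and Kendall's tau distances in base R. The weighted versions used by RankAggreg also take the validation scores into account and are not reproduced here; the function names and the example lists are hypothetical.

```r
# Unweighted Spearman footrule: sum of absolute differences in position
spearman_footrule <- function(a, b) {
  sum(abs(seq_along(a) - match(a, b)))
}

# Unweighted Kendall's tau distance: number of item pairs ordered differently
kendall_tau <- function(a, b) {
  pos_b <- match(a, b)                  # positions in b, taken in the order of a
  pairs <- combn(length(a), 2)          # all index pairs i < j
  sum(pos_b[pairs[1, ]] > pos_b[pairs[2, ]])
}

# Two made-up rankings of the same "algorithm-k" combinations
a <- c("kmeans-2", "pam-3", "diana-2", "som-4")
b <- c("pam-3", "kmeans-2", "diana-2", "som-4")

spearman_footrule(a, b)   # 2: the first two items each move one position
kendall_tau(a, b)         # 1: a single pair is swapped
```

Both functions assume the two lists contain exactly the same items; rank aggregation then searches for the list that minimizes the total distance to all validation measure lists.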
A list of weights for each validation measure list can be included using the importance
argument. The default value of equal weights (NULL) is
represented by rep(1, length(x)), where x is the character vector of validation measure names. This
means each validation measure list has a weight of 1/length(x).
To manually change the weights, the order of the validation measures selected needs to be known.
The order of validation measures used in optCluster is provided below:
When selected, stability measures will ALWAYS be listed first and in the following order: "APN", "AD", "ADM", "FOM".
When selected, internal measures will only precede biological measures. The order of these measures is: "Connectivity", "Dunn", "Silhouette".
When selected, biological measures will always be listed last and in the following order: "BHI", "BSI".
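The ordering rules above can be turned into a small base R helper for building a weight vector. The function `measure_order` is hypothetical and only mirrors the order stated here; the weights chosen are arbitrary.

```r
# Hypothetical helper: measure names in the order described above
measure_order <- function(validation) {
  if ("all" %in% validation) {
    validation <- c("stability", "internal", "biological")
  }
  c(if ("stability" %in% validation) c("APN", "AD", "ADM", "FOM"),
    if ("internal" %in% validation) c("Connectivity", "Dunn", "Silhouette"),
    if ("biological" %in% validation) c("BHI", "BSI"))
}

# Default validation types of optCluster: internal and stability
measures <- measure_order(c("internal", "stability"))

# Start from equal weights, then give silhouette width twice the weight
importance <- setNames(rep(1, length(measures)), measures)
importance["Silhouette"] <- 2
print(importance)

# Relative share of each measure implied by these weights
print(round(importance / sum(importance), 3))

# The unnamed vector could then be passed on (not run here):
# optCluster(obj, 2:4, clMethods = "all", importance = unname(importance))
```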
Sekula, M., Datta, S., and Datta, S. (2017). optCluster: An R package for determining the optimal clustering algorithm. Bioinformation, 13(3), 101. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5450252
Brock, G., Pihur, V., Datta, S. and Datta, S. (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software 25(4), https://www.jstatsoft.org/v25/i04.
Datta, S. and Datta, S. (2003). Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4): 459-466.
Pihur, V., Datta, S. and Datta, S. (2007). Weighted rank aggregation of cluster validation measures: A Monte Carlo cross-entropy approach. Bioinformatics 23(13): 1607-1615.
Pihur, V., Datta, S. and Datta, S. (2009). RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics, 10:62, https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-62.
Si, Y., Liu, P., Li, P., & Brutnell, T. (2014). Model-based clustering for RNA-seq data. Bioinformatics 30(2): 197-205.
For a description of the clValid function, including all available arguments that can be
passed to it, see clValid in the clValid package.
For a description of the RankAggreg function, including all available arguments that can be
passed to it, see RankAggreg in the RankAggreg package.
For details on the clustering algorithm functions for continuous data see
agnes, clara, diana, fanny, and pam in package cluster,
hclust and kmeans in package stats,
som in package kohonen,
Mclust in package mclust,
and sota in package clValid.
For details on the clustering algorithm functions for count data see
Cluster.RNASeq in package MBCluster.Seq.
For details on the validation measure functions see
BHI, BSI, stability, connectivity and dunn in package clValid
and silhouette in package cluster.
# NOT RUN {
## These examples may each take a few minutes to compute
# }
# NOT RUN {
## Obtain Dataset
data(arabid)
## Analysis of Count Data using Internal and Stability Validation Measures
count1 <- optCluster(arabid, 2:4, clMethods = "all", countData = TRUE)
summary(count1)
# Obtain optimal clustering assignment
optAssign(count1)
## Normalize Data with Respect to Library Size
obj <- t(t(arabid)/colSums(arabid))
## Analysis of Normalized Data using Internal and Stability Validation Measures
norm1 <- optCluster(obj, 2:4, clMethods = "all")
summary(norm1)
# Obtain optimal clustering assignment
optAssign(norm1)
#Obtain clustering assignment for diana with 2 clusters
clusterResults(norm1, "diana", k = 2)$cluster
## Analysis with Only UPGMA using Internal and Stability Validation Measures
hier1 <- optCluster(obj, 2:10, clMethods = "hierarchical")
summary(hier1)
## Analysis of Normalized Data using All Validation Measures
## Note: These lines of code require the following Bioconductor
## packages for the biological validation measures:
## "Biobase", "annotate", "GO.db", and "org.At.tair.db".
## If all of these packages are installed, then set
## allBioconductorPackagesInstalled = TRUE
allBioconductorPackagesInstalled = FALSE
if(allBioconductorPackagesInstalled){
require("Biobase")
require("annotate")
require("GO.db")
require("org.At.tair.db")
norm2 <- optCluster(obj, 2:4, clMethods = "all", validation = "all",
annotation = "org.At.tair.db")
summary(norm2)
}
# }