binomixEstimate(pan.matrix, K.range = 3:5, core.detect.prob = 1, verbose = TRUE)
Panmat
object, see panMatrix
for details.binomixEstimate
returns a Binomix
object, which is a small (S3) extension
of a list
with two components. These two components are named BIC.table and Mix.list.The BIC.table is a matrix listing, in each row, the results for each number of components used,
given by the input K.range. The column Core.size is the estimated number of core gene families,
the column Pan.size is the estimated pan-genome size. The column BIC is the Bayesian
Information Criterion (Schwarz, 1978) that should be used to choose the optimal value for K.
The number of components where BIC is minimized is the optimal. If minimum BIC is reached
for the largest K value you should extend the K.range and re-fit. The function will issue
a warning
to remind you of this.The Mix.list is a list with one element for each number of components tested. The content of each
Mix.list element is a matrix describing one particular fitted binomial mixture model. A fitted model
is characterized by two vectors (rows) denoted Detect.prob and Mixing.prop. Detect.prob
are the estimated detection probabilities, sorted in ascending order. The Mixing.prop are the
corresponding mixing proportions. A mixing proportion is the proportion of the gene clusters having the
corresponding detection probability.The generic functions plot.Binomix
and summary.Binomix
are available for Binomix
objects.
Central to the concept is the idea that every gene has a detection probability, i.e. a probability of being present in a genome. Genes who are always present in all genomes are called core genes, and these should have a detection probability of 1.0. Other genes are only present in a subset of the genomes, and these have smaller detection probabilities. Some genes are only present in one single genome, denoted ORFan genes, and an unknown number of genes have yet to be observed. If the number of genomes investigated is large these latter must have a very small detection probability.
A binomial mixture model with K components estimates K detection probabilities from the
data. The more components you choose, the better you can fit the (present) data, at the cost of less
precision in the estimates due to less degrees of freedom. binomixEstimate
allows you to
fit several models, and the input K.range specifies which values of K to try out. There no
real point using K less than 3, and the default is K.range=3:5. In general, the more genomes
you have the larger you can choose K without overfitting. Computations will be slower for larger
values of K. In order to choose the optimal value for K, binomixEstimate
computes the BIC-criterion, see below.
As the number of genomes grow, we tend to observe an increasing number of gene clusters. Once a K-component binomial mixture has been fitted, we can estimate the number of gene clusters not yet observed, and thereby the pan-genome size. Also, as the number of genomes grows we tend to observe fewer core genes. The fitted binomial mixture model also gives an estimate of the final number of core gene clusters, i.e. those still left after having observed infinite many genomes.
The detection probability of core genes should be 1.0, but can at times be set fractionally smaller. This means you accept that even core genes are not always detected in every genome, e.g. they may be there, but your gene prediction has missed them. Notice that setting the core.detect.prob to less than 1.0 may affect the core gene size estimate dramatically.
Snipen, L., Almoy, T., Ussery, D.W. (2009). Microbial comparative pan-genomics using binomial mixture models. BMC Genomics, 10:385.
Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to Escherichia coli. F1000 Research, 1:19.
Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461-464.
panMatrix
, chao
, plot.Binomix
,
summary.Binomix
.
# Loading a Panmat object in the micropan package
data(list="Mpneumoniae.blast.panmat",package="micropan")
# Estimating binomial mixture models
bino <- binomixEstimate(Mpneumoniae.blast.panmat,K.range=3:8) # using 3,4,...,8 components
print(bino$BIC.table) # minimum BIC at 3 components
# Plotting the optimal model, and printing the summary
plot(bino)
summary(bino)
# Plotting the 8-component model as well
plot(bino,ncomp=8) # clearly overfitted, we do not need this many sectors
# Plotting the distribution in a single genome
plot(bino,type="single") # completely dominated by core genes
Run the code above in your browser using DataLab