binomixEstimate(pan.matrix,K.range=3:5,core.detect.prob=1.0,verbose=TRUE)
Panmat
object, see panMatrix
for details.binomixEstimate
returns a Binomix
object, which is a small (S3) extension of a list
with two components. These two components are named BIC.table and Mix.list.The BIC.table is a matrix listing, in each row, the results for each number of components used, given by the input K.range. The column Core.size is the estimated number of core gene families, the column Pan.size is the estimated pan-genome size. The column BIC is the Bayesian Information Criterion (Schwarz, 1978) that should be used to choose the optimal value for K. The number of components where BIC is minimized is the optimal. If minimum BIC is reached for the largest K value you should extend the K.range and re-fit. The function will issue a warning
to remind you of this.The Mix.list is a list with one element for each number of components tested. The content of each Mix.list element is a matrix describing one particular fitted binomial mixture model. A fitted model is characterized by two vectors (rows) denoted Detect.prob and Mixing.prop. Detect.prob are the estimated detection probabilities, sorted in ascending order. The Mixing.prop are the corresponding mixing proportions. A mixing proportion is the proportion of the gene clusters having the corresponding detection probability.The generic functions plot.Binomix
, summary.Binomix
and str.Binomix
are available for Binomix
objects.
binomixEstimate
allows you to fit several models, and the input K.range specifies which values of K to try out. There no real point using K less than 3, and the default is K.range=3:5. In general, the more genomes you have the larger you can choose K without overfitting. Computations will be slower for larger values of K. In order to choose the optimal value for K, binomixEstimate
computes the BIC-criterion, see below.
As the number of genomes grow, we tend to observe an increasing number of gene clusters. Once a K-component binomial mixture has been fitted, we can estimate the number of gene clusters not yet observed, and thereby the pan-genome size. Also, as the number of genomes grows we tend to observe fewer core genes. The fitted binomial mixture model also gives an estimate of the final number of core gene clusters, i.e. those still left after having observed infinite many genomes.
The detection probability of core genes should be 1.0, but can at times be set fractionally smaller. This means you accept that even core genes are not always detected in every genome, e.g. they may be there, but your gene prediction has missed them. Notice that setting the core.detect.prob to less than 1.0 may affect the core gene size estimate dramatically.
Snipen, L., Almoy, T., Ussery, D.W. (2009). Microbial comparative pan-genomics using binomial mixture models. BMC Genomics, 10:385.
Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to Escherichia coli. F1000 Research, 1:19. Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461-464.
panMatrix
, chao
, plot.Binomix
, summary.Binomix
, str.Binomix
.
# Loading a Panmat object in the micropan package
data(list="Mpneumoniae.blast.panmat",package="micropan")
# Estimating binomial mixture models
bino <- binomixEstimate(Mpneumoniae.blast.panmat,K.range=3:8) # using 3,4,...,8 components
print(bino$BIC.table) # minimum BIC at 3 components
# Plotting the optimal model, and printing the summary
plot(bino)
summary(bino)
# Plotting the 8-component model as well
plot(bino,ncomp=8) # clearly overfitted, we do not need this many sectors
# Plotting the distribution in a single genome
plot(bino,type="single") # completely dominated by core genes
Run the code above in your browser using DataLab