A binomial mixture model can be used to describe the distribution of gene clusters across
genomes in a pan-genome. The idea and the details of the computations are given in Hogg et al. (2007),
Snipen et al. (2009) and Snipen & Ussery (2012).
Central to the concept is the idea that every gene has a detection probability, i.e. a probability of
being present in a genome. Genes that are present in all genomes are called core genes, and these
should have a detection probability of 1.0. Other genes are present in only a subset of the genomes, and
these have smaller detection probabilities. Genes present in only a single genome are denoted
ORFan genes, and an unknown number of genes have yet to be observed. If the number of genomes investigated
is large, these unobserved genes must have a very small detection probability.
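The idea of detection probabilities can be illustrated with a minimal sketch in Python. It is not the package's implementation: the function names, the three mixing proportions and the three detection probabilities below are hypothetical values chosen only for illustration. It computes the probability that a gene cluster is found in exactly y out of G genomes under a K-component binomial mixture.

```python
from math import comb


def binomial_pmf(y, n_genomes, p):
    """Probability that a gene with detection probability p is seen in y of n_genomes."""
    return comb(n_genomes, y) * p**y * (1.0 - p) ** (n_genomes - y)


def mixture_pmf(y, n_genomes, mix_props, detect_probs):
    """Weighted sum of binomial probabilities, one term per mixture component."""
    return sum(
        w * binomial_pmf(y, n_genomes, p)
        for w, p in zip(mix_props, detect_probs)
    )


# Hypothetical 3-component model: core genes, intermediate genes, rare genes.
mix_props = [0.2, 0.3, 0.5]      # assumed mixing proportions, sum to 1
detect_probs = [1.0, 0.5, 0.05]  # assumed detection probabilities
n_genomes = 10

# Distribution over y = 0, 1, ..., n_genomes; y = 0 corresponds to unobserved genes.
dist = [mixture_pmf(y, n_genomes, mix_props, detect_probs)
        for y in range(n_genomes + 1)]
```

In this sketch the component with detection probability 1.0 puts all of its mass on y = G, while the component with probability 0.05 puts most of its mass on y = 0 and y = 1, i.e. on unobserved and ORFan genes.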
A binomial mixture model with K components estimates K detection probabilities from the
data. The more components you choose, the better you can fit the (present) data, at the cost of less
precision in the estimates due to fewer degrees of freedom. binomixEstimate allows you to
fit several models, and the input K.range specifies which values of K to try out. There is no
real point in using K less than 3, and the default is K.range=3:5. In general, the more genomes
you have, the larger you can choose K without overfitting. Computations will be slower for larger
values of K. In order to choose the optimal value for K, binomixEstimate
computes the BIC criterion, see below.
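The trade-off between fit and number of components can be sketched with the standard BIC formula, where smaller values are better. The log-likelihoods below are made-up numbers, and the parameter count of 2K - 2 (K - 1 free mixing proportions plus K - 1 free detection probabilities, with the core probability held fixed) is an assumption for illustration; the package may count parameters differently.

```python
from math import log


def bic(log_likelihood, n_params, n_clusters):
    """Standard BIC: -2 * log-likelihood plus a penalty that grows with model size."""
    return -2.0 * log_likelihood + n_params * log(n_clusters)


# Hypothetical fits for K = 3, 4, 5: K -> (log-likelihood, assumed free parameters).
fits = {3: (-5230.0, 4), 4: (-5215.0, 6), 5: (-5213.0, 8)}
n_clusters = 4000  # hypothetical number of observed gene clusters

scores = {k: bic(ll, n_par, n_clusters) for k, (ll, n_par) in fits.items()}
best_k = min(scores, key=scores.get)  # the K with the smallest BIC
```

With these made-up numbers, going from K = 4 to K = 5 barely improves the log-likelihood, so the penalty term dominates and the smaller model is preferred.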
As the number of genomes grows, we tend to observe an increasing number of gene clusters. Once a
K-component binomial mixture has been fitted, we can estimate the number of gene clusters not yet
observed, and thereby the pan-genome size. Also, as the number of genomes grows, we tend to observe fewer
core genes. The fitted binomial mixture model also gives an estimate of the final number of core gene
clusters, i.e. those still left after having observed ‘infinitely’ many genomes.
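One way such estimates can be derived from a fitted mixture is sketched below. It assumes that the mixing proportions describe the whole pan-genome, including clusters not yet observed, so that the observed clusters are those with at least one detection. The function name and all numbers are hypothetical, and the package's own computation may differ in its details.

```python
def estimate_sizes(n_observed, n_genomes, mix_props, detect_probs,
                   core_detect_prob=1.0):
    """Return (pan-genome size, core size) under the stated assumptions."""
    # Probability that a random gene cluster is missed in all n_genomes.
    p_unseen = sum(
        w * (1.0 - p) ** n_genomes
        for w, p in zip(mix_props, detect_probs)
    )
    # Observed clusters are the fraction (1 - p_unseen) of the pan-genome.
    pan_size = n_observed / (1.0 - p_unseen)
    # Core clusters belong to the component(s) at the core detection probability.
    core_fraction = sum(
        w for w, p in zip(mix_props, detect_probs) if p >= core_detect_prob
    )
    return pan_size, pan_size * core_fraction


# Hypothetical fitted model and data.
pan_size, core_size = estimate_sizes(
    n_observed=4000,
    n_genomes=10,
    mix_props=[0.2, 0.3, 0.5],
    detect_probs=[1.0, 0.5, 0.05],
)
```

The estimated pan-genome size is always at least the number of observed clusters, and the gap between the two is the estimated number of clusters not yet seen.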
The detection probability of core genes should be 1.0, but can at times be set fractionally smaller.
This means you accept that even core genes are not always detected in every genome, e.g. they may be
there, but your gene prediction has missed them. Note that setting core.detect.prob to less
than 1.0 may affect the estimated number of core genes dramatically.