binomixEstimate: Binomial mixture model estimates

Description

Fits binomial mixture models to the data given as a pan-matrix. From the fitted models both estimates of pan-genome size and core-genome size are available.

Usage

binomixEstimate(
  pan.matrix,
  K.range = 3:5,
  core.detect.prob = 1,
  verbose = TRUE
)

Arguments

pan.matrix

A pan-matrix, see panMatrix for details.

K.range

The range of model complexities to explore. The vector of integers specify the number of binomial densities to combine in the mixture models.

core.detect.prob

The detection probability of core genes. This should almost always be 1.0, since a core gene is by definition always present in all genomes, but can be set fractionally smaller.

verbose

Logical indicating if textual output should be given to monitor the progress of the computations.

Value

binomixEstimate returns a list with two components, the BIC.tbl and Mix.tbl.

The BIC.tbl is a tibble listing, in each row, the results for each number of components used, given by the input K.range. The column Core.size is the estimated number of core gene families, the column Pan.size is the estimated pan-genome size. The column BIC is the Bayesian Information Criterion (Schwarz, 1978) that should be used to choose the optimal component number (K). The number of components where BIC is minimized is the optimum. If minimum BIC is reached for the largest K value you should extend the K.range to larger values and re-fit. The function will issue a warning to remind you of this.

The Mix.tbl is a tibble with estimates from the mixture models. The column Component indicates the model, i.e. all rows where Component has the same value are from the same model. There will be 3 rows for 3-component model, 4 rows for 4-component, etc. The column Detection.prob contain the estimated detection probabilities for each component of the mixture models. A Mixing.proportion is the proportion of the gene clusters having the corresponding Detection.prob, i.e. if core genes have Detection.prob 1.0, the corresponding Mixing.proportion (same row) indicates how large fraction of the gene families are core genes.

Details

A binomial mixture model can be used to describe the distribution of gene clusters across genomes in a pan-genome. The idea and the details of the computations are given in Hogg et al (2007), Snipen et al (2009) and Snipen & Ussery (2012).

Central to the concept is the idea that every gene has a detection probability, i.e. a probability of being present in a genome. Genes who are always present in all genomes are called core genes, and these should have a detection probability of 1.0. Other genes are only present in a subset of the genomes, and these have smaller detection probabilities. Some genes are only present in one single genome, denoted ORFan genes, and an unknown number of genes have yet to be observed. If the number of genomes investigated is large these latter must have a very small detection probability.

A binomial mixture model with K components estimates K detection probabilities from the data. The more components you choose, the better you can fit the (present) data, at the cost of less precision in the estimates due to less degrees of freedom. binomixEstimate allows you to fit several models, and the input K.range specifies which values of K to try out. There no real point using K less than 3, and the default is K.range=3:5. In general, the more genomes you have the larger you can choose K without overfitting. Computations will be slower for larger values of K. In order to choose the optimal value for K, binomixEstimate computes the BIC-criterion, see below.

As the number of genomes grow, we tend to observe an increasing number of gene clusters. Once a K-component binomial mixture has been fitted, we can estimate the number of gene clusters not yet observed, and thereby the pan-genome size. Also, as the number of genomes grows we tend to observe fewer core genes. The fitted binomial mixture model also gives an estimate of the final number of core gene clusters, i.e. those still left after having observed ‘infinite’ many genomes.

The detection probability of core genes should be 1.0, but can at times be set fractionally smaller. This means you accept that even core genes are not always detected in every genome, e.g. they may be there, but your gene prediction has missed them. Notice that setting the core.detect.prob to less than 1.0 may affect the core gene size estimate dramatically.

References

Hogg, J.S., Hu, F.Z, Janto, B., Boissy, R., Hayes, J., Keefe, R., Post, J.C., Ehrlich, G.D. (2007). Characterization and modeling of the Haemophilus influenzae core- and supra-genomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biology, 8:R103.

Snipen, L., Almoy, T., Ussery, D.W. (2009). Microbial comparative pan-genomics using binomial mixture models. BMC Genomics, 10:385.

Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to Escherichia coli. F1000 Research, 1:19.

Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461-464.

Examples

Run this code

# NOT RUN {
# Loading an example pan-matrix
data(xmpl.panmat)

# Estimating binomial mixture models
binmix.lst <- binomixEstimate(xmpl.panmat, K.range = 3:8)
print(binmix.lst$BIC.tbl) # minimum BIC at 3 components

# }
# NOT RUN {
# The pan-genome gene distribution as a pie-chart
library(ggplot2)
ncomp <- 3
binmix.lst$Mix.tbl %>% 
  filter(Components == ncomp) %>% 
  ggplot() +
  geom_col(aes(x = "", y = Mixing.proportion, fill = Detection.prob)) +
  coord_polar(theta = "y") +
  labs(x = "", y = "", title = "Pan-genome gene distribution") +
  scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue"))
  
# The distribution in an average genome
binmix.lst$Mix.tbl %>% 
  filter(Components == ncomp) %>% 
  mutate(Single = Mixing.proportion * Detection.prob) %>%
  ggplot() +
  geom_col(aes(x = "", y = Single, fill = Detection.prob)) +
  coord_polar(theta = "y") +
  labs(x = "", y = "", title = "Average genome gene distribution") +
  scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue"))
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab