If allele frequencies are not already recorded in object, they will
be added using AddAlleleFreqHWE. Allele frequencies are then
used for estimating the probability of sampling an allele from a genotype due
to sample contamination. Given a known genotype with \(x\) copies of
allele \(i\), ploidy \(k\), allele frequency \(p_i\) in the population used for
making sequencing libraries, and contamination rate \(c\), the probability of
sampling a read \(r_i\) from that locus corresponding to that allele is
$$P(r_i | x) = \frac{x}{k} * (1 - c) + p_i * c$$
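As a rough illustration of the formula above, the following Python sketch evaluates \(P(r_i | x)\) for every possible allele copy number of a tetraploid. It is a standalone example, not the package's own code; the function name read_prob and the numeric values (allele frequency 0.2, contamination rate 0.001) are assumptions chosen only for illustration.

```python
def read_prob(x, k, p_i, c):
    """Probability that a read matches allele i, given x copies of that
    allele, ploidy k, allele frequency p_i, and contamination rate c."""
    return (x / k) * (1 - c) + p_i * c


# Hypothetical inputs: tetraploid (k = 4), p_i = 0.2, c = 0.001
for x in range(5):
    print(x, read_prob(x, k=4, p_i=0.2, c=0.001))
```

Note that with \(c > 0\) and \(0 < p_i < 1\), the probability is never exactly 0 or 1, even for genotypes with zero or \(k\) copies of the allele.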
To estimate the genotype likelihood, where \(nr_i\) is the number of reads
corresponding to allele \(i\) for a given taxon and locus and \(nr_j\) is the
number of reads corresponding to all other alleles for that taxon and locus:
$$P(nr_i, nr_j | x) = {{nr_i + nr_j}\choose{nr_i}} * \frac{B[P(r_i | x) * d + nr_i, [1 - P(r_i | x)] * d + nr_j]}{B[P(r_i | x) * d, [1 - P(r_i | x)] * d]}$$
where
$${{nr_i + nr_j}\choose{nr_i}} = \frac{(nr_i + nr_j)!}{nr_i! * nr_j!}$$
B is the beta function, and \(d\) is the overdispersion parameter set by
overdispersion.
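The likelihood above has the form of a beta-binomial distribution, and a minimal Python sketch of it might look like the following. This is an illustration under stated assumptions rather than the package's implementation: the function names, the read counts (6 and 4), and the parameter values (tetraploid, allele frequency 0.2, contamination rate 0.001, overdispersion 9) are all hypothetical. The beta function and the binomial coefficient are computed on the log scale with lgamma to avoid overflow at high read depth.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b), for a > 0 and b > 0."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, m):
    """Natural log of the binomial coefficient n choose m."""
    return lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)


def genotype_likelihood(nr_i, nr_j, x, k, p_i, c, d):
    """P(nr_i, nr_j | x) under the beta-binomial model described above."""
    # Probability of sampling a read of allele i, allowing for contamination
    p = (x / k) * (1 - c) + p_i * c
    a = p * d
    b = (1 - p) * d
    log_lik = (log_choose(nr_i + nr_j, nr_i)
               + log_beta(a + nr_i, b + nr_j)
               - log_beta(a, b))
    return exp(log_lik)


# Hypothetical data: 6 reads of allele i and 4 reads of all other alleles
likelihoods = [genotype_likelihood(6, 4, x, k=4, p_i=0.2, c=0.001, d=9)
               for x in range(5)]
print(likelihoods)
```

Because the contamination term keeps \(P(r_i | x)\) strictly between 0 and 1 in this example, both arguments of the beta function stay positive for every genotype.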