Genotype polyploid individuals from next generation
sequencing (NGS) data while assuming the genotype distribution is one of
several forms. flexdog
does this while accounting for allele bias,
overdispersion, sequencing error, and possibly outlying observations
(if model = "f1"
or model = "s1"
). This function has more
options than flexdog
and is only meant for expert users.
The method is described in detail in Gerard et. al. (2018) and
Gerard and Ferr<U+00E3>o (2019).
flexdog_full(
refvec,
sizevec,
ploidy,
model = c("norm", "hw", "bb", "ash", "s1", "s1pp", "f1", "f1pp", "flex", "uniform",
"custom"),
verbose = TRUE,
mean_bias = 0,
var_bias = 0.7^2,
mean_seq = -4.7,
var_seq = 1,
mean_od = -5.5,
var_od = 0.5^2,
seq = 0.005,
bias = 1,
od = 0.001,
update_bias = TRUE,
update_seq = TRUE,
update_od = TRUE,
mode = NULL,
use_cvxr = FALSE,
itermax = 200,
tol = 10^-4,
fs1_alpha = 10^-3,
ashpen = 10^-6,
p1ref = NULL,
p1size = NULL,
p2ref = NULL,
p2size = NULL,
snpname = NULL,
outliers = FALSE,
prior_vec = NULL
)
A vector of counts of reads of the reference allele.
A vector of total counts.
The ploidy of the species. Assumed to be the same for each individual.
What form should the prior (genotype distribution) take? See Details for possible values.
Should we output more (TRUE
) or less
(FALSE
)?
The prior mean of the log-bias.
The prior variance of the log-bias.
The prior mean of the logit of the sequencing error rate.
The prior variance of the logit of the sequencing error rate.
The prior mean of the logit of the overdispersion parameter.
The prior variance of the logit of the overdispersion parameter.
The starting value of the sequencing error rate.
The starting value of the bias.
The starting value of the overdispersion parameter.
A logical. Should we update bias
(TRUE
), or not (FALSE
)?
A logical. Should we update seq
(TRUE
), or not (FALSE
)?
A logical. Should we update od
(TRUE
), or not (FALSE
)?
The mode if model = "ash"
. If not provided,
flexdog
will estimate the mode. This is the starting point of
the allele frequency if model = "hw"
. This should be
NULL
for all other options of model
.
A logical. If model = "ash"
, then do you want
to use the EM algorithm
(FALSE
) or a convex optimization program using
the package CVXR (TRUE
)?
Only available if CVXR is installed. Setting use_cvxr
to
TRUE
is generally slower than setting it to FALSE
.
The maximum number of EM iterations to run for each mode
(if model = "ash"
) or the total number of EM iterations to
run (for any other value of model
).
The tolerance stopping criterion. The EM algorithm will stop
if the difference in the log-likelihoods between two consecutive
iterations is less than tol
.
The value at which to fix
the mixing proportion for the uniform component
when model = "f1"
, model = "f1pp"
,
model = "s1"
, or model = "s1pp"
.
I would recommend some small
value such as 10^-3
.
The penalty to put on the unimodal prior. Larger values shrink the unimodal prior towards the discrete uniform distribution.
The reference counts for the first parent if
model = "f1"
(or model = "f1pp"
), or for
the only parent if model = "s1"
(or model = "s1pp"
).
The total counts for the first parent if
model = "f1"
(or model = "f1pp"
),
or for the only parent if model = "s1"
(or model = "s1pp"
).
The reference counts for the second parent if
model = "f1"
(or model = "f1pp"
).
The total counts for the second parent if
model = "f1"
(or model = "f1pp"
).
A string. The name of the SNP under consideration.
This is just returned in the input
list for your reference.
A logical. Should we allow for the inclusion of outliers
(TRUE
) or not (FALSE
). Only supported when
model = "f1"
or model = "s1"
. I wouldn't
recommend it for any other model anyway.
The pre-specified genotype distribution. Only used if
model = "custom"
and must otherwise be NULL
. If specified,
then it should be a vector of length ploidy + 1
with
non-negative elements that sum to 1.
An object of class flexdog
, which consists
of a list with some or all of the following elements:
bias
The estimated bias parameter.
seq
The estimated sequencing error rate.
od
The estimated overdispersion parameter.
num_iter
The number of EM iterations ran. You should
be wary if this equals itermax
.
llike
The maximum marginal log-likelihood.
postmat
A matrix of posterior probabilities of each genotype for each individual. The rows index the individuals and the columns index the allele dosage.
gene_dist
The estimated genotype distribution. The
i
th element is the proportion of individuals with
genotype i-1
. If outliers = TRUE
, then this
is conditional on the point not being an outlier.
par
A list of the final estimates of the parameters
of the genotype distribution. The elements included in par
depends on the value of model
:
model = "norm"
:mu
is the normal mean and sigma
is the normal standard deviation (not variance).
model = "hw"
:alpha
is the major allele frequency.
model = "bb"
:alpha
is the major allele frequency and
tau
is the overdispersion parameter (see the description of
rho
in the Details of betabinom
).
model = "ash"
:par
is an empty list.
model = "s1"
:pgeno
is the allele dosage of the parent and
alpha
is the mixture proportion of the discrete uniform (included and
fixed at a small value mostly for numerical stability reasons). See the description
of fs1_alpha
in flexdog_full
.
model = "s1pp"
:pgeno
is the allele dosage of the parent and
p1_pair_weights
contains a vector of mixing weights where element i
is the
mixing proportion for the segregation distribution in row i
of
get_bivalent_probs(ploidy)$probmat[get_bivalent_probs(ploidy)$lvec == pgeno, , drop = FALSE]
.
model = "f1"
:p1geno
is the allele dosage of the first parent,
p2geno
is the allele dosage of the second parent, and alpha
is the mixture proportion of the discrete uniform (included and
fixed at a small value mostly for numerical stability reasons). See the description
of fs1_alpha
in flexdog_full
.
model = "f1pp"
:p1geno
is the allele dosage of the first parent,
p2geno
is the allele dosage of the second parent,
p1_pair_weights
contains a vector of mixing weights where element i
is the
mixing proportion for the segregation distribution for parent 1 in row i
of
get_bivalent_probs(ploidy)$probmat[get_bivalent_probs(ploidy)$lvec == p1geno, , drop = FALSE]
,
and p2_pair_weights
contains a vector of mixing weights where element i
is the
mixing proportion for the segregation distribution for parent 2 in row i
of
get_bivalent_probs(ploidy)$probmat[get_bivalent_probs(ploidy)$lvec == p2geno, , drop = FALSE]
.
model = "flex"
:par
is an empty list.
model = "uniform"
:par
is an empty list.
model = "custom"
:par
is an empty list.
geno
The posterior mode genotype. These are your genotype estimates.
maxpostprob
The maximum posterior probability. This is equivalent to the posterior probability of correctly genotyping each individual.
postmean
The posterior mean genotype. In downstream association studies, you might want to consider using these estimates.
input$refvec
The value of refvec
provided by
the user.
input$sizevec
The value of sizevec
provided
by the user.
input$ploidy
The value of ploidy
provided
by the user.
input$model
The value of model
provided by
the user.
input$p1ref
The value of p1ref
provided by the user.
input$p1size
The value of p1size
provided by the user.
input$p2ref
The value of p2ref
provided by the user.
input$p2size
The value of p2size
provided by the user.
input$snpname
The value of snpname
provided by the user.
prop_mis
The posterior proportion of individuals genotyped incorrectly.
out_prop
The estimated proportion of points that
are outliers. Only available if outliers = TRUE
.
prob_out
The ith element is the posterior probability
that individual i is an outlier. Only available if
outliers = TRUE
.
Possible values of the genotype distribution (values of model
) are:
"norm"
A distribution whose genotype frequencies are proportional
to the density value of a normal with some mean and some standard deviation.
Unlike the "bb"
and "hw"
options, this will allow for
distributions both more and less dispersed than a binomial.
This seems to be the most robust to violations in modeling assumptions, and so is the
default. This prior class was developed in Gerard and Ferr<U+00E3>o (2019).
"hw"
A binomial distribution that results from assuming that the population is in Hardy-Weinberg equilibrium (HWE). This actually does pretty well even when there are minor to moderate deviations from HWE. Though it does not perform as well as the `"norm"` option when there are severe deviations from HWE.
"bb"
A beta-binomial distribution. This is an overdispersed
version of "hw"
and can be derived from a special case of the Balding-Nichols model.
"ash"
Any unimodal prior. This can sometimes be sensitive to violations
in modeling assumptions, but tends to work better than the "flex"
option.
This prior class was developed in Gerard and Ferr<U+00E3>o (2019).
"s1"
This prior assumes the individuals are
all full-siblings resulting
from one generation of selfing. I.e. there is only
one parent. This model assumes
a particular type of meiotic behavior: polysomic
inheritance with
bivalent, non-preferential pairing.
Since this is a pretty strong and well-founded prior,
we allow outliers = TRUE
when model = "s1"
.
"s1pp"
The same as "s1"
but accounts for possible
(and arbitrary levels of) preferential
pairing during meiosis. The only supported values of ploidy
right now for
this option are 4
and 6
.
"f1"
This prior assumes the individuals are all
full-siblings resulting
from one generation of a bi-parental cross.
This model assumes
a particular type of meiotic behavior: polysomic
inheritance with
bivalent, non-preferential pairing. Since this is
a pretty strong
and well-founded prior, we allow outliers = TRUE
when model = "f1"
.
"f1pp"
The same as "f1"
but accounts for possible
(and arbitrary levels of) preferential
pairing during meiosis. This option is mostly untested for values of ploidy
greater than 6
.
"flex"
Generically any categorical distribution. Theoretically, this works well if you have a lot of individuals. In practice, it seems to be much less robust to violations in modeling assumptions.
"uniform"
A discrete uniform distribution. This should never be used in practice.
"custom"
A pre-specified prior distribution. You specify
it using the prior_vec
argument. You should almost never
use this option in practice.
You might think a good default is model = "uniform"
because it is
somehow an "uninformative prior." But it is very informative and tends to
work horribly in practice. The intuition is that it will estimate
the allele bias and sequencing error rates so that the estimated genotypes
are approximately uniform (since we are assuming that they are approximately
uniform). This will usually result in unintuitive genotyping since most
populations don't have a uniform genotype distribution.
I include it as an option only for completeness. Please don't use it.
The value of prop_mis
is a very intuitive measure for
the quality of the SNP. prop_mis
is the posterior
proportion of individuals mis-genotyped. So if you want only SNPS
that accurately genotype, say, 95% of the individuals, you could
discard all SNPs with a prop_mis
over 0.05
.
The value of maxpostprob
is a very intuitive measure
for the quality of the genotype estimate of an individual.
This is the posterior probability of correctly genotyping
the individual when using geno
(the posterior mode)
as the genotype estimate. So if you want to correctly genotype,
say, 95% of individuals, you could discard all individuals
with a maxpostprob
of under 0.95
. However, if you are
just going to impute missing genotypes later, you might consider
not discarding any individuals as flexdog
's genotype estimates will
probably be more accurate than other more naive approaches, such
as imputing using the grand mean.
In most datasets I've examined, allelic bias is a major issue. However,
you may fit the model assuming no allelic bias by setting
update_bias = FALSE
and bias_init = 1
.
Prior to using flexdog
, during the read-mapping step,
you could try to get rid of allelic bias by
using WASP (https://doi.org/10.1101/011221). If you are successful
in removing the allelic bias (because its only source was the read-mapping
step), then setting update_bias = FALSE
and bias_init = 1
would be reasonable. You can visually inspect SNPs for bias by
using plot_geno
.
flexdog
, like most methods, is invariant to which allele you
label as the "reference" and which you label as the "alternative".
That is, if you set refvec
with the number of alternative
read-counts, then the resulting genotype estimates
will be the estimated allele dosage of the alternative allele.
Gerard, D., Ferr<U+00E3>o, L. F. V., Garcia, A. A. F., & Stephens, M. (2018). Genotyping Polyploids from Messy Sequencing Data. Genetics, 210(3), 789-807. doi: 10.1534/genetics.118.301468.
Gerard, D. and Ferr<U+00E3>o, L. F. V. (2019). Priors for Genotyping Polyploids. Bioinformatics. doi: 10.1093/bioinformatics/btz852.
Run browseVignettes(package = "updog")
in R for example usage.
Other useful functions include:
flexdog
For a more user-friendly version of
flexdog_full
.
rgeno
For simulating genotypes under the allowable
prior models in flexdog
.
rflexdog
For simulating read-counts under the
assumed likelihood model in flexdog
.
plot.flexdog
For plotting the output of
flexdog
.
oracle_mis
For calculating the oracle genotyping
error rates. This is useful for read-depth calculations
before collecting data. After you have data, using
the value of prop_mis
is better.
oracle_cor
For calculating the correlation between the true genotypes and an oracle estimator (useful for read-depth calculations before collecting data).
# NOT RUN {
## A natural population. We will assume a
## normal prior since there are so few
## individuals.
data("uitdewilligen")
ploidy <- 4
refvec <- uitdewilligen$refmat[, 1]
sizevec <- uitdewilligen$sizemat[, 1]
fout <- flexdog_full(refvec = refvec,
sizevec = sizevec,
ploidy = ploidy,
model = "norm")
plot(fout)
# }
Run the code above in your browser using DataLab