mupdog: Multi-SNP updog.

Description

A method to genotype autopolyploids using GBS or RAD-seq like data by accounting for correlations in the genotype distribution between the individuals.

Usage

mupdog(
  refmat,
  sizemat,
  ploidy,
  verbose = TRUE,
  mean_bias = 0,
  var_bias = 1,
  mean_seq = -4.7,
  var_seq = 1,
  seq = NULL,
  bias = NULL,
  od = NULL,
  allele_freq = NULL,
  inbreeding = NULL,
  cor_mat = NULL,
  postmean = NULL,
  postvar = NULL,
  update_cor = TRUE,
  update_inbreeding = TRUE,
  update_allele_freq = TRUE,
  num_core = 1,
  update_method = c("Brent", "L-BFGS-B"),
  control = list()
)

Arguments

refmat

A matrix of reference counts. The rows index the individuals and the columns index the SNPs.

sizemat

A matrix of total counts. The rows index the individuals and the columns index the SNPs. Should have the same dimensions as refmat.

ploidy

The ploidy of the species. To estimate the ploidy, re-run mupdog at various ploidy levels and choose the one with the largest ELBO. This assumes that the ploidy is the same for all individuals in the sample.

verbose

Should we print a lot of output TRUE or not FALSE?

mean_bias

The prior mean of the log-bias. Defaults to 0 (no bias).

var_bias

The prior variance on the log-bias. Defaults to 1. This roughly corresponds to likely bias values between 0.14 and 7.4. This is a far wider interval than what we observe in practice, thus making this prior rather uninformative. We usually observe bias values somewhere between 0.5 and 2.

mean_seq

The prior mean of the logit-sequencing-error-rate. Defaults to -4.7. This corresponds to a sequencing error rate of 0.009.

var_seq

The prior variance of the logit-sequencing-error-rate. Defaults to 1. This corresponds to likely values of 0.001 and 0.06. This upper bound is larger than what we would expect given the current state of next-gen-sequencing technology.

seq

A vector of initial sequencing errors. Should be the same length as the number of columns of refmat (number of SNPs). Must be between 0 and 1.

bias

A vector of initial bias parameters. Should be the same length as the number of columns of refmat (number of SNPs). Must be greater than 0.

A vector of initial overdispersion parameters. Should be the same length as the number of columns of refmat (number of SNPs). Must be between 0 and 1.

allele_freq

A vector of initial allele frequencies. Should be the same length as the number of columns of refmat (number of SNPs). Must be between 0 and 1.

inbreeding

A vector of initial individual-specific inbreeding coefficients. Should be the same length as the number of rows of refmat (number of individuals). Must be between 0 and 1.

cor_mat

Initial correlation matrix. Should have the same number of columns/rows as the number of individuals.

postmean

Initial variational posterior means. Should have the same dimensions as refmat.

postvar

Initial posterior variances. Should have the same dimensions as refmat.

update_cor

A logical. Should we update the underlying correlation matrix TRUE or not FALSE. It might be unwise to set this to TRUE if you have more individuals than SNPs.

update_inbreeding

A logical. Should we update the inbreeding coefficients TRUE or not FALSE?

update_allele_freq

A logical. Should we update the allele frequencies TRUE or not FALSE?

num_core

The number of cores to use if you want to run the optimization steps in parallel. If num_core = 1, then the optimization step will not be run in parallel.

update_method

What generic optimizer should we use to update allele_freq and inbreeding? Options are either "Brent" or "L-BFGS-B". See optim for details on these optimizers.

control

A list of control parameters (itermax, obj_tol).

Value

A list with some or all of the following elements:

map_dosage: A matrix of numerics containing the variational maximum-a-posterior (MAP) genotypes (allele dosages) for each individual at each SNP. Element (i, j) is the MAP genotype for individual i at SNP j.
maxpostprob: A matrix of numerics containing the variational maximum posterior probabilities for each individual at each SNP. The (i, j)th element is the variational posterior probability that individual i is genotyped correctly at SNP j.
postprob: A three-way array of numerics. Element (i, j, k) is the variational posterior probability that individual i has genotype k-1 at SNP j.
seq: A vector of numerics. Element j is the estimated sequencing error rate for SNP j.
bias: A vector of numerics. Element j is the estimated allelic bias for SNP j.
od: A vector of numerics. Element j is the estimated overdispersion parameter for SNP j.
allele_freq: A vector of numerics. Element j is the estimated allele-frequency for SNP j.
inbreeding: A vector of numerics. Element i is the estimated inbreeding coefficient for individual i.
cor_mat: A symmetric matrix of numerics. Element (i, j) is the estimated _latent_ correlation between individual i and individual j.
postmean: A matrix of numerics. Element (i, j) is the variational posterior mean for individual i at SNP j.
postvar: A matrix of numerics. Element (i, j) is the variational posterior variance for individual i at SNP j.
input$refmat: A matrix of numerics. The inputted refmat.
input$sizemat: A matrix of numerics. The inputted sizemat.
input$ploidy: The inputted ploidy.
obj: The maximized variational objective.

Details

Blischak et al (2017) developed a genotyping approach for autopolyploids that assumes a Balding-Nichols generative model (Balding and Nichols, 1997) on the genotypes. Using a different generative model, Gerard et al (2018) accounted for common issues in sequencing data ignored by previous researchers. Mupdog unites and extends these two approaches:

Unite: We account for locus-specific allele-bias, locus-specific sequencing error, and locus-specific overdispersion while marginally assuming a Balding-Nichols generative model on the genotypes.
Extend: We account for underlying correlations between the individuals using a Gaussian copula model.

Mupdog uses a variational Bayes approach to estimate all parameters of interest and the posterior probabilities of the genotypes for each individual at each locus.

References

David J Balding and Richard A Nichols. Significant genetic correlations among caucasians at forensic DNA loci. Heredity, 78(6):583<U+2013>589, 1997. doi: 10.1038/sj.hdy.6881750.
Paul D Blischak, Laura S Kubatko, and Andrea D Wolfe. SNP genotyping and parameter estimation in polyploids using low-coverage sequencing data. Bioinformatics, page btx587, 2017. doi: 10.1093/bioinformatics/btx587.
Gerard, D., Ferr<U+00E3>o, L. F. V., Garcia, A. A. F., & Stephens, M. (2018). Genotyping Polyploids from Messy Sequencing Data. Genetics, 210(3), 789-807. doi: 10.1534/genetics.118.301468.

Examples

Run this code

# NOT RUN {
data(uitdewilligen)
mout <- mupdog(refmat = uitdewilligen$refmat,
               sizemat = uitdewilligen$sizemat,
               ploidy = uitdewilligen$ploidy,
               verbose = FALSE,
               control = list(obj_tol = 10^-4))

## Summaries of output
plot(mout, 4)
hist(mout$bias)
hist(mout$seq)
hist(mout$od)
hist(mout$inbreeding)
hist(mout$allele_freq)

## mupdog can correctly estimate ploidy to be 4
mout2 <- mupdog(refmat = uitdewilligen$refmat,
                sizemat = uitdewilligen$sizemat,
                ploidy = 2,
                verbose = FALSE,
                control = list(obj_tol = 10^-4))

mout6 <- mupdog(refmat = uitdewilligen$refmat,
                sizemat = uitdewilligen$sizemat,
                ploidy = 6,
                verbose = FALSE,
                control = list(obj_tol = 10^-4))

mout8 <- mupdog(refmat = uitdewilligen$refmat,
                sizemat = uitdewilligen$sizemat,
                ploidy = 8,
                verbose = FALSE,
                control = list(obj_tol = 10^-4))

y <- c(mout2$obj, mout$obj, mout6$obj, mout8$obj)
x <- seq(2, 8, by = 2)
plot(x, y, type = "l", xlab = "ploidy", ylab = "objective")
# }
# NOT RUN {

# }

Run the code above in your browser using DataLab