flexdog to multiple SNP's.This is a convenience function that will run flexdog over many SNP's.
Support is provided for parallel computing through the doParallel package.
This function has not been extensively tested. Please report any bugs to
http://github.com/dcgerard/updog/issues.
multidog(
refmat,
sizemat,
ploidy,
model = c("norm", "hw", "bb", "ash", "s1", "s1pp", "f1", "f1pp", "flex", "uniform",
"custom"),
nc = 1,
p1_id = NULL,
p2_id = NULL,
bias_init = exp(c(-1, -0.5, 0, 0.5, 1)),
outliers = FALSE,
prior_vec = NULL,
...
)A matrix of reference read counts. The columns index
the individuals and the rows index the markers (SNP's). This matrix must have
rownames (for the names of the markers) and column names (for the names
of the individuals). These names must match the names in sizemat.
A matrix of total read counts. The columns index
the individuals and the rows index the markers (SNP's). This matrix must have
rownames (for the names of the markers) and column names (for the names
of the individuals). These names must match the names in refmat.
The ploidy of the species. Assumed to be the same for each individual.
What form should the prior (genotype distribution) take? See Details for possible values.
The number of computing cores to use. This should never be
more than the number of cores available in your computing environment.
You can determine the maximum number of available cores by running
parallel::detectCores() in R.
The ID of the first parent. This should be a character of
length 1. This should correspond to a single column name in refmat
and sizemat.
The ID of the second parent. This should be a character of
length 1. This should correspond to a single column name in refmat
and sizemat.
A vector of initial values for the bias parameter
over the multiple runs of flexdog_full.
A logical. Should we allow for the inclusion of outliers
(TRUE) or not (FALSE). Only supported when
model = "f1" or model = "s1". I wouldn't
recommend it for any other model anyway.
The pre-specified genotype distribution. Only used if
model = "custom" and must otherwise be NULL. If specified,
then it should be a vector of length ploidy + 1 with
non-negative elements that sum to 1.
Additional parameters to pass to flexdog_full.
A list-like object of two data frames.
snpdfA data frame containing properties of the SNP's (markers). The rows index the SNP's. The variables include:
snpThe name of the SNP (marker).
biasThe estimated allele bias of the SNP.
seqThe estimated sequencing error rate of the SNP.
odThe estimated overdispersion parameter of the SNP.
prop_misThe estimated proportion of individuals misclassified in the SNP.
num_iterThe number of iterations performed during the EM algorithm for that SNP.
llikeThe maximum marginal likelihood of the SNP.
ploidyThe provided ploidy of the species.
modelThe provided model for the prior genotype distribution.
Pr_kThe estimated frequency of individuals with genotype k, where k can be any integer between 0 and the ploidy level.
See the return value of
par in the help page of flexdog.
inddfA data frame containing the properties of the individuals at each SNP. The variables include:
snpThe name of the SNP (marker).
indThe name of the individual.
refThe provided reference counts for that individual at that SNP.
sizeThe provided total counts for that individual at that SNP.
genoThe posterior mode genotype for that individual at that SNP. This is the estimated reference allele dosage for a given individual at a given SNP.
postmeanThe posterior mean genotype for that individual at that SNP. This is a continuous genotype estimate of the reference allele dosage for a given individual at a given SNP.
maxpostprobThe maximum posterior probability. This is the posterior probability that the individual was genotyped correctly.
Pr_kThe posterior probability that a given individual at a given SNP has genotype k, where k can vary from 0 to the ploidy level of the species.
You should format your reference counts and total read counts in two separate matrices. The rows should index the markers (SNP's) and the columns should index the individuals. Row names are how we ID the SNP's and column names are how we ID the individuals, and so they are required attributes.
See the details of flexdog for the possible values of
model.
If model = "f1", model = "f1pp", model = "s1",
or model = "s1pp" then the user may provide the individual ID
for parent(s) via the p1_id and p2_id arguments.
The output is a list containing two data frames. The first data frame,
called snpdf, contains information on each SNP, such as the allele bias
and the sequencing error rate. The second data frame, called inddf,
contains information on each individual at each SNP, such as the estimated
genotype and the posterior probability of being classified correctly.
Using an nc value greater than 1 will allow you to
run flexdog in parallel. Only set nc greater than
1 if you are sure you have access to the proper number of cores.
The upper bound on the value of nc you should try can be determined
by running parallel::detectCores() in R.
SNP's that contain 0 reads (or all missing data) are entirely removed.
# NOT RUN {
data("uitdewilligen")
mout <- multidog(refmat = t(uitdewilligen$refmat),
sizemat = t(uitdewilligen$sizemat),
ploidy = uitdewilligen$ploidy,
nc = 2)
mout$inddf
mout$snpdf
# }
# NOT RUN {
# }
Run the code above in your browser using DataLab