HindHe and HindHeMapping both generate a matrix of values, with taxa in rows and loci in columns. The mean value of the matrix is expected to be a certain value depending on the ploidy and, in the case of natural populations and diversity panels, the inbreeding coefficient. colMeans of the matrix can be used to filter non-Mendelian loci from the dataset. rowMeans of the matrix can be used to identify taxa that are not the expected ploidy, are interspecific hybrids, or are a mix of multiple samples.
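The filtering described above can be sketched in base R. This is an illustration only: the matrix `hh` is a made-up stand-in for the output of HindHe, and the `cutoff` of 0.7 is a hypothetical threshold, not a value recommended by the package.

```r
# Toy stand-in for the matrix returned by HindHe(): taxa in rows, loci in columns.
# All values are invented for illustration.
hh <- matrix(c(0.48, 0.52, 0.95,
               0.50, 0.47, 0.90,
               0.51, NA,   0.99,
               0.80, 0.85, 0.97),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("taxon", 1:4), paste0("locus", 1:3)))

cutoff <- 0.7  # hypothetical threshold; choose one suited to your ploidy and data

# Loci whose mean is well above expectation are candidates for removal.
locus_means <- colMeans(hh, na.rm = TRUE)
keep_loci <- names(locus_means)[locus_means <= cutoff]

# Taxon means are computed on the retained loci only, so that
# non-Mendelian loci do not inflate them.
taxon_means <- rowMeans(hh[, keep_loci, drop = FALSE], na.rm = TRUE)
suspect_taxa <- names(taxon_means)[taxon_means > cutoff]

keep_loci
suspect_taxa
```

Filtering loci before inspecting taxa is a design choice in this sketch; with real data it may be worth examining both sets of means before discarding anything.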
HindHe(object, ...)
# S3 method for RADdata
HindHe(object, omitTaxa = GetBlankTaxa(object), ...)
HindHeMapping(object, ...)
# S3 method for RADdata
HindHeMapping(object, n.gen.backcrossing = 0, n.gen.intermating = 0,
              n.gen.selfing = 0, ploidy = object$possiblePloidies[[1]],
              minLikelihoodRatio = 10,
              omitTaxa = c(GetDonorParent(object), GetRecurrentParent(object),
                           GetBlankTaxa(object)), ...)
A named matrix, with taxa in rows and loci in columns. For HindHeMapping, loci are omitted if consistent parental genotypes could not be determined across alleles.
A RADdata object. Genotype calling does not need to have been performed yet. If the population is a mapping population, SetDonorParent and SetRecurrentParent should have been run already.
A character vector indicating names of taxa not to be included in the output. For HindHe, these taxa will also be omitted from allele frequency estimations.
The number of generations of backcrossing performed in a mapping population.
The number of generations of intermating performed in a mapping population. Included for consistency with PipelineMapping2Parents, but currently gives an error if set to any value other than zero. If the most recent generation in your mapping population was random mating among all progeny, use HindHe instead of HindHeMapping.
The number of generations of self-fertilization performed in a mapping population.
A single value indicating the assumed ploidy to test. Currently, only autopolyploid and diploid inheritance modes are supported.
Used internally by EstimateParentalGenotypes as a threshold for certainty of parental genotypes. Decrease this value if too many markers are being discarded from the calculation.
Additional arguments (none implemented).
Lindsay V. Clark
These functions are especially useful for highly duplicated genomes, in which RAD tag alignments may have been incorrect, resulting in groups of alleles that do not represent true Mendelian loci. The statistic that is calculated is based on the principle that observed heterozygosity will be higher than expected heterozygosity if a "locus" actually represents two or more collapsed paralogs. However, the statistic uses read depth in place of genotypes, eliminating the need to perform genotype calling before filtering.
For a given taxon × locus combination, \(H_{ind}\) is the probability that two sequencing reads, sampled without replacement, represent different alleles (RAD tags).
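As a minimal sketch of this definition, \(H_{ind}\) can be computed directly from the per-allele read depths of one taxon at one locus. The helper `hind_from_depth` is hypothetical and written only from the definition above; it is not the package's internal code.

```r
# Probability that two reads drawn without replacement from one
# taxon x locus carry different alleles, given per-allele read depths.
# hind_from_depth is a hypothetical helper, not part of polyRAD.
hind_from_depth <- function(depth) {
  total <- sum(depth)
  if (total < 2) return(NA_real_)  # at least two reads are needed to draw a pair
  # 1 minus the probability that both reads come from the same allele
  1 - sum(depth * (depth - 1)) / (total * (total - 1))
}

hind_from_depth(c(5, 5))   # two alleles at equal depth: 1 - 40/90
hind_from_depth(c(10, 0))  # all reads from one allele: 0
hind_from_depth(c(1, 0))   # a single read: NA
```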
In HindHe, \(H_E\) is the expected heterozygosity, estimated from allele frequencies by taking the column means of object$depthRatios. This is also the estimated probability that if two alleles were sampled at random from the population at a given locus, they would be different alleles.
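The calculation can be illustrated for a single locus in base R. The matrix `depth_ratios` below is invented and covers only one locus with three alleles, and the formula one minus the sum of squared allele frequencies is the standard expression for expected heterozygosity, assumed here rather than taken from the package source.

```r
# Hypothetical depth ratios for ONE locus with three alleles:
# taxa in rows, alleles in columns; each row sums to 1.
depth_ratios <- matrix(c(0.5, 0.5, 0.0,
                         1.0, 0.0, 0.0,
                         0.0, 0.5, 0.5,
                         0.5, 0.0, 0.5),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(paste0("taxon", 1:4),
                                       c("allele1", "allele2", "allele3")))

# Allele frequencies estimated as column means of the depth ratios.
allele_freq <- colMeans(depth_ratios, na.rm = TRUE)

# Expected heterozygosity: probability that two alleles sampled at
# random from the population differ.
he <- 1 - sum(allele_freq^2)
he  # 0.625 for these made-up values
```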
In HindHeMapping, \(H_E\) is the average probability that in a random progeny, two alleles sampled without replacement would be different. The number of generations of backcrossing and self-fertilization, along with the ploidy and estimated parental genotypes, are needed to make this calculation. The function essentially simulates the mapping population based on parental genotypes to determine \(H_E\).
The expectation is that
$$H_{ind}/H_E = \frac{ploidy - 1}{ploidy} \times (1 - F)$$
in a diversity panel, where \(F\) is the inbreeding coefficient, and
$$H_{ind}/H_E = \frac{ploidy - 1}{ploidy}$$
in a mapping population. Loci that have much higher average values likely represent collapsed paralogs that should be removed from the dataset. Taxa with much higher average values may be higher ploidy than expected, interspecific hybrids, or multiple samples mixed together.
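The two expectations, and the rearrangement of the first to solve for \(F\), can be written out as a short base R sketch. Both function names are hypothetical helpers defined here for illustration; the package's own InbreedingFromHindHe and ExpectedHindHe are the supported tools for this purpose.

```r
# Expected Hind/He for a given ploidy and inbreeding coefficient.
# Use inbreeding = 0 for a mapping population.
# (Hypothetical helper, not part of polyRAD.)
expected_ratio <- function(ploidy, inbreeding = 0) {
  (ploidy - 1) / ploidy * (1 - inbreeding)
}

# Rearranging the diversity panel formula to estimate F from an
# observed mean Hind/He. (Hypothetical helper, not part of polyRAD.)
inbreeding_from_ratio <- function(ratio, ploidy) {
  1 - ratio * ploidy / (ploidy - 1)
}

expected_ratio(2)               # diploid, no inbreeding: 0.5
expected_ratio(4)               # tetraploid, no inbreeding: 0.75
expected_ratio(2, 0.2)          # diploid with F = 0.2
inbreeding_from_ratio(0.4, 2)   # recovers F of about 0.2
```

A locus or taxon whose mean lies far above `expected_ratio()` for the assumed ploidy is the kind of outlier the preceding paragraph describes.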
Clark, L. V., Mays, W., Lipka, A. E. and Sacks, E. J. (2022) A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes. BMC Bioinformatics 23, 101, doi:10.1186/s12859-022-04635-9.
A seminar describing \(H_{ind}/H_E\) is available at https://youtu.be/Z2xwLQYc8OA?t=1678.
InbreedingFromHindHe, ExpectedHindHe
data(exampleRAD)
hhmat <- HindHe(exampleRAD)
colMeans(hhmat, na.rm = TRUE) # near 0.5 for diploid loci, 0.75 for tetraploid loci
data(exampleRAD_mapping)
exampleRAD_mapping <- SetDonorParent(exampleRAD_mapping, "parent1")
exampleRAD_mapping <- SetRecurrentParent(exampleRAD_mapping, "parent2")
hhmat2 <- HindHeMapping(exampleRAD_mapping, n.gen.backcrossing = 1)
colMeans(hhmat2, na.rm = TRUE) # near 0.5; all loci diploid