HindHe: Identify Non-Mendelian Loci and Taxa that Deviate from Ploidy Expectations

Description

HindHe and HindHeMapping both generate a matrix of values, with taxa in rows and loci in columns. The mean value of the matrix is expected to be a certain value depending on the ploidy and, in the case of natural populations and diversity panels, the inbreeding coefficient. colMeans of the matrix can be used to filter non-Mendelian loci from the dataset. rowMeans of the matrix can be used to identify taxa that are not the expected ploidy, are interspecific hybrids, or are a mix of multiple samples.

Usage

HindHe(object, ...)
# S3 method for RADdata
HindHe(object, omitTaxa = GetBlankTaxa(object), ...)
HindHeMapping(object, ...)
# S3 method for RADdata
HindHeMapping(object, n.gen.backcrossing = 0, n.gen.intermating = 0,
              n.gen.selfing = 0, ploidy = object$possiblePloidies[[1]],
              minLikelihoodRatio = 10,
              omitTaxa = c(GetDonorParent(object), GetRecurrentParent(object), 
                           GetBlankTaxa(object)), ...)

Value

A named matrix, with taxa in rows and loci in columns. For HindHeMapping, loci are omitted if consistent parental genotypes could not be determined across alleles.

Arguments

object: A RADdata object. Genotype calling does not need to have been performed yet. If the population is a mapping population, SetDonorParent and SetRecurrentParent should have been run already.
omitTaxa: A character vector indicating names of taxa not to be included in the output. For HindHe, these taxa will also be omitted from allele frequency estimations.
n.gen.backcrossing: The number of generations of backcrossing performed in a mapping population.
n.gen.intermating: The number of generations of intermating performed in a mapping population. Included for consistency with PipelineMapping2Parents, but currently will give an error if set to any value other than zero. If the most recent generation in your mapping population was random mating among all progeny, use HindHe instead of HindHeMapping.
n.gen.selfing: The number of generations of self-fertilization performed in a mapping population.
ploidy: A single value indicating the assumed ploidy to test. Currently, only autopolyploid and diploid inheritance modes are supported.
minLikelihoodRatio: Used internally by EstimateParentalGenotypes as a threshold for certainty of parental genotypes. Decrease this value if too many markers are being discarded from the calculation.
...: Additional arguments (none implemented).

Author

Lindsay V. Clark

Details

These functions are especially useful for highly duplicated genomes, in which RAD tag alignments may have been incorrect, resulting in groups of alleles that do not represent true Mendelian loci. The statistic that is calculated is based on the principle that observed heterozygosity will be higher than expected heterozygosity if a "locus" actually represents two or more collapsed paralogs. However, the statistic uses read depth in place of genotypes, eliminating the need to perform genotype calling before filtering.

For a given taxon * locus, $H_{ind}$ is the probability that two sequencing reads, sampled without replacement, are different alleles (RAD tags).

In HindHe, $H_E$ is the expected heterozygosity, estimated from allele frequencies by taking the column means of object$depthRatios. This is also the estimated probability that if two alleles were sampled at random from the population at a given locus, they would be different alleles.

In HindHeMapping, $H_E$ is the average probability that in a random progeny, two alleles sampled without replacement would be different. The number of generations of backcrossing and self-fertilization, along with the ploidy and estimated parental genotypes, are needed to make this calculation. The function essentially simulates the mapping population based on parental genotypes to determine $H_E$.

The expectation is that

$$H_{ind}/H_E = \frac{ploidy - 1}{ploidy} * (1 - F)$$

in a diversity panel, where $F$ is the inbreeding coefficient, and

$$H_{ind}/H_E = \frac{ploidy - 1}{ploidy}$$

in a mapping population. Loci that have much higher average values likely represent collapsed paralogs that should be removed from the dataset. Taxa with much higher average values may be higher ploidy than expected, interspecific hybrids, or multiple samples mixed together.

References

Clark, L. V., Mays, W., Lipka, A. E. and Sacks, E. J. (2022) A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes. BMC Bioinformatics 23, 101, doi:10.1186/s12859-022-04635-9.

A seminar describing $H_{ind}/H_E$ is available at https://youtu.be/Z2xwLQYc8OA?t=1678.

Examples

Run this code

data(exampleRAD)

hhmat <- HindHe(exampleRAD)
colMeans(hhmat, na.rm = TRUE) # near 0.5 for diploid loci, 0.75 for tetraploid loci

data(exampleRAD_mapping)
exampleRAD_mapping <- SetDonorParent(exampleRAD_mapping, "parent1")
exampleRAD_mapping <- SetRecurrentParent(exampleRAD_mapping, "parent2")

hhmat2 <- HindHeMapping(exampleRAD_mapping, n.gen.backcrossing = 1)
colMeans(hhmat2, na.rm = TRUE) # near 0.5; all loci diploid

Run the code above in your browser using DataLab