estimate_IBD: Generate IBD probabilities from marker genotypes and a phased linkage map

Description

estimate_IBD is a function for creating identity-by-descent (IBD) probabilities. Two computational methods are offered: by default IBD probabilites are estimated using hidden Markov models, but a heuristic method based on Bourke et al. (2014) is also included. Basic input data for this function are marker genotypes (either discrete marker dosages (ie scores 0, 1, ..., ploidy representing the number of copies of the marker allele), or the probabilities of these dosages) and a phased linkage map. Details on each of the methods are included under method

Usage

estimate_IBD(
  input_type = "discrete",
  genotypes,
  phased_maplist,
  method = "hmm",
  remove_markers = NULL,
  ploidy,
  ploidy2 = NULL,
  parent1 = "P1",
  parent2 = "P2",
  individuals = "all",
  log = NULL,
  map_function = "haldane",
  bivalent_decoding = TRUE,
  error = 0.01,
  full_multivalent_hexa = FALSE,
  verbose = FALSE,
  ncores = 1,
  fix_threshold = 0.1,
  factor_dist = 1
)

Value

A list of IBD probabilities, organised by linkage group (as given in the input phased_maplist). Each list item is itself a list containing the following:

IBDtype: The type of IBD; for this function only "genotypeIBD" are calculated.
IBDarray: A 3d array of IBD probabilities, with dimensions marker, genotype-class and F1 individual.
map: A 3-column data-frame specifying chromosome, marker and position (in cM)
parental_phase: Phasing of the markers in the parents, as given in the input phased_maplist
marginal.likelihoods: A list of marginal likelihoods of different valencies if method "hmm" was used, otherwise NULL
valency: The predicted valency that maximised the marginal likelihood, per offspring. For method "heur", NULL
offspring: Offspring names
biv_dec: Logical, whether bivalent decoding was used in the estimation of the F1 IBD probabilities.
gap: The size of the gap (in cM) used when interpolating the IBD probabilities. See function spline_IBD for details.
genocodes: Ordered list of genotype codes used to represent different genotype classes.
pairing: log likelihoods of each of the different pairing scenarios considered (can be used e.g. for post-mapping check of preferential pairing)
ploidy: ploidy of parent 1
ploidy2: ploidy of parent 2
method: The method used, either "hmm" (default) or "heur". See argument method
error: The error prior used, if method "hmm" was used, otherwise NULL

Arguments

input_type

Can be either one of 'discrete' or 'probabilistic'. For the former (default), dosage_matrix must be supplied, while for the latter probgeno_df must be supplied. Note that probabilistic genotypes can only be accepted if the method is default ('hmm').

genotypes

Marker genotypes, either a 2d matrix of integer marker scores or a data.frame of dosage probabilities. Details are as follows:

discrete :: If input_type is 'discrete', genotypes is a matrix of marker dosage scores with markers in rows and individuals in columns. Both (marker) rownames and (individual or sample) colnames are needed.

probabilistic :

If input_type is 'probabilistic', genotypes is a data frame as read from the scores file produced by function saveMarkerModels of R package fitPoly, or alternatively, a data frame containing at least the following columns:

SampleName :: Name of the sample (individual)

MarkerName :

Name of the marker

P0 :

Probabilities of dosage score '0'

P1, P2... etc. :

Probabilities of dosage score '1' etc. (up to max offspring dosage, e.g. P4 for tetraploid population)

phased_maplist

A list of phased linkage maps, the output of polymapR::create_phased_maplist

method

The method used to estimate IBD probabilities, either "hmm" or "heur". By default, the Hidden Markov Model (hmm) method is used. This uses an approach developed by Zheng et al (2016), and implemented in the 'TetraOrigin' package. However, unlike the original TetraOrigin software, it does not re-estimate parental linkage phase, as this is assumed to have been generated during map construction. Alternatively, a heuristic algorithm can be employed (method = "heur"), providing computational efficiency at higher ploidy levels (hexaploid, octoploid etc.), but at the cost of some accuracy. If method = "hmm" is specified, only diploid, triploid, autotetraploid and autohexaploid populations are currently allowed, while method = "heur" caters for all possible ploidy levels. Furthermore, the argument bivalent_decoding can only be set to FALSE in the case of the 'hmm' method (i.e. allowing for the possibility of multivalent formation and double reduction).

remove_markers

Optional vector of marker names to remove from the maps. Default is NULL.

ploidy

Integer. Ploidy of the organism.

ploidy2

Optional integer, by default NULL. Ploidy of parent 2, if different from parent 1.

parent1

Identifier of parent 1, by default assumed to be "P1"

parent2

Identifier of parent 2, by default assumed to be "P2"

individuals

By default "all" offspring are included, but otherwise a subset can be selected, using a vector of offspring indexing numbers (1,2, etc.) according to their order in dosage_matrix

log

Character string specifying the log filename to which standard output should be written. If NULL log is send to stdout.

map_function

Mapping function to use when converting map distances to recombination frequencies. Currently only "haldane" or "kosambi" are allowed.

bivalent_decoding

Option to consider only bivalent pairing during formation of gametes (ignored for diploid populations, as only bivalents possible there), by default TRUE

error

The (prior) probability of errors in the offspring dosages, usually assumed to be small but non-zero

full_multivalent_hexa

Option to allow multivalent pairing in both parents at the hexaploid level, by default FALSE. Note that if TRUE, a very large available RAM may be required (>= 32Gb) to process the data.

verbose

Logical, by default TRUE. Should progress messages be written?

ncores

How many CPU cores should be used in the evaluation? By default 1 core is used.

fix_threshold

If method = "heur", the threshold to fix the IBD probabilities while correcting for the sum of probabilities.

factor_dist

If method = "heur", the factor by which to increase or decrease the recombination frequencies as calculated from the map distances.

References

Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.
Hackett et al. (2013) Linkage analysis and QTL mapping using SNP dosage data in a tetraploid potato mapping population. PLoS One 8(5): e63939
Zheng et al. (2016) Probabilistic multilocus haplotype reconstruction in outcrossing tetraploids. Genetics 203: 119-131
Bourke P.M. (2014) QTL analysis in polyploids: Model testing and power calculations. Wageningen University (MSc thesis)

Examples

Run this code

data("phased_maplist.4x", "SNP_dosages.4x")
estimate_IBD(phased_maplist=phased_maplist.4x,genotypes=SNP_dosages.4x,ploidy=4)

Run the code above in your browser using DataLab