haplofreq: Evaluates the Occurrence of Haplotype Segments in Particular Breeds

Description

For each haplotype from thisBreed and for every SNP, the occurrence of the haplotype segment containing the SNP in a set of reference breeds is evaluated. The maximum frequency each segment has in one of these reference breeds is computed, and the breed in which the segment has maximum frequency is identified. Results are either returned in a list or saved to files.

Usage

haplofreq(files, phen, map, thisBreed, refBreeds="others", minSNP=20, minL=1.0, 
  unitL="Mb", ubFreq=0.01, keep=NULL, skip=NA, cskip=NA, w.dir=NA, 
  what=c("freq", "match"), cores=1, quiet=FALSE)

Value

If w.dir=NA then a list is returned. The list may have the following components:

freq: Mx(2N) - matrix containing for every SNP and for each of the 2N haplotypes from thisBreed the maximum frequency the segment containing the SNP has in the reference breeds.
match: Mx(2N) - matrix containing for every SNP and for each of the 2N haplotypes from thisBreed the first letter of the name of the reference breed in which the segment containing the SNP has maximum frequency. Segments with frequencies smaller than ubFreq in all reference breeds are marked as '1', which indicates that the segment is native for thisBreed.
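
As a rough illustration of how the two components relate, the following sketch uses made-up frequencies for three SNPs of a single haplotype. The matrix segFreq and its values are hypothetical and are not produced by haplofreq; the sketch only mirrors the rules stated above (maximum over the reference breeds, first letter of the breed name, and '1' for segments below ubFreq).

```r
# Hypothetical frequencies of the segments containing three SNPs of one
# haplotype, in each of three reference breeds (toy values, an assumption).
segFreq <- matrix(c(0.200, 0.050, 0.000,
                    0.004, 0.002, 0.000,
                    0.000, 0.300, 0.100),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("SNP", 1:3),
                                  c("Rotbunt", "Holstein", "Fleckvieh")))
ubFreq <- 0.01

# Maximum frequency over the reference breeds (analogous to component 'freq')
maxFreq <- apply(segFreq, 1, max)

# First letter of the breed with maximum frequency (analogous to 'match')
origin <- substr(colnames(segFreq)[apply(segFreq, 1, which.max)], 1, 1)

# Segments rarer than ubFreq in all reference breeds are marked as native
origin[maxFreq < ubFreq] <- "1"

print(maxFreq)
print(origin)
```

In this toy case the second SNP lies in a segment that is rarer than ubFreq in every reference breed, so it is the only one marked as '1'.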

The list has attributes thisBreed and map.

If w.dir is the name of a directory, then results are written to files, whereby each file corresponds to one chromosome, and a data frame with file names is returned.

Arguments

files

Either a character vector with file names, or a list containing character vectors with file names. The files contain phased genotypes, one file for each chromosome. File names must contain the chromosome name as specified in the map in the form "ChrNAME.", e.g. "Breed2.Chr1.phased". The required format of the marker files is described under Details.

If files is a character vector, then the genotypes of all animals must be in the same files. Alternatively, files can be a list with the following two components:

hap.thisBreed: Character vector with names of the phased marker files for the individuals from thisBreed, one file for each chromosome.

hap.refBreeds: Character vector with names of the phased marker files for the individuals from the reference breeds (refBreeds), one file for each chromosome. If this component is missing, then it is assumed that the haplotypes of these animals are also included in hap.thisBreed.
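
The two ways of specifying files can be sketched as follows. The directory and all file names are hypothetical and the files are not created here; the last step only checks that each name contains the chromosome in the form "ChrNAME.".

```r
# Hypothetical directory and chromosome names (assumptions for illustration)
dir <- "phased"
chr <- 1:2

# Variant 1: all animals in the same files, one file per chromosome
files <- file.path(dir, paste0("Chr", chr, ".phased"))

# Variant 2: separate files for thisBreed and for the reference breeds
filesList <- list(
  hap.thisBreed = file.path(dir, paste0("Angler.Chr", chr, ".phased")),
  hap.refBreeds = file.path(dir, paste0("Others.Chr", chr, ".phased"))
)

# Check that each file name contains "ChrNAME." for its chromosome
ok <- mapply(function(f, k) grepl(paste0("Chr", k, "."), f, fixed = TRUE),
             files, chr)
print(ok)
```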

phen

Data frame containing the ID (column "Indiv") and the breed name (column "Breed") of each genotyped individual.

map

Data frame providing the marker map with columns including marker name 'Name', chromosome number 'Chr', and possibly the position on the chromosome in mega base pairs 'Mb', and the position in centimorgan 'cM'. The order of the markers must be the same as in the marker files specified by files. Marker names must have no white spaces.
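
A minimal sketch of a map with the columns described above is given below. The data frame toyMap and its values are hypothetical; the check for white space reflects the requirement on marker names.

```r
# Hypothetical marker map with four markers on two chromosomes
toyMap <- data.frame(
  Name = c("snp1", "snp2", "snp3", "snp4"),
  Chr  = c(1, 1, 2, 2),
  Mb   = c(0.10, 0.85, 0.20, 1.40),
  cM   = c(0.12, 0.90, 0.25, 1.55),
  stringsAsFactors = FALSE
)

# Marker names must not contain white space
stopifnot(!any(grepl("[[:space:]]", toyMap$Name)))

# Markers per chromosome, i.e. the expected number of rows in each marker file
print(table(toyMap$Chr))
```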

thisBreed

Name of a breed from column Breed in phen: The occurrence of each haplotype segment from this breed in the reference breeds will be evaluated.

refBreeds

Vector with names of breeds from column Breed in phen. These breeds are used as reference breeds. The occurrence of haplotype segments in these breeds will be evaluated. By default, all breeds in phen except thisBreed are used as reference breeds. In contrast, for refBreeds="all", all genotyped breeds are used as reference breeds.

minSNP

Minimum number of marker SNPs included in a segment.

minL

Minimum length of a segment in unitL (e.g. in cM or Mb).

unitL

The unit for measuring the length of a segment. Possible units are the number of marker SNPs included in the segment ('SNP'), the number of mega base pairs ('Mb'), and the genetic distance between the first and the last marker in centimorgan ('cM'). In the last two cases the map must include columns with the respective names.
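
To make the three units concrete, the sketch below computes the length of one segment from a toy map. The helper segLength and the map values are hypothetical and not part of optiSel. It assumes that the length in 'Mb' or 'cM' is the difference between the positions of the last and the first marker, and that a segment must satisfy both minSNP and minL to be considered.

```r
# Hypothetical map of one chromosome (toy values)
toyMap <- data.frame(Name = paste0("snp", 1:6),
                     Chr  = 1,
                     Mb   = c(0.1, 0.4, 0.9, 1.3, 2.0, 2.6),
                     cM   = c(0.1, 0.5, 1.1, 1.4, 2.3, 3.0))

# Hypothetical helper: length of the segment from marker 'first' to marker
# 'last' (row indices of the map), measured in the given unit
segLength <- function(map, first, last, unitL = c("Mb", "cM", "SNP")) {
  unitL <- match.arg(unitL)
  if (unitL == "SNP") {
    last - first + 1                           # number of marker SNPs
  } else {
    map[[unitL]][last] - map[[unitL]][first]   # difference in map positions
  }
}

nSNP <- segLength(toyMap, 2, 5, "SNP")
lenMb <- segLength(toyMap, 2, 5, "Mb")
lencM <- segLength(toyMap, 2, 5, "cM")

# Assumed rule: the segment counts only if both thresholds are met
minSNP <- 3
minL   <- 1.0
passes <- nSNP >= minSNP && lenMb >= minL
print(c(nSNP = nSNP, Mb = lenMb, cM = lencM))
print(passes)
```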

ubFreq

If a haplotype segment has frequency smaller than ubFreq in all reference breeds then the breed name is replaced by '1', which indicates that the segment is native.

keep

Subset of the IDs of the individuals from data frame phen, or a logical vector indicating the animals in data frame phen that should be used. The default keep=NULL means that all individuals included in phen will be considered.
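
The two accepted forms of keep can be illustrated with a small, hypothetical phenotype table (toyPhen and its IDs are assumptions made for this sketch):

```r
# Hypothetical phenotype table with the columns "Indiv" and "Breed"
toyPhen <- data.frame(
  Indiv = c("A1", "A2", "H1", "H2", "R1"),
  Breed = c("Angler", "Angler", "Holstein", "Holstein", "Rotbunt"),
  stringsAsFactors = FALSE
)

# Variant 1: keep given as a subset of the IDs
keepIDs <- c("A1", "A2", "H1")

# Variant 2: keep given as a logical vector with one entry per row of toyPhen
keepLgl <- toyPhen$Indiv %in% keepIDs

# Both variants select the same individuals
print(identical(toyPhen$Indiv[keepLgl], keepIDs))
```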

skip

Take line skip+1 of the files as the line with column names. By default, the number is determined automatically.

cskip

Take column cskip+1 of the files as the first column with genotypes. By default, the number is determined automatically.

w.dir

Output file directory. Writing results to files has the advantage that much less working memory is required. By default, no files are created. The function returns only the file names if files are created.

what

For what="freq", the maximum frequency each haplotype segment has in the reference breeds will be computed. For what="match", the name of the reference breed in which the segment has maximum frequency will be determined. By default, both the frequencies and the breed names are determined.

cores

Number of cores to be used for parallel processing of chromosomes. By default one core is used. For cores=NA the number of cores will be chosen automatically. Using more than one core can increase execution time if the function is already fast, because parallel processing adds overhead.

quiet

Should console output be suppressed?

Author

Robin Wellmann

Details

Marker file format: Each marker file containing phased genotypes has a header and no row names. Cells are separated by blank spaces. The number of rows is equal to the number of markers from the respective chromosome and the markers are in the same order as in the map. The first cskip columns are ignored. The remaining columns contain genotypes of individuals written as two alleles separated by a character, e.g. A/B, 0/1, A|B, A B, or 0 1. The same two symbols must be used for all markers. Column names are the IDs of the individuals. If the blank space is used as separator, then the ID of each individual should be repeated in the header to get a regular delimited file. The columns to be skipped and the individual IDs must have no white spaces.
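
The format can be illustrated by writing and re-reading a tiny marker file. The file name, marker names, and genotypes below are hypothetical, and the reading code is only one way to inspect such a file with base R, not the parser used by haplofreq.

```r
# Write a small hypothetical marker file: one column to be skipped
# (cskip = 1), then one genotype column per individual, alleles separated by "|"
toyFile <- file.path(tempdir(), "Toy.Chr1.phased")
writeLines(c("Name A1 A2",
             "snp1 0|1 1|1",
             "snp2 1|0 0|1",
             "snp3 0|0 1|0"), toyFile)

# Read it back; colClasses keeps the genotypes as character strings
geno <- read.table(toyFile, header = TRUE, colClasses = "character",
                   check.names = FALSE)

# Split the genotypes of individual A1 into its two haplotypes
alleles <- do.call(rbind, strsplit(geno$A1, "|", fixed = TRUE))
hap1 <- alleles[, 1]
hap2 <- alleles[, 2]
print(hap1)
print(hap2)
```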

Examples

data(map)
data(Cattle)
dir   <- system.file("extdata", package="optiSel")
files <- file.path(dir, paste("Chr", 1:2, ".phased", sep=""))

Freq <- freqlist(
 haplofreq(files, Cattle, map, thisBreed="Angler", refBreeds="Rotbunt",   minL=2.0),
 haplofreq(files, Cattle, map, thisBreed="Angler", refBreeds="Holstein",  minL=2.0),
 haplofreq(files, Cattle, map, thisBreed="Angler", refBreeds="Fleckvieh", minL=2.0)
  )

plot(Freq, ID=1, hap=2, refBreed="Rotbunt")
plot(Freq, ID=1, hap=2, refBreed="Holstein", Chr=1)

## Test for using multiple cores:

Freq1 <- haplofreq(files, Cattle, map, thisBreed="Angler", refBreeds="Rotbunt", 
                   minL=2.0, cores=NA)$freq
range(Freq[[1]]-Freq1)
#[1] 0 0

## Creating output files with allele frequencies and allele origins:
rdir  <- system.file("extdata", package = "optiSel")
wdir  <- file.path(tempdir(), "HaplotypeEval")
chr   <- unique(map$Chr)
files <- file.path(rdir, paste("Chr", chr, ".phased", sep=""))
wfile <- haplofreq(files, Cattle, map, thisBreed="Angler", minL=2.0, w.dir=wdir)

View(read.table(wfile$match[1],skip=1))
#unlink(wdir, recursive = TRUE)
