VCF2RADdata: Create a RADdata Object from a VCF File

Description

This function reads a Variant Call Format (VCF) file containing allelic read depth and SNP alignment positions, such as can be produced by TASSEL or GATK, and generates a RADdata dataset to be used for genotype calling in polyRAD.

Usage

VCF2RADdata(file, phaseSNPs = TRUE, tagsize = 80, refgenome = NULL, 
            tol = 0.01, al.depth.field = "AD", min.ind.with.reads = 200, 
            min.ind.with.minor.allele = 10, possiblePloidies = list(2), 
            taxaPloidy = 2L, contamRate = 0.001, 
            samples = VariantAnnotation::samples(VariantAnnotation::scanVcfHeader(file)),
            svparam = VariantAnnotation::ScanVcfParam(fixed = "ALT", info = NA,
                                                      geno = al.depth.field,
                                                      samples = samples), 
            yieldSize = 5000, expectedAlleles = 5e+05, expectedLoci = 1e+05,
            maxLoci = NA)

Value

A RADdata object.

Arguments

file: The path to a VCF file to be read. This can be uncompressed, bgzipped using Samtools or Bioconductor, or a TabixFile object from Bioconductor.
phaseSNPs: If TRUE, markers that appear to have come from the same set of reads will be phased and grouped into haplotypes. Otherwise, each row of the file will be kept as a distinct marker.
tagsize: The read length, minus any barcode sequence, that was used for genotyping. In TASSEL, this is the same as the kmerLength option. This argument is used for grouping SNPs into haplotypes and is ignored if phaseSNPs = FALSE.
refgenome: Optional. The name of a FASTA file, or an FaFile object, containing the reference genome. When grouping SNPs into haplotypes, if provided this reference genome is used to insert non-variable nucleotides between the variable nucleotides in the alleleNucleotides slot of the RADdata output. Ignored if phaseSNPs = FALSE. Useful if exact SNP positions need to be retained for downstream analysis after genotype calling in polyRAD. In particular this argument is necessary if you plan to export genotype calls back to VCF.
tol: The proportion by which two SNPs can differ in read depth and still be merged into one group for phasing. Ignored if phaseSNPs = FALSE.
al.depth.field: The name of the genotype field in the VCF file that contains read depth at each allele. This should be "AD" unless your format is very unusual.
min.ind.with.reads: Integer used for filtering SNPs. To be retained, a SNP must have at least this many samples with reads.
min.ind.with.minor.allele: Integer used for filtering SNPs. To be retained, a SNP must have at least this many samples with the minor allele. When there are more than two alleles, at least two alleles must have at least this many samples with reads for the SNP to be retained.
possiblePloidies: A list indicating inheritance modes that might be encountered in the dataset. See RADdata.
taxaPloidy: A single integer, or an integer vector with one value per taxon, indicating ploidy. See RADdata.
contamRate: A number indicating the expected sample cross-contamination rate. See RADdata.
samples: A character vector containing the names of samples from the file to export to the RADdata object. The default is all samples. If a subset is provided, filtering with min.ind.with.reads and min.ind.with.minor.allele is performed within that subset. Ignored if a different samples argument is provided within svparam.
svparam: A ScanVcfParam object to be used with readVcf. The primary reasons to change this from the default would be 1) if you want additional FIXED or INFO fields from the file to be exported to the locTable slot of the RADdata object, and/or 2) if you only want to import particular regions of the genome, as specified with the which argument of ScanVcfParam.
yieldSize: An integer indicating the number of lines of the file to read at once. Increasing this number will make the function faster but consume more RAM.
expectedAlleles: An integer indicating the approximate number of alleles that are expected to be imported after filtering and phasing. If this number is too low, the function may slow down considerably. Increasing this number increases the amount of RAM used by the function.
expectedLoci: An integer indicating the approximate number of loci that are expected to be imported after filtering and phasing. If this number is too low, the function may slow down considerably. Increasing this number increases the amount of RAM used by the function.
maxLoci: An integer indicating the approximate maximum number of loci to return. If provided, the function will stop reading the file once it has found at least this many loci that pass filtering and phasing. This argument is intended to be used for generating small RADdata objects for testing purposes, and should be left NA under normal circumstances.

Author

Lindsay V. Clark

Details

This function requires the BioConductor package VariantAnnotation. See https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html for installation instructions.

If you anticipate running VCF2RADdata on the same file more than once, it is recommended to run bgzip and indexTabix from the package Rsamtools once before running VCF2RADdata. See examples.

min.ind.with.minor.allele is used for filtering SNPs as the VCF file is read. Additionally, because phasing SNPs into haplotypes can cause some haplotypes to fail to pass this threshold, VCF2RADdata internally runs MergeRareHaplotypes with min.ind.with.haplotype = min.ind.with.minor.allele, then RemoveMonomorphicLoci, before returning the final RADdata object.

References

Variant Call Format specification: http://samtools.github.io/hts-specs/

TASSEL GBSv2 pipeline: https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline

GATK: https://gatk.broadinstitute.org/hc/en-us

Tassel4-Poly: https://github.com/guilherme-pereira/tassel4-poly

Examples

Run this code

# get the example VCF installed with polyRAD
exampleVCF <- system.file("extdata", "Msi01genes.vcf", package = "polyRAD")

# \donttest{
# loading VariantAnnotation namespace takes >10s,
# so is excluded from CRAN checks

require(VariantAnnotation)

# Compress and index the VCF before reading, if not already done
if(!file.exists(paste(exampleVCF, "bgz", sep = "."))){
  vcfBG <- bgzip(exampleVCF)
  indexTabix(vcfBG, "vcf")
}

# Read into RADdata object
myRAD <- VCF2RADdata(exampleVCF, expectedLoci = 100, expectedAlleles = 500)

# Example of subsetting by genomic region (first 200 kb on Chr01)
mysv <- ScanVcfParam(fixed = "ALT", info = NA, geno = "AD",
                     samples = samples(scanVcfHeader(exampleVCF)),
                     which = GRanges("01", IRanges(1, 200000)))
myRAD2 <- VCF2RADdata(exampleVCF, expectedLoci = 100, expectedAlleles = 500,
                      svparam = mysv, yieldSize = NA_integer_)
# }

Run the code above in your browser using DataLab