readVcf: Read VCF files

Description

Read Variant Call Format (VCF) files

Usage

"readVcf"(file, genome, param,  ..., row.names=TRUE) "readVcf"(file, genome, param,  ..., row.names=TRUE) "readVcf"(file, genome, param,  ..., row.names=TRUE) "readVcf"(file, genome, param,  ..., row.names=TRUE) "readVcf"(file, genome, param,  ..., row.names=TRUE) "readVcf"(file, genome, param,  ..., row.names=TRUE) "readVcf"(file, genome, param,  ..., row.names=TRUE) "readVcf"(file, genome, param,  ..., row.names=TRUE)
## Lightweight functions to read a single variable
readInfo(file, x, param=ScanVcfParam(), ..., row.names=TRUE)
readGeno(file, x, param=ScanVcfParam(), ..., row.names=TRUE)
readGT(file, nucleotides=FALSE, param=ScanVcfParam(), ..., row.names=TRUE)

Arguments

file

A TabixFile instance or character() name of the VCF file to be processed. When ranges are specified in param, file must be a TabixFile.

Use of the TabixFile methods are encouraged as they are more efficient than the character() methods. See ?TabixFile and ?indexTabix for help creating a TabixFile.

genome

A character or Seqinfo object.

character: Genome identifier as a single string or named character vector. Names of the character vector correspond to chromosome names in the file. This identifier replaces the genome information in the VCF Seqinfo (i.e., seqinfo(vcf)).
Seqinfo: When genome is provided as a Seqinfo it is propagated to the VCF output. If seqinfo information can be obtained from the file, (i.e., seqinfo(scanVcfHeader(fl)) is not empty), the output Seqinfo is a product of merging the two.
If a param (i.e., ScanVcfParam) is used in the call to readVcf, the seqlevels of the param ranges must be present in genome.

param

An instance of ScanVcfParam, GRanges, GRangesList or RangesList. VCF files can be subset on genomic coordinates (ranges) or elements in the VCF fields. Both genomic coordinates and VCF elements can be specified in a ScanVcfParam. See ?ScanVcfParam for details.

character name of single info or geno field to import. Applicable to readInfo and readGeno only.

row.names

A logical specifying if rownames should be returned. In the case of readVcf, rownames appear on the GRanges returned by the rowRanges accessor.

nucleotides

A logical indicating if genotypes should be returned as nucleotides instead of the numeric representation. Applicable to readGT only.

...

Additional arguments, passed to methods.

Value

readVcf returns a VCF object. See ?VCF for complete details of the class structure. readGT, readInfo and readGeno return a matrix.

rowRanges:: The CHROM, POS, ID and REF fields are used to create a GRanges object. Ranges are created using POS as the start value and width of the reference allele (REF). By default, the IDs become the rownames ('row.names = FALSE' to turn this off). If IDs are missing (i.e., ‘.’) a string of CHROM:POS_REF/ALT is used instead. The genome argument is stored in the seqinfo of the GRanges and can be accessed with genome().One metadata column, paramRangeID, is included with the rowRanges. This ID is meaningful when multiple ranges are specified in the ScanVcfParam and distinguishes which records match each range.
fixed:: REF, ALT, QUAL and FILTER fields of the VCF are parsed into a DataFrame.REF is returned as a DNAStringSet. ALT will be a character vector for structural variants and a DNAStringSetList otherwise. '*' characters are treated as missings and become an empty character in a DNAStringSetList. 'I' characters are treated as undefined and become a '.' in a DNAStringSetList.
info:: Data from the INFO field of the VCF is parsed into a DataFrame.
geno:: If present, the genotype data are parsed into a list of matrices or arrays. Each list element represents a field in the FORMAT column of the VCF file. Rows are the variants, columns are the samples.
colData:: This slot contains a DataFrame describing the samples. If present, the sample names following FORMAT in the VCF file become the row names.
metadata:: Header information present in the file is put into a list in metadata.

See references for complete details of the VCF file format.

Details

Data Import:

VCF object:

readVcf imports records from bzip compressed or uncompressed VCF files. Data are parsed into a VCF object using the file header information if available. To import a subset of ranges the VCF must have a Tabix index file. An index file can be created with bzip and indexTabix functions.

The readInfo, readGeno and readGT functions are lightweight versions of readVcf and import a single variable. The return object is a vector, matrix or CompressedList instead of a VCF class.

readVcf calls scanVcf, the details of which can be found with ?scanVcf.

Header lines (aka Meta-information):

readVcf() reads and parses fields according to the multiplicity and data type specified in the header lines. Fields without header lines are skipped (not read or parsed). To see what fields are present in the header use scanVcfHeader(). See ?VCFHeader for more details.

Passing verbose = TRUE to readVcf() prints the fields with header lines that will be parsed by readVcf.

Data type:

CHROM, POS, ID and REF fields are used to create the GRanges stored in the VCF object and accessible with the rowRanges accessor.

REF, ALT, QUAL and FILTER are parsed into the DataFrame in the fixed slot. Because ALT can have more than one value per variant it is represented as a DNAStringSetList. REF is a DNAStringSet, QUAL is numeric and FILTER is a character. Accessors include fixed, ref, alt, qual, and filt.

Data from the INFO field can be accessed with the info accessor. Genotype data (i.e., data immediately following the FORMAT field in the VCF) can be accessed with the geno accessor. INFO and genotype data types are determined according to the ‘Number’ and ‘Type’ information in the file header as follows:

‘Number’ should only be 0 when ‘Type’ is 'flag'. These fields are parsed as logical vectors. If ‘Number’ is 1, ‘info’ data are parsed into a vector and ‘geno’ into a matrix.

If ‘Number’ is >1, ‘info’ data are parsed into a DataFrame with the same number of columns. ‘geno’ are parsed into an array with the same dimensions as ‘Number’. Columns of the ‘geno’ matrices are the samples.

If ‘Number’ is ‘.’, ‘A’ or ‘G’, both ‘info’ and ‘geno’ data are parsed into a matrix.

When the header does not contain any ‘INFO’ lines, the data are returned as a single, unparsed column.

Missing data:

Missing data in VCF files are represented by a dot ("."). readVcf retains the dot as a character string for data type character and converts it to NA for data types numeric or double.

Because the data are stored in rectangular data structures there is a value for each info and geno field element in the VCF class. If the element was missing or was not collected for a particular variant the value will be NA.

Efficient Usage:

Subsets of data (i.e., specific variables, positions or samples) can be read from a VCF file by providing a ScanVcfParam object in the call to readVcf. Other lightweight options are the readGT, readInfo and readGeno functions which return data as a matrix instead of the VCF class. Another option for handling large files is to iterate through the data in chunks by setting the yieldSize parameter in a TabixFile object. Iteration can be through all data fields or a subset defined by a ScanVcfParam. See example below, `Iterating through VCF with yieldSize`.

References

http://vcftools.sourceforge.net/specs.html outlines the VCF specification.

http://samtools.sourceforge.net/mpileup.shtml contains information on the portion of the specification implemented by bcftools.

http://samtools.sourceforge.net/ provides information on samtools.

Examples

Run this code

  fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") 
  vcf <- readVcf(fl, "hg19")
  ## vcf <- readVcf(fl, c("20"="hg19"))  ## 'genome' as named vector

  ## ---------------------------------------------------------------------
  ## Header and genome information 
  ## ---------------------------------------------------------------------
  vcf

  ## all header information
  hdr <- header(vcf)

  ## header information for 'info' and 'fixed' fields
  info(hdr)
  fixed(hdr)

  ## ---------------------------------------------------------------------
  ## Accessors
  ## ---------------------------------------------------------------------
  ## fixed fields together
  head(fixed(vcf), 5)

  ## fixed fields separately 
  filt(vcf)
  ref(vcf) 

  ## info data 
  info(hdr)
  info(vcf)
  info(vcf)$DP

  ## geno data 
  geno(hdr)
  geno(vcf)
  head(geno(vcf)$GT)

  ## genome
  unique(genome(rowRanges(vcf)))

  ## ---------------------------------------------------------------------
  ## Data subsets with lightweight read* functions 
  ## ---------------------------------------------------------------------

  ## Import a single 'info' or 'geno' variable
  DP <- readInfo(fl, "DP")
  HQ <- readGeno(fl, "HQ")

  ## Import GT as numeric representation 
  GT <- readGT(fl)
  ## Import GT as nucleotides 
  GT <- readGT(fl, nucleotides=TRUE)

  ## ---------------------------------------------------------------------
  ## Data subsets with ScanVcfParam
  ## ---------------------------------------------------------------------

  ## Subset on genome coordinates:
  ## 'file' must have a Tabix index
  rngs <- GRanges("20", IRanges(c(14370, 1110000), c(17330, 1234600)))
  names(rngs) <- c("geneA", "geneB")
  param <- ScanVcfParam(which=rngs) 
  compressVcf <- bgzip(fl, tempfile())
  idx <- indexTabix(compressVcf, "vcf")
  tab <- TabixFile(compressVcf, idx)
  vcf <- readVcf(tab, "hg19", param)

  ## When data are subset by range ('which' argument in ScanVcfParam),
  ## the 'paramRangeID' column provides a map back to the original 
  ## range in 'param'.
  rowRanges(vcf)[,"paramRangeID"]
  vcfWhich(param)

  ## Subset on samples:
  ## Consult the header for the sample names.
  samples(hdr) 
  ## Specify one or more names in 'samples' in a ScanVcfParam.
  param <- ScanVcfParam(samples="NA00002")
  vcf <- readVcf(tab, "hg19", param)
  geno(vcf)$GT

  ## Subset on 'fixed', 'info' or 'geno' fields:
  param <- ScanVcfParam(fixed="ALT", geno=c("GT", "HQ"), info=c("NS", "AF"))
  vcf_tab <- readVcf(tab, "hg19", param)
  info(vcf_tab)
  geno(vcf_tab)

  ## No ranges are specified in the 'param' so tabix file is not
  ## required. Instead, the uncompressed VCF can be used as 'file'.
  vcf_fname <- readVcf(fl, "hg19", param)

  ## The header will always contain information for all variables
  ## in the original file reguardless of how the data were subset.
  ## For example, all 'geno' fields are listed in the header 
  geno(header(vcf_fname))

  ## but only 'GT' and 'HQ' are present in the VCF object.
  geno(vcf_fname)

  ## Subset on both genome coordinates and 'info', 'geno' fields: 
  param <- ScanVcfParam(geno="HQ", info="AF", which=rngs)
  vcf <- readVcf(tab, "hg19", param)

  ## When any of 'fixed', 'info' or 'geno' are omitted (i.e., no
  ## elements specified) all records are retrieved. Use NA to indicate
  ## that no records should be retrieved. This param specifies
  ## all 'fixed fields, the "GT" 'geno' field and none of 'info'.
  ScanVcfParam(geno="GT", info=NA)

  ## ---------------------------------------------------------------------
  ## Iterate through VCF with 'yieldSize' 
  ## ---------------------------------------------------------------------
  fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
  param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"), info=c("LDAF"))
  tab <- TabixFile(fl, yieldSize=4000)
  open(tab)
  while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
      cat("vcf dim:", dim(vcf_yield), "\n")
  close(tab)

  ## ---------------------------------------------------------------------
  ## Debugging with 'verbose'
  ## ---------------------------------------------------------------------
  ## readVcf() uses information in the header lines to parse the data to 
  ## the correct number and type. Fields without header lines are skipped. 
  ## If a call to readVcf() results in no info(VCF) or geno(VCF) data the
  ## file may be missing header lines. Set 'verbose = TRUE' to get
  ## a listing of fields found in the header.

  ## readVcf(myfile, "mygenome", verbose=TRUE)

  ## Header fields can also be discovered with scanVcfHeader().
  hdr <- scanVcfHeader(fl)
  geno(hdr)

Run the code above in your browser using DataLab