SeqArray (version 1.12.5)

seqVCF2GDS: Reformat VCF Files

Description

Reformats Variant Call Format (VCF) files into a SeqArray GDS file.

Usage

seqVCF2GDS(vcf.fn, out.fn, header=NULL, storage.option="ZIP_RA", info.import=NULL, fmt.import=NULL, genotype.var.name="GT", ignore.chr.prefix="chr", reference=NULL, start=1L, count=-1L, optimize=TRUE, raise.error=TRUE, digest=TRUE, parallel=FALSE, verbose=TRUE)

Arguments

vcf.fn
the file name(s) of VCF format; or a connection object
out.fn
the file name of output GDS file
header
if NULL, header is set to be seqVCF_Header(vcf.fn)
storage.option
specify the storage and compression options, by default seqStorageOption("ZIP_RA"); or "LZMA_RA" to use the LZMA compression algorithm, which has a higher compression ratio
info.import
characters, the variable name(s) in the INFO field for import; or NULL for all variables
fmt.import
characters, the variable name(s) in the FORMAT field for import; or NULL for all variables
genotype.var.name
the ID for genotypic data in the FORMAT column; "GT" by default, VCFv4.0
ignore.chr.prefix
a character vector giving the chromosome prefix(es) that should be ignored, like "chr"; matching is not case-sensitive
reference
genome reference, like "hg19", "GRCh37"; if the genome reference is not available in VCF files, users could specify the reference here
start
the starting variant if importing part of VCF files
count
the maximum number of variants to import if importing part of VCF files; -1 means importing to the end
optimize
if TRUE, optimize the access efficiency by calling cleanup.gds
raise.error
TRUE: throw an error if numeric conversion fails; FALSE: get missing value if numeric conversion fails
digest
a logical value (TRUE/FALSE) or a character ("md5", "sha1", "sha256", "sha384" or "sha512"); add hash codes to the GDS file if TRUE or a digest algorithm is specified
parallel
FALSE (serial processing), TRUE (parallel processing), a numeric value indicating the number of cores, or a cluster object for parallel processing; parallel is passed to the argument cl in seqParallel, see seqParallel for more details
verbose
if TRUE, show information
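
As a minimal sketch of how the import arguments combine, the call below imports at most the first 100 variants and only two INFO fields from the example VCF shipped with the package. It assumes that the example file defines INFO IDs named "AC" and "AN"; substitute names from your own VCF header (see seqVCF_Header).

```r
library(SeqArray)

vcf.fn <- seqExampleFileName("vcf")
out.fn <- tempfile(fileext = ".gds")

# import at most 100 variants, starting from the first one;
# "AC" and "AN" are assumed to be INFO IDs present in the VCF header
seqVCF2GDS(vcf.fn, out.fn, info.import = c("AC", "AN"),
    start = 1L, count = 100L, verbose = FALSE)

# count the variants that were actually imported
f <- seqOpen(out.fn)
n.variant <- length(seqGetData(f, "variant.id"))
seqClose(f)

# remove the temporary file
unlink(out.fn, force = TRUE)
```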

Value

Returns the file name of the output GDS file with an absolute path.

Details

GDS -- Genomic Data Structures, used for storing array-oriented genetic data; the file format is defined in the gdsfmt package.

VCF -- The Variant Call Format (VCF), which is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations.

If there is more than one file in vcf.fn, seqVCF2GDS will merge all VCF files together, provided they contain the same samples. This is useful when the data are divided by chromosome.
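
A sketch of the multi-file case, under the assumption that the data are split into per-chromosome files with identical samples. The input names "chr1.vcf.gz" and "chr2.vcf.gz" and the output name "merged.gds" are hypothetical placeholders, so the conversion only runs if those files exist.

```r
library(SeqArray)

# hypothetical per-chromosome files; replace with your own paths
vcf.files <- c("chr1.vcf.gz", "chr2.vcf.gz")

if (all(file.exists(vcf.files))) {
    # a character vector of file names is merged into a single GDS file,
    # provided the files contain the same samples
    seqVCF2GDS(vcf.files, "merged.gds")
}
```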

The real numbers in the VCF file(s) are stored in 32-bit floating-point format by default. Users can set storage.option=seqStorageOption(float.mode="float64") to switch to 64-bit floating point format. Or packed real numbers can be adopted by setting storage.option=seqStorageOption(float.mode="packedreal16:scale=0.0001").
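
For illustration, the following sketch converts the bundled example VCF while storing real numbers as 64-bit floats. The temporary output path is chosen here only for the example.

```r
library(SeqArray)

vcf.fn <- seqExampleFileName("vcf")
out.fn <- tempfile(fileext = ".gds")

# store real-valued fields in 64-bit instead of the default 32-bit format
seqVCF2GDS(vcf.fn, out.fn,
    storage.option = seqStorageOption(float.mode = "float64"),
    verbose = FALSE)

converted <- file.exists(out.fn)

# remove the temporary file
unlink(out.fn, force = TRUE)
```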

By default, the compression method is "ZIP_RA" (zlib algorithm with default compression level + independent data blocks). Users can maximize the compression ratio by storage.option="ZIP_RA.max" or storage.option=seqStorageOption("ZIP_RA.max"). LZ4 (http://cyan4973.github.io/lz4/) is an option via storage.option="LZ4_RA" or storage.option=seqStorageOption("LZ4_RA"). LZMA (xz, http://tukaani.org/xz/) is another option via storage.option="LZMA_RA" or storage.option=seqStorageOption("LZMA_RA"), and it is known to have a higher compression ratio than zlib.
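
One way to compare these options on your own data is to convert the same VCF once per method and inspect the resulting file sizes. The sketch below does this for the bundled example file; it assumes the installed gdsfmt build supports all four methods, and the resulting sizes depend on the data, so none are shown here.

```r
library(SeqArray)

vcf.fn <- seqExampleFileName("vcf")
methods <- c("ZIP_RA", "ZIP_RA.max", "LZ4_RA", "LZMA_RA")

# convert once per compression method and record the output size in bytes
sizes <- vapply(methods, function(m) {
    out.fn <- tempfile(fileext = ".gds")
    on.exit(unlink(out.fn, force = TRUE))
    seqVCF2GDS(vcf.fn, out.fn, storage.option = m, verbose = FALSE)
    file.size(out.fn)
}, numeric(1))

print(sizes)
```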

References

Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156-2158.

See Also

seqVCF_Header, seqStorageOption, seqMerge, seqGDS2VCF

Examples

library(SeqArray)

# the VCF file
vcf.fn <- seqExampleFileName("vcf")

# conversion
seqVCF2GDS(vcf.fn, "tmp.gds")

# conversion in parallel
seqVCF2GDS(vcf.fn, "tmp_p2.gds", parallel=2L)


# display
(f <- seqOpen("tmp.gds"))
seqClose(f)



# convert without the INFO fields
seqVCF2GDS(vcf.fn, "tmp.gds", info.import=character(0))

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)



# convert without the INFO and FORMAT fields
seqVCF2GDS(vcf.fn, "tmp.gds", info.import=character(0), fmt.import=character(0))

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)


# delete the temporary files
unlink(c("tmp.gds", "tmp_p2.gds"), force=TRUE)
