Learn R Programming

PopGenome (version 2.7.2)

readVCF: Read SNP data in tabixed VCF format

Description

This function reads tabixed VCF-files, as distributed from the 1000 Genomes project (human).

Usage

readVCF(filename, numcols, tid, frompos, topos,
        samplenames=NA, gffpath = FALSE, include.unknown=FALSE, approx=FALSE,
	out="", parallel=FALSE)

Arguments

filename

the corresponding tabixed VCF-file

numcols

number of SNPs that should be read in as a chunk

tid

which chromosome ? (character)

frompos

start of the region

topos

end of the region

samplenames

a vector of individuals

gffpath

the corresponding GFF file

include.unknown

includ positions with unknown/missing nucleotides

approx

see details !

out

a folder suffix where the temporary files should be saved

parallel

parallel computation using mclapply

Value

The function creates an object of class "GENOME" --------------------------------------------------------- The following slots will be filled in the "GENOME" object ---------------------------------------------------------

Slot Description
1. n.sites total number of sites
2. n.biallelic.sites number of biallelic sites
3. region.data some detailed information about the data read
4. region.names names of regions

Details

The readVCF function expects a tabixed VCF file with a diploid GT field. In case of haploid data, the GT field has to be transformed to a pseudo-diploid field (such as 0 -> 0|0). An alternative is to use readData(..., format="VCF"), which can read non-tabixed haploid and any kind of polyploid VCFs directly. When approx=TRUE, the algorithm will apply a logical OR to the GT-field: (0|0=0,1|0=1,0|1=1,1|1=1). Note, this is an approximation for diploid data, which will speed up calculations. In case of haploid data, approx should be switched to TRUE. If approx=FALSE, the full diploid information will be considered. The ff-package PopGenome uses to store the SNP information limits total data size to individuals * (number of SNPs) <= .Machine$integer.max In case of very large data sets, the bigmemory package will be used; this will slow down calculations (e.g. this package have to be installed first !!!). Use the function vcf_handle <-.Call("VCF_open", filename) to open a VCF-file and .Call("VCF_getSampleNames",vcf_handle) to get and define the individuals which should be considered in the analysis. See also readData(..., format="VCF") !

Examples

Run this code
# NOT RUN {
# GENOME.class <- readVCF("...\chr1.vcf.gz", 1000, "1", 1, 100000)
# GENOME.class
# GENOME.class@region.names
# GENOME.class <- neutrality.stats(GENOME.class,FAST=TRUE)
# show the result:
# get.sum.data(GENOME.class)
# GENOME.class@region.data

# }

Run the code above in your browser using DataLab