SeqArray: Big Data Management of Whole-genome Sequence Variant Calls

GNU General Public License, GPLv3

Features

Big data management of whole-genome sequence variant calls with thousands of individuals: genotypic data (e.g., SNVs, indels and structural variation calls) and annotations in GDS files are stored in an array-oriented and compressed manner, with efficient data access using the R programming language.

The SeqArray package is built on top of the Genomic Data Structure (GDS) data format and defines the data structure required for a SeqArray file. GDS is a flexible and portable data container with a hierarchical structure for storing multiple scalable, array-oriented data sets. It is suited to large-scale datasets, especially data that are much larger than the available random-access memory. It also offers efficient operations designed specifically for integers of fewer than 8 bits, since a single genetic/genomic variant, such as a SNP, usually occupies less than a byte. Data compression and decompression are available with relatively efficient random access. A high-level R interface to GDS files is available in the package gdsfmt (http://bioconductor.org/packages/gdsfmt).
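
To make the sub-byte storage and random-access points concrete, here is a minimal sketch using the gdsfmt interface directly. The file path, the node name "geno" and the toy values are assumptions made up for illustration; they are not part of SeqArray's own file layout.

```r
library(gdsfmt)

# Hypothetical scratch file; any writable path works
fn <- tempfile(fileext = ".gds")
f <- createfn.gds(fn)

# Toy genotype-like codes in 0..3, so each value fits in 2 bits
set.seed(1)
geno <- sample(0:3, 1000, replace = TRUE)

# Store as 2-bit integers with a random-access zlib stream
node <- add.gdsn(f, "geno", geno, storage = "bit2",
                 compress = "ZIP_RA", closezip = TRUE)

# Read everything back, then read only a small slice
all.back <- read.gdsn(node)
part.back <- read.gdsn(node, start = 101, count = 10)

closefn.gds(f)
unlink(fn)
```

Reading a slice with start and count avoids loading the whole array, which is the access pattern that matters once the data no longer fit in memory.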

Bioconductor:

Release Version: v1.12.5

http://www.bioconductor.org/packages/release/bioc/html/SeqArray.html

Development Version: v1.13.3

http://www.bioconductor.org/packages/devel/bioc/html/SeqArray.html

Installation (requiring >=R_v3.3.0)

  • Bioconductor repository:
source("http://bioconductor.org/biocLite.R")
biocLite("SeqArray")
  • Development version from GitHub:
library("devtools")
install_github("zhengxwen/gdsfmt")
install_github("zhengxwen/SeqArray")

The install_github() approach requires that you build from source, i.e. make and compilers must be installed on your system -- see the R FAQ for your operating system; you may also need to install dependencies manually.

  • Install the package from the source code:

gdsfmt, SeqArray

wget --no-check-certificate https://github.com/zhengxwen/gdsfmt/tarball/master -O gdsfmt_latest.tar.gz
wget --no-check-certificate https://github.com/zhengxwen/SeqArray/tarball/master -O SeqArray_latest.tar.gz
R CMD INSTALL gdsfmt_latest.tar.gz
R CMD INSTALL SeqArray_latest.tar.gz

## Or
curl -L https://github.com/zhengxwen/gdsfmt/tarball/master/ -o gdsfmt_latest.tar.gz
curl -L https://github.com/zhengxwen/SeqArray/tarball/master/ -o SeqArray_latest.tar.gz
R CMD INSTALL gdsfmt_latest.tar.gz
R CMD INSTALL SeqArray_latest.tar.gz

SeqArray File Download

Examples

library(SeqArray)

gds.fn <- seqExampleFileName("gds")

# open a GDS file
f <- seqOpen(gds.fn)

# display the contents of the GDS file
f

# close the file
seqClose(f)
## Object of class "SeqVarGDSClass"
## File: SeqArray/extdata/CEU_Exon.gds (387.3K)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { VStr8 90 ZIP_ra(30.8%), 222B }
## |--+ variant.id   { Int32 1348 ZIP_ra(35.7%), 1.9K }
## |--+ position   { Int32 1348 ZIP_ra(86.4%), 4.6K }
## |--+ chromosome   { VStr8 1348 ZIP_ra(2.66%), 91B }
## |--+ allele   { VStr8 1348 ZIP_ra(17.2%), 928B }
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x90x1348 ZIP_ra(28.4%), 16.8K } *
## |  |--+ ~data   { Bit2 2x1348x90 ZIP_ra(36.0%), 21.3K } *
## |  |--+ extra.index   { Int32 3x0 ZIP_ra, 17B } *
## |  \--+ extra   { Int16 0 ZIP_ra, 17B }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 90x1348 ZIP_ra(0.36%), 55B } *
## |  |--+ ~data   { Bit1 1348x90 ZIP_ra(0.36%), 55B } *
## |  |--+ extra.index   { Int32 3x0 ZIP_ra, 17B } *
## |  \--+ extra   { Bit1 0 ZIP_ra, 17B }
## |--+ annotation   [  ]
## |  |--+ id   { VStr8 1348 ZIP_ra(41.0%), 5.8K }
## |  |--+ qual   { Float32 1348 ZIP_ra(0.91%), 49B }
## |  |--+ filter   { Int32,factor 1348 ZIP_ra(0.89%), 48B } *
## |  |--+ info   [  ]
## |  |  |--+ AA   { VStr8 1348 ZIP_ra(24.2%), 653B } *
## |  |  |--+ AC   { Int32 1348 ZIP_ra(27.2%), 1.4K } *
## |  |  |--+ AN   { Int32 1348 ZIP_ra(20.6%), 1.1K } *
## |  |  |--+ DP   { Int32 1348 ZIP_ra(62.6%), 3.3K } *
## |  |  |--+ HM2   { Bit1 1348 ZIP_ra(117.2%), 198B } *
## |  |  |--+ HM3   { Bit1 1348 ZIP_ra(117.2%), 198B } *
## |  |  |--+ OR   { VStr8 1348 ZIP_ra(14.0%), 238B } *
## |  |  |--+ GP   { VStr8 1348 ZIP_ra(34.4%), 5.3K } *
## |  |  \--+ BN   { Int32 1348 ZIP_ra(21.6%), 1.1K } *
## |  \--+ format   [  ]
## |     \--+ DP   [  ] *
## |        |--+ data   { Int32 90x1348 ZIP_ra(33.8%), 160.3K }
## |        \--+ ~data   { Int32 1348x90 ZIP_ra(32.2%), 152.8K }
## \--+ sample.annotation   [  ]
##    \--+ family   { VStr8 90 ZIP_ra(34.7%), 135B }
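
Once the file is open, the variables listed above can be read with seqGetData(), optionally after restricting the view with seqSetFilter(). The following is a small sketch on the bundled example file; the choice of the first 5 samples and first 10 variants is arbitrary and only for illustration.

```r
library(SeqArray)

# Reopen the bundled example file
f <- seqOpen(seqExampleFileName("gds"))

# Read identifier variables in full
sample.id <- seqGetData(f, "sample.id")
variant.id <- seqGetData(f, "variant.id")

# Restrict subsequent reads to a subset (arbitrary choice for illustration)
seqSetFilter(f, sample.id = sample.id[1:5], variant.id = variant.id[1:10])

# Genotypes come back as an array: ploidy x samples x variants
geno <- seqGetData(f, "genotype")
dim(geno)

# Other variables follow the same filter
pos <- seqGetData(f, "position")

# Clear the filter and close the file
seqResetFilter(f)
seqClose(f)
```

The filter lives on the open file object, so every later seqGetData() call sees the same subset until the filter is reset.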

Significant User-visible Changes (since v1.11.16)

  • seqSummary(gds, "genotype")$seldim returns a vector with 3 integers (ploidy, # of selected samples, # of selected variants) instead of 2 integers
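
As a quick check of this change, the sketch below sets a variant filter on the bundled example file and inspects seldim. The choice of the first 100 variants is arbitrary, and the verbose argument is assumed to be available in the installed version.

```r
library(SeqArray)

f <- seqOpen(seqExampleFileName("gds"))

# Select an arbitrary subset of variants for illustration
vid <- seqGetData(f, "variant.id")
seqSetFilter(f, variant.id = vid[1:100])

# Summarize the genotype variable under the current filter
s <- seqSummary(f, "genotype", verbose = FALSE)
s$dim     # dimensions of the full genotype array
s$seldim  # ploidy, # of selected samples, # of selected variants

seqClose(f)
```

Code that previously indexed seldim as a 2-element vector should be updated to account for the leading ploidy entry.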

Version

1.12.5

License

GPL-3

Maintainer

Xiuwen Zheng

Last Published

February 15th, 2017

Functions in SeqArray (1.12.5)

  • seqDelete: Delete GDS Variables
  • seqBED2GDS: Convert PLINK BED Format to SeqArray Format
  • seqClose-methods: Close the SeqArray GDS File
  • SeqArray-package: Big Data Management of Genome-wide Sequence Variants
  • seqApply: Apply Functions Over Array Margins
  • seqExampleFileName: Example files
  • seqGDS2SNP: Convert to a SNP GDS File
  • seqAlleleFreq: Get Allele Frequencies or Counts
  • seqDigest: Hash function digests
  • seqExport: Export to a GDS File
  • seqGDS2VCF: Convert to a VCF File
  • seqMerge: Merge Multiple SeqArray GDS Files
  • seqMissing: Missing genotype percentage
  • seqOptimize: Optimize the Storage of Data Array
  • seqOpen: Open a SeqArray GDS File
  • seqParallel: Apply Functions in Parallel
  • seqNumAllele: Number of alleles
  • seqParallelSetup: Setup a Parallel Environment
  • seqGetFilter: Get the Filter of GDS File
  • seqGetData: Get Data
  • seqSystem: Get the parameters in the GDS system
  • seqTranspose: Transpose Data Array
  • seqStorageOption: Storage and Compression Options
  • SeqVarGDSClass: SeqVarGDSClass
  • seqVCF_SampID: Get the Sample IDs
  • seqSummary: Summarize a SeqArray GDS File
  • seqSNP2GDS: Convert SNPRelate Format to SeqArray Format
  • seqVCF_Header: Parse the Header of a VCF File
  • seqSetFilter-methods: Set a Filter to Sample or Variant
  • seqVCF2GDS: Reformat VCF Files