Modern genomic datasets are big (large n), high-dimensional (large p), and multi-layered. Analyzing them poses two main challenges: memory requirements and computational demands. Our goal is to develop software that enables researchers to carry out analyses with big genomic data within the R environment.
The extdata folder contains example files that were generated from the 250k SNP and phenotype data in Atwell et al. (2010). Only the first 300 SNPs of chromosomes 1, 2, and 3 were included to keep the size of the example dataset small. PLINK was used to convert the data to .bed and .raw files. FT10 has been chosen as a phenotype and is provided as an alternate phenotype file. The file is intentionally shuffled to demonstrate that the additional phenotypes are put in the same order as the rest of the phenotypes when they are merged.
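As a rough sketch of how the example files might be loaded: the snippet below assumes that the BGData package (together with BEDMatrix and LinkedMatrix) is installed, and that the extdata folder contains files named chr1.bed, chr2.bed, chr3.bed, and pheno.txt. Those file names are assumptions and are not stated in the text above.

```r
# Sketch only: the body runs only if BGData is installed.
# The file names (chr1.bed, ..., pheno.txt) are assumed, not confirmed above.
if (requireNamespace("BGData", quietly = TRUE)) {
  path <- system.file("extdata", package = "BGData")

  # One file-backed genotype matrix per chromosome.
  chromosomes <- lapply(1:3, function(i) {
    BEDMatrix::BEDMatrix(file.path(path, paste0("chr", i, ".bed")))
  })

  # Link the three matrices by columns so they can be used like one matrix.
  geno <- do.call(LinkedMatrix::ColumnLinkedMatrix, chromosomes)

  # Build the container and merge the shuffled alternate phenotype file.
  bg <- BGData::as.BGData(
    geno,
    alternatePhenotypeFile = file.path(path, "pheno.txt")
  )

  print(dim(bg@geno))   # samples x variants
  print(head(bg@pheno)) # sample information with the merged phenotype
}
```

If the merge works as described, the rows of the phenotype table line up with the rows of the genotype matrix even though the alternate phenotype file is shuffled.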
We have identified several approaches to tackle the memory and computational challenges of big genomic data within R:
File-backed matrices: The data is stored on the hard drive, and users can read in smaller chunks when they are needed.
Linked arrays: For very large datasets a single file-backed array may not be enough or convenient. A linked array is an array whose content is distributed over multiple file-backed nodes.
Multiple dispatch: Methods are provided so that users can treat these arrays almost as if they were regular in-memory (RAM) arrays.
Multi-level parallelism: Exploit multi-core and multi-node computing.
Inputs: Users can create these arrays from standard formats (e.g., PLINK .bed).
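The file-backed idea from the list above can be illustrated with base R alone. This is a minimal sketch, not how BEDMatrix or BGData are implemented: the raw binary layout and the helper read_columns() are hypothetical and exist only for illustration.

```r
# Simulate a small genotype-like matrix (n samples x p variants).
set.seed(1)
n <- 50L
p <- 20L
X <- matrix(as.double(sample(0:2, n * p, replace = TRUE)), nrow = n, ncol = p)

# Store it on disk in column-major order as 8-byte doubles.
backing_file <- tempfile(fileext = ".bin")
writeBin(as.vector(X), backing_file, size = 8L)

# Hypothetical helper: read columns `from`..`to` from the backing file.
read_columns <- function(path, n, from, to) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Skip the bytes that belong to the columns before `from`.
  seek(con, where = (from - 1) * n * 8, origin = "start")
  values <- readBin(con, what = "double", n = n * (to - from + 1L), size = 8L)
  matrix(values, nrow = n)
}

# Process the matrix in chunks of 6 columns, so that only one chunk
# is held in RAM at any time.
chunk_size <- 6L
starts <- seq(1L, p, by = chunk_size)
chunked_means <- unlist(lapply(starts, function(from) {
  to <- min(from + chunk_size - 1L, p)
  colMeans(read_columns(backing_file, n, from, to))
}))

# The chunked result agrees with the in-memory computation.
all.equal(chunked_means, colMeans(X))
```

Because each chunk is processed independently, the lapply() call could be swapped for parallel::mclapply() on systems that support forking, which is one simple way to combine chunking with multi-core computing.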
The BGData package is an umbrella package that comprises several packages: BEDMatrix, LinkedMatrix, and symDMatrix. It features scalable and efficient computational methods for large genomic datasets, such as running genome-wide association studies (GWAS) or computing genomic relationship matrices (G matrix). It also contains a container class called BGData that holds genotypes, sample information, and variant information.
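To make the idea of a scalable G matrix computation concrete, here is a base R sketch that accumulates G over chunks of variants. It assumes one common definition, G = ZZ'/p with Z the column-centered and column-scaled genotype matrix; this is an illustrative assumption and not necessarily what the package computes by default.

```r
# Simulated genotypes: n samples x p variants, coded 0/1/2.
set.seed(42)
n <- 30L
p <- 40L
X <- matrix(as.double(sample(0:2, n * p, replace = TRUE)), nrow = n, ncol = p)

# Accumulate Z Z' chunk by chunk so that only `chunk_size` columns
# need to be centered and scaled in memory at once.
chunk_size <- 12L
G <- matrix(0, nrow = n, ncol = n)
for (from in seq(1L, p, by = chunk_size)) {
  to <- min(from + chunk_size - 1L, p)
  Z <- scale(X[, from:to, drop = FALSE]) # center and scale each variant
  G <- G + tcrossprod(Z)                 # add this chunk's contribution
}
G <- G / p

# Same result as the all-in-memory computation.
all.equal(G, tcrossprod(scale(X)) / p)
```

With a file-backed matrix in place of X, the subsetting step X[, from:to] would be the only point at which genotypes are read from disk, which is what keeps the memory footprint bounded by the chunk size.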
See BEDMatrix-package, LinkedMatrix-package, and symDMatrix-package for an introduction to the respective packages. See file-backed-matrices for more information on file-backed matrices. See multi-level-parallelism for more information on multi-level parallelism.