GWASTools (version 1.18.0)

gdsSubset: Write a subset of data in a GDS file to a new GDS file

Description

gdsSubset takes a subset of data (SNPs and samples) from a GDS file and writes it to a new GDS file. gdsSubsetCheck checks that a GDS file is the desired subset of another GDS file.

Usage

gdsSubset(parent.gds, sub.gds, sample.include=NULL, snp.include=NULL, sub.storage=NULL, compress="ZIP_RA", block.size=5000, verbose=TRUE)
gdsSubsetCheck(parent.gds, sub.gds, sample.include=NULL, snp.include=NULL, sub.storage=NULL, verbose=TRUE)

Arguments

parent.gds
Name of the parent GDS file
sub.gds
Name of the subset GDS file
sample.include
Vector of sampleIDs to include in sub.gds
snp.include
Vector of snpIDs to include in sub.gds
sub.storage
Storage type for the subset file; defaults to the original storage type.
compress
The compression level for variables in a GDS file (see add.gdsn for options).
block.size
For GDS files stored with scan,snp dimensions, the number of SNPs to read from the parent file at a time. Ignored for snp,scan dimensions.
verbose
Logical value specifying whether to show progress information.
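
The effect of block.size can be illustrated with a small base R sketch. This is not the GWASTools implementation: the function subset_in_blocks is hypothetical, an in-memory matrix stands in for a GDS node stored with scan,snp dimensions, and the sketch assumes SNPs are selected by column index in ascending order.

```r
# Hypothetical helper (not part of GWASTools): copy a subset of a
# scan x snp matrix by reading block.size SNP columns at a time.
subset_in_blocks <- function(geno, scan.keep, snp.keep, block.size = 5000) {
  n.snp <- ncol(geno)
  sub <- matrix(NA_integer_, nrow = length(scan.keep), ncol = length(snp.keep))
  n.written <- 0
  for (start in seq(1, n.snp, by = block.size)) {
    end <- min(start + block.size - 1, n.snp)
    # "Read" one block of SNPs for all samples
    block <- geno[, start:end, drop = FALSE]
    # Positions within this block that belong to the requested subset
    keep <- which((start:end) %in% snp.keep)
    if (length(keep) > 0) {
      cols <- n.written + seq_along(keep)
      sub[, cols] <- block[scan.keep, keep, drop = FALSE]
      n.written <- n.written + length(keep)
    }
  }
  sub
}

# Toy data: 4 samples (rows) by 12 SNPs (columns), genotypes coded 0/1/2
set.seed(1)
geno <- matrix(sample(0:2, 4 * 12, replace = TRUE), nrow = 4, ncol = 12)
sub <- subset_in_blocks(geno, scan.keep = c(1, 4),
                        snp.keep = c(2, 3, 7, 8, 9, 12), block.size = 5)
dim(sub)
```

A smaller block.size keeps less of the parent data in memory at once at the cost of more read operations; the result should be the same regardless of the block size chosen.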

Details

gdsSubset can select a subset of snps for all samples by setting snp.include, a subset of samples for all snps by setting sample.include, or a subset of snps and samples with both arguments. The GDS nodes "snp.id", "snp.position", "snp.chromosome", and "sample.id" are copied, as well as any 2-dimensional nodes. Other nodes are not copied. The attributes of the 2-dimensional nodes are also copied to the subset file.

If sub.storage is specified, the subset gds file will have a different storage mode for any 2-dimensional array. In the special case where the 2-dimensional node has an attribute named "missing.value" and the sub.storage type is "bit2", the missing.value attribute for the subset node is automatically set to 3. At this point, no checking is done to ensure that the values will be properly stored with a different storage type, but gdsSubsetCheck will return an error if the values do not match.

If the nodes in the GDS file are stored with scan,snp dimensions, then block.size allows you to loop over a block of SNPs at a time. If the nodes are stored with snp,scan dimensions, then the function simply loops over samples, one at a time.

gdsSubsetCheck checks that a subset GDS file has the expected SNPs and samples of the parent file. It also checks that attributes were similarly copied, except for the above-mentioned special case of missing.value for sub.storage="bit2".
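
Because gdsSubset does not itself check that values fit the new storage type, it may be worth testing the data before choosing sub.storage="bit2". The sketch below is an assumption-laden illustration in base R: fits_bit2 is a hypothetical helper, not a GWASTools function, and it relies only on the fact that a 2-bit value can hold 0 through 3, with 3 used as the missing value as described above.

```r
# Hypothetical helper (not part of GWASTools): TRUE if every value can be
# stored in 2 bits once missing values are recoded to missing.value.
fits_bit2 <- function(x, missing.value = 3) {
  x[is.na(x)] <- missing.value
  all(x %in% 0:3)
}

fits_bit2(c(0, 1, 2, NA, 2))  # TRUE: genotypes 0/1/2 plus missing coded as 3
fits_bit2(c(0, 1, 4))         # FALSE: 4 does not fit in 2 bits
```

A check of this kind only screens values before writing; gdsSubsetCheck remains the way to confirm that the written subset matches the parent file.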

See Also

gdsfmt, createDataFile

Examples

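library(GWASTools)

# Open the example genotype file shipped with the GWASdata package
gdsfile <- system.file("extdata", "illumina_geno.gds", package="GWASdata")
gds <- GdsGenotypeReader(gdsfile)
# Select the first 10 samples and the first 100 SNPs by ID
sample.sel <- getScanID(gds, index=1:10)
snp.sel <- getSnpID(gds, index=1:100)
close(gds)

# Write the subset to a temporary file, then verify it against the parent file
subfile <- tempfile()
gdsSubset(gdsfile, subfile, sample.include=sample.sel, snp.include=snp.sel)
gdsSubsetCheck(gdsfile, subfile, sample.include=sample.sel, snp.include=snp.sel)

# Clean up the temporary subset file
file.remove(subfile)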
