plinkToNcdf: Create a netCDF file and annotation suitable for use in GWASTools from PLINK files

Description

plinkToNcdf creates a netCDF file and scan and SNP annotation objects from a set of ped and map files. This function is deprecated, use snpgdsBED2GDS instead.

Usage

plinkToNcdf(pedFile, mapFile, nSamples, ncdfFile, snpAnnotFile, scanAnnotFile, ncdfXchromCode=23, ncdfXYchromCode=24, ncdfYchromCode=25, ncdfMchromCode=26, ncdfUchromCode=27, pedMissingCode=0, verbose=TRUE)

Arguments

pedFile

PLINK ped file.

mapFile

PLINK map file. Columns should be chromosome, rsID, map distance (not used, but included in output annotation), and base-pair position. If this is an extended map file (.bim), columns 5 and 6 will be used to encode allele A and allele B.

nSamples

Number of samples in the ped file.

ncdfFile

Output netCDF file.

snpAnnotFile

Output .RData file for storing a SnpAnnotationDataFrame.

scanAnnotFile

Output .RData file for storing a ScanAnnotationDataFrame.

ncdfXchromCode

Integer value used to represent the X chromosome in the netCDF file. Values of "X" or "23" in the map file are converted to this code.

ncdfXYchromCode

Integer value used to represent the pseudoautosomal region of the X and Y chromsomes in the netCDF file. Values of "XY" or "25" in the map file are converted to this code.

ncdfYchromCode

Integer value used to represent the Y chromosome in the netCDF file. Values of "Y" or "24" in the map file are converted to this code.

ncdfMchromCode

Integer value used to represent mitochondrial SNPs in the netCDF file. Values of "MT" or "26" in the map file are converted to this code.

ncdfUchromCode

Integer value used to represent unknown chromosome in the netCDF file. Any values in the map file not in (1:26, "X", "Y", "XY", "MT") are converted to this code.

pedMissingCode

Missing genotype code in the ped file.

verbose

logical for whether to show progress information.

Details

The netCDF file stores genotype data in byte format, so the PLINK genotype is converted to number of A alleles (0, 1, 2, or missing). The definitions of A and B alleles may be provided in the map file (column 5=A, column 6=B). Otherwise, A and B definitions will be based on the order alleles are encountered in the ped file. (Note that converting between ped/map format and bed/bim/fam format in PLINK will not always preserve the order of chromosomes, so use caution when matching a bim file to a ped file!) The first six columns of the ped file will be converted to a ScanAnnotationDataFrame. If the Individual ID (second column of the ped file) contains unique integers, then this column will be used for scanID. Otherwise, an integer vector of scanID will be generated as 1:nSamples. This ID is used to index scans in the netCDF file.

The map file will be converted to a SnpAnnotationDataFrame. This SNP annotation will include the definitions of A and B alleles in the netCDF file (either as provided or determined from the data as described above). A unique integer snpID will be generated for each SNP, which is used to index SNPs in the netCDF file. Note that the default values of ncdfXYchromCode=24, ncdfYchromCode=25, and ncdfUchromCode=27 correspond to the default chromosome codes for NcdfGenotypeReader and SnpAnnotationDataFrame, and are different from the values used by PLINK (Y=24, XY=25, U=0). If the netCDF file is created with different chromosome codes by specifying these arguments, one must also specify the chromosome codes when opening the file, e.g. NcdfGenotypeReader(ncdfFile, XYchromCode=25, YchromCode=24). nSamples is used to allocate space in the netCDF file. A warning will be issued if the number of lines read in the ped file is different from this number.

References

Please see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped for more information on PLINK files.

Examples

Run this code

library(GWASdata)
pedfile <- system.file("extdata", "illumina_subj.ped", package="GWASdata")
mapfile <- system.file("extdata", "illumina_subj.map", package="GWASdata")
ncfile <- tempfile()
scanfile <- tempfile()
snpfile <- tempfile()
plinkToNcdf(pedfile, mapfile, nSamples=43, ncdfFile=ncfile,
   snpAnnotFile=snpfile, scanAnnotFile=scanfile)

nc <- NcdfGenotypeReader(ncfile)
scanAnnot <- getobj(scanfile)
snpAnnot <- getobj(snpfile)
genoData <- GenotypeData(nc, scanAnnot=scanAnnot, snpAnnot=snpAnnot)
prefix <- sub(".ped", "", pedfile, fixed=TRUE)
log <- tempfile()
stopifnot(plinkCheck(genoData, prefix, log))
close(genoData)

# provide allele coding with extended map file
# .bim might have SNPs in different order than .map
bimfile <- system.file("extdata", "illumina_subj.bim", package="GWASdata")
bim <- read.table(bimfile, as.is=TRUE, header=FALSE)
map <- read.table(mapfile, as.is=TRUE, header=FALSE)
snp.match <- match(map[,2], bim[,2])
map <- cbind(map, bim[snp.match, 5:6])
mapfile.ext <- tempfile()
write.table(map, file=mapfile.ext, quote=FALSE, row.names=FALSE, col.names=FALSE)
# use chromosome codes that match PLINK
plinkToNcdf(pedfile, mapfile, nSamples=43, ncdfFile=ncfile,
   snpAnnotFile=snpfile, scanAnnotFile=scanfile,
   ncdfYchromCode=24, ncdfXYchromCode=25)

# must specify different chromosome codes in NcdfGenotypeReader
# appending "L" ensures the codes are integers, as required
nc <- NcdfGenotypeReader(ncfile, YchromCode=24L, XYchromCode=25L)
scanAnnot <- getobj(scanfile)
snpAnnot <- getobj(snpfile)
genoData <- GenotypeData(nc, scanAnnot=scanAnnot, snpAnnot=snpAnnot)
stopifnot(plinkCheck(genoData, prefix, log))
close(genoData)

file.remove(ncfile, scanfile, snpfile, log, mapfile.ext)