The fgls()
function is used for parameter estimation.
The arguments to gls.batch()
may be regarded as belonging to four groups:
gls.batch(phenfile,genfile,pedifile,covmtxfile.in=NULL,theta=NULL,
  snp.names=NULL,input.mode=c(1,2,3),pediheader=FALSE,
  pedicolname=c("FAMID","ID","PID","MID","SEX"),
  sep.phe=" ",sep.gen=" ",sep.ped=" ",
  phen,covars=NULL,med=c("UN","VC"),
  outfile,col.names=TRUE,return.value=FALSE,
  covmtxfile.out=NULL,covmtxparams.out=NULL,
  sizeLab=NULL,Mz=NULL,Bo=NULL,Ad=NULL,Mix=NULL,indobs=NULL)
NULL
, in which case no SNPs are analyzed, and gls.batch()
conducts a single fgls()
regression of the phenotype onto an intercept and covariates (if any). Otherwise, this argument can be either (1) a character string specifying a genotype file of genotype scores (such as 0,1,2, for the additive genetic model) to be read from disk, or (2) a data frame object containing them. In such a file, each row must represent a SNP, each column must represent a subject, and there should NOT be column headers or row numbers. In such a data frame, the reverse holds: each row must represent a subject, and each column, a SNP (e.g. geno
). If the data frame--say, geno
--needs to be transposed, then use genfile=data.frame(t(geno))
. Using a matrix instead of a data frame is not recommended, because it makes the process of merging data very memory-intensive, and will likely overflow R's workspace unless the sample size or number of SNPs is quite small.
Note that genotype scores need not be integers; they can also be numeric. So, gls.batch()
can be used to analyze imputed dosages, etc.
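As a minimal sketch of the orientation rules above, the toy example below builds a small SNP-by-subject matrix (the layout expected of a genotype file on disk) and transposes it into the subject-by-SNP layout expected of a data frame. The object names `geno_file_layout` and `geno_df` are illustrative assumptions and are not part of the package.

```r
# Toy genotype scores: 3 SNPs (rows) by 4 subjects (columns),
# i.e. the layout a genotype file on disk is expected to have.
geno_file_layout <- matrix(
  c(0,   1,   2,   1,
    2,   2,   0,   1,
    0.2, 1.7, 1.0, 0.0),   # non-integer dosages are allowed
  nrow = 3, byrow = TRUE
)

# A data frame passed as genfile needs the reverse orientation:
# one row per subject, one column per SNP.
geno_df <- data.frame(t(geno_file_layout))

dim(geno_df)  # subjects by SNPs
```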
"ID"
, ordered in the same order as subjects' genotypic data in genfile. Every row in pedifile is matched to a participant in genfile. That is, if reading files from disk (which is recommended), each row i of the pedigree file, which has n rows, matches column i of the genotype file, which has n columns. This is how the program matches subjects in the phenotype file to their genotypic data. The pedigree file or data frame can also include other columns of pedigree information, like father's ID, mother's ID, etc. Argument pediheader (see below) is an indicator of whether the pedigree file on disk has a header or not, with default as FALSE
. Argument pedicolname (see below) gives the names that gls.batch()
will assign to the columns of pedifile, and the default, c("FAMID","ID","PID","MID","SEX")
, is the familiar "pedigree table" format. In any event, the user's input must somehow provide the program with a column of IDs, labeled as "ID"
.
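The row-to-column matching described above can be checked before running an analysis. The following is a rough sketch using toy data; `pedigree_toy` and `geno_toy` are hypothetical objects, and the check is something a user might run beforehand, not something the package is claimed to do in this way.

```r
# Toy pedigree table in the default five-column layout.
pedigree_toy <- data.frame(
  FAMID = c(1, 1, 2, 2),
  ID    = c(101, 102, 201, 202),
  PID   = c(0, 0, 0, 0),
  MID   = c(0, 0, 0, 0),
  SEX   = c(1, 2, 1, 2)
)

# Toy genotype file contents: 3 SNPs (rows) by 4 subjects (columns).
geno_toy <- matrix(c(0, 1, 2, 1,
                     2, 2, 0, 1,
                     1, 0, 0, 2), nrow = 3, byrow = TRUE)

# Row i of the pedigree table must describe the subject in column i
# of the genotype file, so the two counts have to agree.
stopifnot(nrow(pedigree_toy) == ncol(geno_toy))

# Labeling the genotype columns with the IDs makes the pairing explicit.
colnames(geno_toy) <- pedigree_toy$ID
```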
NULL
, then gls.batch()
will estimate this matrix. The file to be read in must be a single column, with a header, containing the contents of the 'blocks' of an object of class bdsmatrix
; no other file structures are presently compatible. If covmtxfile.in is an actual matrix object, then using one of class bdsmatrix
is a virtual requirement. See below under "Details" for more information.
NULL
, in which case it is ignored. Otherwise, it must be a numerical vector of either length 12 if med="UN"
, or of length 3 if med="VC"
. Each vector element provides the value for the parameter corresponding to its index (serial position). Values of NA
are accepted for extraneous parameters. See fgls()
, under "Details," for which parameters correspond to which indices. Note that at least one of covmtxfile.in and theta must be NULL
.
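The length rules for theta can be expressed as a small pre-flight check. The helper `check_theta()` below is hypothetical (it is not part of RFGLS) and only restates the constraints given above: length 12 for med="UN", length 3 for med="VC", and at least one of covmtxfile.in and theta being NULL.

```r
# Hypothetical helper (not part of RFGLS): checks a candidate theta
# against the length rules described in the text.
check_theta <- function(theta, med = c("UN", "VC"), covmtxfile.in = NULL) {
  med <- match.arg(med)
  if (is.null(theta)) return(TRUE)            # NULL means theta is ignored
  if (!is.null(covmtxfile.in)) {
    stop("at least one of covmtxfile.in and theta must be NULL")
  }
  expected <- if (med == "UN") 12L else 3L
  if (!is.numeric(theta) || length(theta) != expected) {
    stop("theta must be a numeric vector of length ", expected,
         " when med=\"", med, "\"")
  }
  TRUE
}

check_theta(c(0.5, 0.2, 0.3), med = "VC")         # three variance components
check_theta(c(rep(0.1, 8), rep(1, 3), NA), "UN")  # NA for an extraneous parameter
```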
NULL
, in which case generic SNP names are used. Ignored if genfile is NULL
.
gls.batch()
where to look for the family-structure variables "FTYPE"
and "INDIV"
(see below, under "Details"). By default, gls.batch()
first looks in the phenotype file, and if the variables are not found there, then looks in the pedigree file, and if the variables are not there, attempts to create them from information available in the pedigree file, via FSV.frompedi()
. If input.mode=2
, then gls.batch()
skips looking in the phenotype file, and begins by looking in the pedigree file. If input.mode=3
, then gls.batch()
skips looking in the phenotype file and pedigree file, and goes straight to FSV.frompedi()
.
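The search order implied by input.mode can be sketched as follows. This is an illustration only: `find_fsv_source()` is a hypothetical function, the actual internals of gls.batch() may differ, and treating the default vector c(1,2,3) as equivalent to mode 1 is an assumption based on the default behavior described above.

```r
# Hypothetical illustration of the search order for "FTYPE" and "INDIV".
find_fsv_source <- function(phen_df, pedi_df, input.mode = c(1, 2, 3)) {
  mode <- input.mode[1]        # assumption: the default vector acts as mode 1
  fsv  <- c("FTYPE", "INDIV")
  if (mode == 1 && all(fsv %in% names(phen_df))) return("phenotype file")
  if (mode <= 2 && all(fsv %in% names(pedi_df))) return("pedigree file")
  "FSV.frompedi()"             # fall back to constructing the variables
}

phen_toy <- data.frame(ID = 1:2, FAMID = 1, FTYPE = 4, INDIV = 1:2)
pedi_toy <- data.frame(FAMID = 1, ID = 1:2)

find_fsv_source(phen_toy, pedi_toy)                  # variables found in phenotype data
find_fsv_source(phen_toy, pedi_toy, input.mode = 3)  # skips both lookups
```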
TRUE
, gls.batch()
assigns the values in pedicolname to the column names after the pedigree file has been read in. Defaults to FALSE
. Also see pedifile above, and under "Details" below.
gls.batch()
will assign to the columns of the pedigree file (starting with the first column and moving left to right). The default, c("FAMID","ID","PID","MID","SEX")
, is the familiar "pedigree table" format. This vector must satisfy two criteria: (1) it must assign the name "ID" to the column of subject IDs in the pedigree file, and (2) its length must not exceed the number of columns of the pedigree file. If its length is less than the number of columns, columns to which it does not assign a name are discarded. Also see pedifile above, and under "Details" below.
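The naming and discarding behavior described for pedicolname can be mimicked in base R. The sketch below uses a hypothetical six-column table `pedi_raw`; it illustrates the described outcome and does not reproduce the package's own code.

```r
# Toy pedigree data as read from a headerless file: six unnamed columns.
pedi_raw <- data.frame(
  V1 = c(1, 1), V2 = c(101, 102), V3 = c(0, 0),
  V4 = c(0, 0), V5 = c(1, 2),     V6 = c("a", "b")
)

pedicolname <- c("FAMID", "ID", "PID", "MID", "SEX")

# Both criteria from the text: an "ID" label, and no more names than columns.
stopifnot("ID" %in% pedicolname, length(pedicolname) <= ncol(pedi_raw))

# Name the leading columns left to right; columns without a name are dropped.
pedi_named <- pedi_raw[, seq_along(pedicolname), drop = FALSE]
names(pedi_named) <- pedicolname
```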
NULL
, in which case no covariates are included.
"UN"
or "VC"
, which are the two RFGLS methods described by Li et al. (2011). If "UN"
(default), which stands for "unstructured," the residual covariance matrix will be constructed from, at most, 12 parameters (8 correlations and 4 variances). If "VC"
, which stands for "variance components," the residual covariance matrix will be constructed from, at most, 3 variance components (additive-genetic, shared-environmental, and unshared-environmental). For more information, see fgls()
.
NULL
, in which case no output file is written. The output file contains the SNP analysis results, so argument outfile is ignored if genfile is NULL
. Note that gls.batch()
will not simultaneously accept outfile=NULL
and return.value=FALSE
. Users are warned that if a file with the same path and filename already exists, gls.batch()
will overwrite it!
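Because an existing file is overwritten without confirmation, a caller may wish to check the path first. A minimal sketch follows; `safe_outfile()` is a hypothetical helper, not part of RFGLS, and the same idea would apply to the other output-file arguments.

```r
# Hypothetical guard (not part of RFGLS): refuse to reuse an existing path.
safe_outfile <- function(path) {
  if (file.exists(path)) {
    stop("refusing to overwrite existing file: ", path)
  }
  path
}

out_path <- tempfile(fileext = ".txt")  # a fresh path that does not exist yet
safe_outfile(out_path)                  # returns the path unchanged
```

The returned path could then be supplied as outfile.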
TRUE
.
gls.batch()
should actually return a value. Defaults to FALSE
, in which case the function merely returns NULL
. If TRUE
and a non-NULL
value was supplied to genfile, the function returns a data frame containing the results of the SNP analyses (i.e., the output file as a data frame). If TRUE
and genfile=NULL
, the function returns the fgls()
output from a regression of the phenotype onto an intercept and covariates (if any). Note that gls.batch()
will not simultaneously accept outfile=NULL
and return.value=FALSE
.
covmtxfile.in=NULL
), will be written. The default is NULL
, in which case no such file is written to disk. See below under "Details" for more information. Users are warned that if a file with the same path and filename already exists, gls.batch()
will overwrite it!
NULL
), will be written. The default is NULL
, in which case no such file is written to disk. See below under "Details" for more information. Users are warned that if a file with the same path and filename already exists, gls.batch()
will overwrite it!
NULL
; otherwise, must be a character string, and if the number of characters in the string is not equal to the size of the largest family in the data, gls.batch()
will produce a warning.
NULL
(which is the default), the check corresponding to that family type is skipped.
indobs=NULL
, which is the default, this check is skipped.
return.value=FALSE
, then gls.batch()
simply returns NULL
. If return.value=TRUE
and genfile=NULL
, then gls.batch()
returns the fgls()
output from a regression of the phenotype onto an intercept and covariates (if any). If return.value=TRUE
and genfile is non-NULL
, then gls.batch()
returns a data frame containing the results of the single-SNP analyses, 1 row per SNP. Specifically, this data frame includes the following named columns:
snp
(character): the names of the SNPs; equal to snp.names if any were supplied.
coef
(numeric): the regression coefficients of the SNPs.
se
(numeric): estimated standard errors of SNPs' regression coefficients.
t.stat
(numeric): t-statistics, i.e. regression coefficients divided by their estimated standard errors.
df
(integer): degrees-of-freedom (see df.residual
, from fgls()
output).
pval
(numeric): two-tailed p-values, from corresponding t-statistics and degrees-of-freedom.
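The relationship among the coef, se, t.stat, df, and pval columns can be reproduced with base R. The numbers below are made-up toy values, not output from the package.

```r
# Toy values standing in for two rows of the results data frame.
res <- data.frame(
  snp  = c("rs1", "rs2"),
  coef = c(0.30, -0.05),
  se   = c(0.10, 0.08),
  df   = c(500L, 500L)
)

# t-statistic: regression coefficient divided by its estimated standard error.
res$t.stat <- res$coef / res$se

# Two-tailed p-value from the t distribution with df degrees of freedom.
res$pval <- 2 * pt(abs(res$t.stat), df = res$df, lower.tail = FALSE)

res
```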
gls.batch()
also has optional side effects of writing as many as three files to disk, depending on arguments outfile, covmtxfile.out, and covmtxparams.out. Note that if a file is written for outfile, that file will contain the single-SNP analysis results described above.
Reference is frequently made throughout this documentation to the "phenotype file," the "genotype file," and so forth, because gls.batch()
was intended to be used with potentially large datafiles to be read from disk. This should be evident from the presence of the word "file" in the names of many of this function's arguments, and the fact that all of those arguments may be character strings providing a filename and path. However, it can also accept the data if the file has already been loaded into R's workspace as a data frame object, in which case "the [whatever] file" should be taken to refer to such a data frame. For details specific to each argument, see above.
The function gls.batch()
first reads in the files and merges them into a data frame with columns of family-structure information, phenotype, covariates, and genotypes. Then, it creates a tlist vector and a sizelist vector, which comprise the family labels and family sizes in the data. Finally, it carries out single-SNP association analyses for all the SNPs in the genotype file.
At the bare minimum, the phenotype file must contain columns named "ID"
, "FAMID"
, and whatever character string is supplied to phen. These columns respectively contain individual IDs, family IDs, and phenotype scores; individual IDs must be unique.
At the bare minimum, the pedigree file need only contain a column consisting of unique individual IDs, corresponding to the label "ID"
in pedicolname. The number of participants in the pedigree file must equal the number of participants in the genotype file, with participants ordered the same way in both files. However, the default value for argument pedicolname (see above) assumes five columns, in the familiar "pedigree table" format.
The phenotype file or pedigree file may also contain the two key family-structure variables, "FTYPE"
(family-type) and "INDIV"
(individual code). If both contain these variables, then by default, they are read from the phenotype file (but see argument input.mode above). There are six recognized family types, which are distinguished primarily by how the offspring in the family are related to one another:
FTYPE=1
, containing MZ twins;
FTYPE=2
, containing DZ twins;
FTYPE=3
, containing adoptees;
FTYPE=4
, containing non-twin full siblings;
FTYPE=5
, "mixed" families containing one biological offspring and one adoptee;
FTYPE=6
, containing "independent observations" who do not fit into a four-person nuclear family.
It is assumed that all offspring except adoptees are biological children of the parents in the family. The four individual codes are:
INDIV=1
is for "Offspring #1;"
INDIV=2
is for "Offspring #2;"
INDIV=3
is for mothers;
INDIV=4
is for fathers.
The distinction between "Offspring #1" and "#2" is mostly arbitrary, except that in "mixed" families (FTYPE=5
), the biological offspring MUST have INDIV=1
, and the adopted offspring, INDIV=2
. If the phenotype file contains variables "FTYPE"
and "INDIV"
, it should be ordered by family ID ("FAMID"
), and by individual code "INDIV"
within family ID. Note that gls.batch()
treats participants with FTYPE=6
as the sole members of their own family units, and not as part of the family corresponding to their family ID.
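The required ordering (by family ID, then by individual code within family) can be obtained with `order()`. The data frame `phen_toy` below is a hypothetical example with one MZ-twin family and one "mixed" family.

```r
# Toy phenotype rows in arbitrary order.
phen_toy <- data.frame(
  FAMID = c(2, 1, 1, 2, 1, 1),
  ID    = c(21, 13, 11, 22, 14, 12),
  FTYPE = c(5, 1, 1, 5, 1, 1),
  INDIV = c(2, 3, 1, 1, 4, 2)   # in the FTYPE=5 family, INDIV=1 is the biological offspring
)

# Sort by family ID, then by individual code within family.
phen_sorted <- phen_toy[order(phen_toy$FAMID, phen_toy$INDIV), ]
rownames(phen_sorted) <- NULL

phen_sorted
```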
If neither the phenotype file nor the pedigree file contains "FTYPE"
and "INDIV"
, gls.batch()
will construct them via FSV.frompedi()
.
When one is conducting parallel analyses on a computing array, judicious use of arguments covmtxfile.in, theta, covmtxparams.out, and covmtxfile.out can save time. For example, suppose one is analyzing different SNP sets in parallel but using a common phenotype file for all. In this case, one could calculate the residual covariance matrix ahead of time and write it to a file. Then, use the same filename and path for argument covmtxfile.in, for all jobs running in parallel. The matrix can be calculated by using gls.batch.get()
and then fgls()
. One could similarly obtain the residual-covariance parameters ahead of time, and supply them as a vector to theta in all jobs running in parallel.
fgls
, pheno
data(pheno)
data(geno)
data(map)
data(pedigree)
data(rescovmtx)
minigwas <- gls.batch(
  phenfile=pheno,genfile=data.frame(t(geno)),pedifile=pedigree,
  covmtxfile.in=rescovmtx, #<--Precomputed, to save time.
  theta=NULL,snp.names=map[,2],input.mode=c(1,2,3),pediheader=FALSE,
  pedicolname=c("FAMID","ID","PID","MID","SEX"),
  sep.phe=" ",sep.gen=" ",sep.ped=" ",
  phen="Zscore",covars="IsFemale",med=c("UN","VC"),
  outfile=NULL,col.names=TRUE,return.value=TRUE,
  covmtxfile.out=NULL,covmtxparams.out=NULL,
  sizeLab=NULL,Mz=NULL,Bo=NULL,Ad=NULL,Mix=NULL,indobs=NULL)
minigwas