fgls: Feasible Generalized Least Squares regression with family GWAS data.

Description

Jointly estimates the fixed-effects coefficients and residual variance-covariance matrix in a generalized least squares model by minimizing the (multivariate-normal) negative loglikelihood function, via optim() in the R base distribution. The residual variance-covariance matrix is block-diagonal sparse, constructed with bdsmatrix() from the bdsmatrix package.

Usage

fgls(fixed, data=parent.frame(), tlist=tlist, sizelist=sizelist, med=c("UN","VC"), vmat=NULL, start=NULL, theta=NULL, drop=NULL, get.hessian=FALSE, optim.method="BFGS", control=list(), weights=NULL, sizeLab=NULL,Mz=NULL,Bo=NULL,Ad=NULL,Mix=NULL,indobs=NULL)

Arguments

fixed

An object of class 'formula' (or one that can be coerced to that class): a symbolic description of the regression model to be fitted. The RHS of the formula contains the fixed effects of the model.

data

An optional data frame, list or environment (or object coercible by as.data.frame() to a data frame) containing the variables in the model, as specified in argument fixed. If not found in data the variables are taken from environment(formula), typically the environment from which fgls() is called. If it contains a column named "ID", then that column's values will be the row and column names of output element sigma (see below, under "Value").

tlist

The character vector of the family labels ("famlab") in the data. The length of the vector equals the number of family units. It should be ordered in the same order as the families appear in the data. Object tlist is created by the gls.batch() and gls.batch.get() functions.

sizelist

The integer vector of the family sizes in the data. The length of the vector equals the number of family units. It should be ordered in the same order as the families appear in the data. Object sizelist is created by the gls.batch() and gls.batch.get() functions.

med

A character string, either "UN" or "VC", which are the two RFGLS methods described by Li et al. (2011). If "UN" (default), which stands for "unstructured," the residual covariance matrix will be constructed from, at most, 12 parameters (8 correlations and 4 variances). If "VC", which stands for "variance components," the residual covariance matrix will be constructed from, at most, 3 variance components (additive-genetic, shared-environmental, and unshared-environmental).

vmat

The previously estimated (or known) residual covariance matrix (for conducting Rapid FGLS). If it is NULL (default), the matrix will either be jointly estimated with the fixed-effects coefficients (FGLS), or will be constructed values supplied to theta. If not NULL, must be either (1) an object of class 'bdsmatrix' (from the bdsmatrix package), or (2) a character string specifying the filename and path for a single-column text file, with header, containing the "blocks" of a bdsmatrix. Note that at least one of vmat and theta must be NULL.

start

A numeric vector of initial values for the residual-covariance parameters. If NULL (default), generic start values are used. Otherwise, it must be a numerical vector of either length 12 if med="UN", or of length 3 if med="VC". Each vector element provides the initial value for the parameter corresponding to its index (serial position). Values of NA are accepted, and will be replaced with the generic start value for that parameter. See below under "Details" for which parameters correspond to which indices. Users should bear in mind that especially poor start values can cause optimization to fail. Ignored if no residual-covariance parameters are estimated.

theta

A numeric vector of previously estimated (or known) residual-covariance parameters. Defaults to NULL, in which case it is ignored. Otherwise, it must be a numerical vector of of either length 12 if med="UN", or of length 3 if med="VC". Each vector element provides the value for the parameter corresponding to its index (serial position). Values of NA are accepted for extraneous parameters. See below under "Details" for which parameters correspond to which indices. The residual covariance matrix is constructed from the elements of theta, exactly as-is. Note that at least one of vmat and theta must be NULL.

drop

An integer vector of indices (serial positions) specifying which residual-covariance parameters to drop. Dropped parameters are not estimated. In addition to those specified by drop, fgls() automatically identifies which parameters are completely unidentified from the data (i.e., zero observations in the data are informative about them), and drops them as well. If drop is NULL (default), no user-specified parameters are dropped. Otherwise, it must be a vector of integers, either between 1 and 12 (inclusive) if med="UN", or between 1 and 2 (inclusive) if med="VC". Note that if a user-specified-dropped parameter ends up being needed to construct the residual covariance matrix, its value is taken to be that of its OLS equivalent: zero for correlations (med="UN") and for the familial variance components (med="VC"), and the OLS residual variance for variances (med="UN"). It may be prudent to drop parameters when very few observations in the data are informative about them, which can at least save computation time. See below under "Details" for which parameters correspond to which indices. Ignored if no residual-covariance parameters are estimated.

get.hessian

Logical; default is FALSE. If TRUE, fgls() will include the Hessian matrix from optim() in its output list. Otherwise, the entry 'hessian' in the list will be NULL. Ignored if no residual-covariance parameters are estimated.

optim.method

Character string, passed as method to optim(). Ignored if no residual-covariance parameters are estimated. The default, "BFGS", is usually fast and is recommended for general use. If method "L-BFGS-B" is used, fgls() will supply optim() with reasonable box constraints on the parameters, intended for use with optim()'s default control parameters (see argument control below).

Method "BFGS" (the default) may fail when any of the residual-covariance parameters are poorly identified from the data. In these cases, it may be wise simply to drop the offending parameters. Other optimization methods, including "L-BFGS-B", can succeed where "BFGS" fails. Method "SANN" should not generally be relied upon to find the global optimum, but it can sometimes produce reasonable, approximate solutions in instances where no other method works. As a last-resort diagnostic, one can combine optim.method="SANN" with hessian=TRUE, since the resulting Hessian matrix may provide clues as to which parameters are causing problems.

control

A list of control parameters passed to optim(), intended for advanced users. The default is also optim()'s default, which should be adequate for general use.

weights

A numeric vector of weights, with length equal to the number of observations in the data. Defaults to NULL.

sizeLab

This is an optional argument, and may be eliminated in future versions of this package. Defaults to NULL; otherwise, must be a character string. If the number of characters in the string is not equal to the size of the largest family in the data, fgls() will produce a warning.

Mz, Bo, Ad, Mix, indobs

These arguments are deprecated, and their values are ignored. They are retained in this package version for legacy reasons, but will be eliminated in future versions.

Value

ctable

Table of coefficients reminiscent of output from summary.lm(). Each fixed-effect term (including the intercept) has one row of the table, which are ordered as the terms appear in argument fixed. Each row contains a point estimate, an estimated standard error, a t-statistic, and a two-tailed p-value.

Rsqd

The generalized-least-squares coefficient of determination, a la Buse (1973).

estimates

The vector of MLEs of the parameters used to construct the residual covariance matrix, ordered as in the lists above, under "Details." Dropped parameters are given value NA. If no residual-covariance parameters are estimated, will instead be a single NA.

drop

A vector of parameter indices, representing which residual-covariance parameters were dropped (not estimated). See above, under "Details," for which parameters correspond to which indices. NULL if no parameters were dropped or estimated.

iter

NULL if no residual-covariance parameters were estimated. Otherwise, a single-row data frame, containing miscellaneous output pertaining to the optimization, specifically, the following named columns:

iterations (integer): the number of function iterations, as returned from optim().
convergence (integer): convergence code, as returned from optim(); value 0 means that convergence was successful.
message (character): additional information from the optimizer; a single whitespace means that optim() returned a message of NULL.
first_try (logical): Did fgls()'s first attempt at optimization succeed? If FALSE, then during the second attempt, fgls() had to force the block matrices of the residual covariance matrix to be positive-definite, as described above, under "Details."

loglik

The negative loglikelihood, at the solution. If the residual-covariance parameters were estimated, then it equals -1 times the maximized joint loglikelihood of those parameters and the regression coefficients. If the residual-covariance parameter values were provided with argument vmat or theta, then it equals -1 times the maximized joint loglikelihood of the regression coefficients, conditional on the values supplied for the residual-covariance parameters.

sigma

The residual covariance matrix. It is of class 'bdsmatrix'. Its row and column names are taken from the column named "ID", if any, in argument data, otherwise its row and column names are sequential numbers. One of its slots, sigma@blocks, can be written to a single-column text file and subsequently read in by gls.batch(). Due to its potential size, it is not advised to return sigma to R's standard output or print it to the console.

hessian

If get.hessian=TRUE and residual-covariance parameters were estimated, the Hessian matrix from optim() for those parameters; NULL otherwise.

n

Sample size (i.e., number of individual participants), after excluding those with missing data (NA's).

df.residual

Residual degrees of freedom in the feasible generalized-least-squares regression, as returned by lm(). Note that it only reflects the number of regression coefficients, and not the number of residual-covariance parameters that were estimated.

residuals

Residuals from the feasible generalized-least-squares regression. It is a vector of length n, i.e. it is not padded with NA's for participants with missing data.

fitted.values

Predicted phenotype scores from the feasible generalized-least-squares regression. It is a vector of length n, i.e. it is not padded with NA's for participants with missing data.

variance

The estimated covariance matrix for (the sampling distribution of) the fixed-effects regression coefficients.

call

Echo of fgls() function call.

fgls()

Details

Function fgls() was originally intended to be called automatically, from within gls.batch(). However, calling it directly is likely to be useful to advanced users. The difficulty when directly invoking fgls() is supplying the function with arguments tlist and sizelist. But, these can be obtained easily via gls.batch.get().

When residual-covariance parameters are to be estimated, fgls() will attempt optimization, at most, two times. If the initial attempt fails, fgls() prints a message saying so to the console, and tries a second time. On the second attempt, before each evaluation of the objective function, the blocks composing the block-diagonal residual covariance matrix are forced to be positive definite. This uses nearPD() from the Matrix package, which turns each block matrix into its nearest positive-definite approximation (where "nearest" is meant in a least-squares sense). Forcing positive-definiteness in this way is only used for the second attempt, and not for the initial attempt (which has its own way of ensuring a positive-definite solution), since it slows down optimization and is unnecessary when the parameters are well-identified. Furthermore, it can have consequences the user might not expect. For instance, in fgls()'s output (see below, under "Value"), the elements of the residual covariance matrix sigma might not correspond to the parameter estimates in estimates, or covariances that are supposed to be the same across families might not be so in the actual matrix sigma. Nevertheless, the second attempt may succeed when the initial attempt fails.

When med="UN", the residual covariance matrix is constructed from, at most, 12 parameters--8 correlations and 4 variances. Below is an enumerated list of those 12 parameters, in which the number of each list entry is the index (serial position) of that parameter, and the quoted text is the element name of each estimated parameter as it appears in fgls() output:

"cor(m,f)", correlation between mothers and fathers.
"cor(c/b,m)", correlation between biological offspring and mothers.
"cor(c/b,f)", correlation between biological offspring and fathers.
"cor(c,c)", MZ-twin correlation.
"cor(b,b)", full-sibling (DZ-twin) correlation.
"cor(a,m)", correlation between adoptees and mothers.
"cor(a,f)", correlation between adoptees and fathers.
"cor(a,a)", adoptive-sibling correlation.
"var(O)", offspring variance.
"var(m)", mother variance.
"var(f)", father variance.
"var(ind)", variance for "independent observations."

When med="VC", the residual covariance matrix is constructed from, at most, 3 variance components. Below is an enumerated list of those 3 parameters, in which the number of each list entry is the index (serial position) of that parameter, and the quoted text is the label of each estimated parameter as it appears in fgls() output:

"A", additive-genetic variance.
"C", shared-environmental variance (compound-symmetric within families).
"E", unshared-environmental variance (which cannot be dropped).

Additive-genetic variance contributes to covariance between family members commensurately to the expected proportion of segregating alleles they share: 1.0 for MZ twins, 0.5 for first-degree relatives, 0 for spouses and adoptive relatives. Shared-environmental variance, as defined here, represents covariance between biologically unrelated family members (including spouses).

In package version 1.0, arguments subset and na.action were accepted, and passed to lm(). Neither are accepted any longer. Subsetting should be done before directly calling fgls(); the function handles NA's in the data by what is (in effect) na.action=na.omit.

References

Li X, Basu S, Miller MB, Iacono WG, McGue M: A Rapid Generalized Least Squares Model for a Genome-Wide Quantitative Trait Association Analysis in Families. Human Heredity 2011;71:67-82 (DOI: 10.1159/000324839)

Buse, A: Goodness of Fit in Generalized Least Squares Estimation The American Statistician 1973;27:106-108

Examples

Run this code

data(pheno)
data(geno)
data(map)
data(pedigree)
data(rescovmtx)
foo <- gls.batch.get(
  phenfile=pheno,genfile=data.frame(t(geno)),pedifile=pedigree,
  covmtxfile.in=NULL,theta=NULL,snp.names=map[,2],input.mode=c(1,2,3),
  pediheader=FALSE,pedicolname=c("FAMID","ID","PID","MID","SEX"),
  sep.phe=" ",sep.gen=" ",sep.ped=" ",
  phen="Zscore",covars="IsFemale",med=c("UN","VC"),
  outfile,col.names=TRUE,return.value=FALSE,
  covmtxfile.out=NULL,
  covmtxparams.out=NULL,
  sizeLab=NULL,Mz=NULL,Bo=NULL,Ad=NULL,Mix=NULL,indobs=NULL)

bar <- fgls(
  Zscore ~ rs3934834 + IsFemale, data=foo$test.dat, tlist=foo$tlist,
  sizelist=foo$sizelist,med=c("UN","VC"), 
  vmat=rescovmtx, #<--Resid. cov. matrix from fgls onto IsFemale only.
  start=NULL, theta=NULL, drop=NULL, get.hessian=FALSE, 
  optim.method="BFGS", control=list(), weights=NULL,
  sizeLab=NULL,Mz=NULL,Bo=NULL,Ad=NULL,Mix=NULL,indobs=NULL)

bar$ctable
##  To simultaneously estimate residual covariance matrix
##  and regression coefficients for rs3934834 & IsFemale,
##  use the same syntax, except with vmat = NULL .

Run the code above in your browser using DataLab