GWAS2: Genome wide association study

Description

This function is deprecated. Use GWAS instead. Now the GWAS function can run both types of models; formula-based and matrix-based models. Type ?GWAS.

For tutorials on how to perform different analysis with sommer please look at the vignettes by typing in the terminal:

vignette("sommer.start")

vignette("sommer")

Usage

GWAS2(fixed, random, rcov, data, weights, 
    iters=20, tolpar = 1e-03, tolparinv = 1e-06, 
    init=NULL, constraints=NULL,method="NR", 
    getPEV=TRUE,na.method.X="exclude",
    na.method.Y="exclude",return.param=FALSE, 
    date.warning=TRUE,verbose=TRUE,
    M=NULL, gTerm=NULL, n.PC = 0, min.MAF = 0.05, 
    n.core = 1, P3D = TRUE)

Arguments

fixed

a formula specifying the response variable(s) and fixed effects, i.e:

Yield ~ Location for univariate models

cbind(Yield,color) ~ Location for multivariate models.

random

a formula specifying the name of the random effects, i.e. random= ~ genotype + year.

Useful functions can be used to fit heterogeneous variances and other special models (see 'Special Functions' in the Details section for more information):

vs(...,Gu,Gt,Gtc) is the main function to specify special variance-covariance structures for random effects. On the ... argument you provide the unknown variance-covariance structures (i.e. us,ds,at,cs) and the random effect where such covariance structure will be used (the random effect of interest).

at(x,levs) can be used to specify heterogeneous variance for specific levels of a random effect

ds(x), us(x), cs(x) can be used to specify unknown diagonal, unstructured and customized covariance structures respectively among levels of a random effect to be estimated by REML.

overlay(...,rlist,prefix) can be used to specify overlay of design matrices of random effects

spl2D(...) can be used to fit a 2-dimensional spline (i.e. spatial modeling; see Special functions section below).

rcov

a formula specifying the name of the error term, i.e. rcov= ~ units.

The functions that can be used to fit heterogeneous residual variances are the same used on the random term but the random effect is always "units", i.e. rcov=~vs(ds(Location),units)

data

a data frame containing the variables specified in the formulas for response, fixed, and random effects.

weights

name of the covariate for weights. To be used for the product R = Wsi*R*Wsi, where * is the matrix product, Wsi is the square root of the inverse of W and R is the residual matrix.

iters

Maximum number of iterations allowed. Default value is 15.

tolpar

Convergence criteria.

tolparinv

tolerance parameter for matrix inverse used when singularities are encountered.

init

initial values for the variance components. By default this is NULL and variance components are estimated by the method selected, but in case the user want to provide initial values for ALL var-cov components this argument is functional. It has to be provided as a list or an array, where each list element is one variance component and if multitrait model is pursued each element of the list is a matrix of variance covariance components among traits. Initial values can also be provided in the Gt argument of the vs function.Is highly encouraged to use the Gt and Gtc arguments of the vs function instead of this argument

constraints

when initial values are provided these have to be accompanied by their constraints. See the vs function for more details on the constraints. Is highly encouraged to use the Gt and Gtc arguments of the vs function instead of this argument.

method

this refers to the method or algorithm to be used for estimating variance components. Direct-inversion Newton-Raphson NR and Average Information AI (Tunnicliffe 1989; Gilmour et al. 1995; Lee et al. 2015), and EMMA efficient mixed model association (Kang et al. 2008).

getPEV

a TRUE/FALSE value indicating if the program should return the predicted error variance and variance for random effects. This option is provided since this can take a long time for certain models where p > n by a big extent.

na.method.X

one of the two possible values; "include" or "exclude". If "include" is selected then the function will impute the X matrices for fixed effects with the median value. If "exclude" is selected it will get rid of all rows with missing values for the X (fixed) covariates. The default is "exclude". The "include" option should be used carefully.

na.method.Y

one of the three possible values; "include", "include2" or "exclude". If "include" is selected then the function will impute the response variables with the median value. The difference between "include" and "include2" is only available in the multitrait models when the imputation can happen for the entire matrix of responses or only for complete cases ("include2"). If "exclude" is selected it will get rid of rows in responses where missing values are present for the estimation of variance components. The default is "exclude".

return.param

a TRUE/FALSE value to indicate if the program should return the parameters used for modeling without fitting the model.

date.warning

a TRUE/FALSE value to indicate if the program should warn you when is time to update the sommer package.

verbose

a TRUE/FALSE value to indicate if the program should return the progress of the iterative algorithm.

The marker matrix containing the marker scores for each line, coded as -1,0,1 = aa,Aa,AA, individuals in rows and markers in columns. No additional columns should be provided, is a purely numerical matrix.

gTerm

a character vector indicating the genetic term in the model.

n.PC

Number of principal components to include as fixed effects. Default is 0 (equals K model).

min.MAF

Specifies the minimum minor allele frequency (MAF). If a marker has a MAF less than min.MAF, it is assigned a zero score.

n.core

Setting n.core > 1 will enable parallel execution on a machine with multiple cores (use only at UNIX command line).

P3D

When P3D=TRUE, variance components are estimated by REML only once, without any markers in the model. When P3D=FALSE, variance components are estimated by REML for each marker separately.

Value

If all parameters are correctly indicated the program will return a list with the following information:

the inverse of the phenotypic variance matrix V^- = (ZGZ+R)^-1

sigma

a list with the values of the variance-covariance components with one list element for each random effect.

sigma_scaled

a list with the values of the scaled variance-covariance components with one list element for each random effect.

sigmaSE

standard errors for the variance covariance components.

Beta

a data frame for trait BLUEs (fixed effects).

VarBeta

a variance-covariance matrix for trait BLUEs

a list (one element for each random effect) with a data frame for trait BLUPs.

VarU

a list (one element for each random effect) with the variance-covariance matrix for trait BLUPs.

PevU

a list (one element for each random effect) with the predicted error variance matrix for trait BLUPs.

fitted

Fitted values y.hat=XB

residuals

Residual values e = Y - XB

AIC

Akaike information criterion

BIC

Bayesian information criterion

convergence

a TRUE/FALSE statement indicating if the model converged.

monitor

The values of log-likelihood and variance-covariance components across iterations during the REML estimation.

method

The method for extimation of variance components specified by the user.

call

Formula for fixed, random and rcov used.

scores

marker scores (-log_10p) for the traits.

betasm

marker effects.

Fstatm

F statistic associate to the test.

r2m

R2 value for each marker.

Details

Special Functions

vs(at(x,levels),y)

can be used to specify heterogeneous variance for the "y"" factor covariate at specific levels of the factor covariate "x", i.e. random=~vs(at(Location,c("A","B")),ID) fits a variance component for ID at levels A and B of the factor covariate Location.

vs(ds(x),y)

can be used to specify a diagonal covariance structure for the "y"" covariate for all levels of the factor covariate "x", i.e. random=~vs(ds(Location,ID) fits a variance component for ID at all levels of the factor covariate Location.

vs(us(x),y)

can be used to specify an unstructured covariance structure for the "y"" covariate for all levels of the factor covariate "x", i.e. random=~vs(us(Location),ID) fits variance and covariance components for ID at all levels of the factor covariate Location.

vs(overlay(...,rlist=NULL,prefix=NULL))

can be used to specify overlay of design matrices between consecutive random effects specified, i.e. random=~overlay(male,female) overlays (overlaps) the incidence matrices for the male and female random effects to obtain a single variance component for both effects. The `rlist` argument is a list with each element being a numeric value that multiplies the incidence matrix to be overlayed. See overlay for details.Can be combined with vs().

vs(spl2D(x.coord, y.coord, at, at.levels, type, nseg, pord, degree, nest.div))

can be used to fit a 2-dimensional spline (i.e. spatial modeling) using coordinates x.coord and y.coord (in numeric class). The 2D spline can be fitted at specific levels using the at and at.levels arguments. For example random=~spl2D(x.coord=Row,y.coord=Range,at=FIELD).

For a short tutorial on how to use this special functions you can look at the vignettes by typing in the terminal:

vignette('sommer.start')

Bug report and contact

If you have any questions or suggestions please post it in https://stackoverflow.com and send me an email with the link at cova_ruber@live.com.mx

Example Datasets

The package has been equiped with several datasets to learn how to use the sommer package:

* DT_halfdiallel and DT_fulldiallel datasets have examples to fit half and full diallel designs.

* DT_h2 to calculate heritability

* DT_cornhybrids and DT_technow datasets to perform genomic prediction in hybrid single crosses

* DT_wheat dataset to do genomic prediction in single crosses in species displaying only additive effects.

* DT_cpdata dataset to fit genomic prediction models within a biparental population coming from 2 highly heterozygous parents including additive, dominance and epistatic effects.

* DT_polyploid to fit genomic prediction and GWAS2 analysis in polyploids.

* DT_gryphon data contains an example of an animal model including pedigree information.

* DT_btdata dataset contains an animal (birds) model.

Additional Functions

Other functions such as summary, fitted, randef (notice here is randef not ranef), anova, variogram, residuals, coef and plot applicable to typical linear models can also be applied to models fitted using the GWAS2-type of functions.

Additional functions for genetic analysis have been included such as heritability (h2.fun), build a genotypic hybrid marker matrix (build.HMM), plot of genetic maps (map.plot), creation of manhattan plots (manhattan). If you need to use pedigree you need to convert your pedigree into a relationship matrix (i.e. use the getA function from the pedigreemm package).

Useful functions for analyzing field trials are included such as the spl2D, spatPlots, and fill.design.

Models Enabled

For details about the models enabled and more information about the covariance structures please check the help page of the package (sommer).

References

Covarrubias-Pazaran G. Genome assisted prediction of quantitative traits using the R package sommer. PLoS ONE 2016, 11(6): doi:10.1371/journal.pone.0156744

Covarrubias-Pazaran G. 2018. Software update: Moving the R package sommer to multivariate mixed models for genome-assisted prediction. doi: https://doi.org/10.1101/354639

Bernardo Rex. 2010. Breeding for quantitative traits in plants. Second edition. Stemma Press. 390 pp.

Gilmour et al. 1995. Average Information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51(4):1440-1450.

Kang et al. 2008. Efficient control of population structure in model organism association mapping. Genetics 178:1709-1723.

Lee, D.-J., Durban, M., and Eilers, P.H.C. (2013). Efficient two-dimensional smoothing with P-spline ANOVA mixed models and nested bases. Computational Statistics and Data Analysis, 61, 22 - 37.

Lee et al. 2015. MTG2: An efficient algorithm for multivariate linear mixed model analysis based on genomic information. Cold Spring Harbor. doi: http://dx.doi.org/10.1101/027201.

Maier et al. 2015. Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am J Hum Genet; 96(2):283-294.

Rodriguez-Alvarez, Maria Xose, et al. Correcting for spatial heterogeneity in plant breeding experiments with P-splines. Spatial Statistics 23 (2018): 52-71.

Searle. 1993. Applying the EM algorithm to calculating ML and REML estimates of variance components. Paper invited for the 1993 American Statistical Association Meeting, San Francisco.

Yu et al. 2006. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Genetics 38:203-208.

Tunnicliffe W. 1989. On the use of marginal likelihood in time series model estimation. JRSS 51(1):15-27.

Zhang et al. 2010. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42:355-360.