RGWAS.normal.interaction: Perform normal GWAS including interaction (test each single SNP)

Description

This function performs single-SNP GWAS (genome-wide association studies), including the interaction between SNP and genetic background (or other environmental factors). The model of GWAS is quite similar to the one in the `RGWAS.normal` function:

$$y = X \beta + S _ {i} \alpha _ {i} + Q v + Z u + \epsilon,$$

where $y$ is the vector of phenotypic values; $X \beta$, $S _ {i} \alpha _ {i}$, and $Q v$ are the fixed-effect terms; $Z u$ is the random-effect term; and $\epsilon$ is the vector of residuals. $X \beta$ represents all fixed effects other than the effects of the tested SNP and of population structure, and this term often also serves as the intercept. The term $S _ {i} \alpha _ {i}$ is the only difference from the model for normal single-SNP GWAS. Here, $S _ {i}$ contains the $i$th marker of the genotype data together with the interaction variables between that marker and the matrix representing the genetic background (or some environmental factors), and $\alpha _ {i}$ is the vector of corresponding effects of the marker and the interaction terms. $Q v$ corrects for the effect of population structure. $Z u$ is the term of polygenic effects, where $u$ is assumed to follow a multivariate normal distribution whose variance-covariance matrix is proportional to the genetic covariance matrix: $u \sim MVN (0, K \sigma_{u}^{2})$. Finally, the residuals are assumed to be independently and identically normally distributed: $\epsilon \sim MVN (0, I \sigma_{e}^{2})$.
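As a rough illustration of what $S _ {i}$ contains, the following base R sketch builds it for one toy marker. The object names (`snp`, `background`, `S.i`) are hypothetical and the background matrix is random; this is not the code RAINBOWR runs internally.

```r
# Toy data: n = 6 lines, one SNP coded -1/0/1, and q = 2 background variables
# (for example, principal components of the genomic relationship matrix).
set.seed(1)
n <- 6
snp <- c(-1, 0, 1, 1, -1, 0)
background <- matrix(rnorm(n * 2), nrow = n,
                     dimnames = list(NULL, c("PC1", "PC2")))

# S_i: the marker itself plus one column per SNP-by-background interaction.
# Multiplying a length-n vector by an n x q matrix scales every column by the SNP.
S.i <- cbind(SNP = snp, snp * background)
colnames(S.i)[-1] <- paste0("SNPx", colnames(background))

dim(S.i)  # n rows, 1 + q columns
```

With $q$ interaction variables, $\alpha _ {i}$ therefore has $1 + q$ elements: one SNP effect and $q$ interaction effects.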

Usage

RGWAS.normal.interaction(
  pheno,
  geno,
  ZETA = NULL,
  package.MM = "gaston",
  covariate = NULL,
  covariate.factor = NULL,
  structure.matrix = NULL,
  interaction.with.SNPs = NULL,
  interaction.mat.method = "PCA",
  n.interaction.element = 1,
  interaction.group = NULL,
  n.interaction.group = 3,
  interaction.group.method = "find.clusters",
  n.PC.dapc = 1,
  test.method.interaction = "simultaneous",
  n.PC = 0,
  min.MAF = 0.02,
  P3D = TRUE,
  n.core = 1,
  parallel.method = "mclapply",
  sig.level = 0.05,
  method.thres = "BH",
  plot.qq = TRUE,
  plot.Manhattan = TRUE,
  plot.method = 1,
  plot.col1 = c("dark blue", "cornflowerblue"),
  plot.col2 = 1,
  plot.type = "p",
  plot.pch = 16,
  saveName = NULL,
  main.qq = NULL,
  main.man = NULL,
  plot.add.last = FALSE,
  return.EMM.res = FALSE,
  optimizer = "nlminb",
  thres = TRUE,
  skip.check = FALSE,
  verbose = TRUE,
  verbose2 = FALSE,
  count = TRUE,
  time = TRUE
)

Value

$D

List of data.frame objects, one for each tested effect. Each contains the map information you input and the corresponding RGWAS results (-log10(p)).

$thres

A matrix which contains the information of threshold determined by FDR = 0.05. (each trait x each tested effect)

$EMM.res

A list containing the results of the "EMM" step performed first in regular GWAS. For details, see the description of the function "EMM1" or "EMM2".

Arguments

pheno

Data frame where the first column is the line name (gid). Each remaining column should be a phenotype to test.

geno

Data frame with the marker names in the first column. The second and third columns contain the chromosome and map position. Columns 4 and higher contain the marker scores for each line, coded as [-1, 0, 1] = [aa, Aa, AA].
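A minimal sketch of the two input layouts described above, using made-up values. The column names (`gid`, `trait1`, `marker`, `chrom`, `pos`) and line names are hypothetical; only the column order and the [-1, 0, 1] coding follow the descriptions of `pheno` and `geno`.

```r
line.names <- c("line1", "line2", "line3")

# pheno: first column = line name, remaining columns = phenotypes
pheno <- data.frame(gid = line.names,
                    trait1 = c(10.2, 11.5, 9.8))

# geno: marker name, chromosome, position, then one score column per line
geno <- data.frame(marker = c("snp1", "snp2"),
                   chrom = c(1, 1),
                   pos = c(1000, 2500),
                   line1 = c(-1, 1),
                   line2 = c(0, 1),
                   line3 = c(1, -1))

# The score columns of geno should correspond to the lines listed in pheno
all(colnames(geno)[-(1:3)] == as.character(pheno$gid))
```

In practice, `modify.data` with `return.GWAS.format = TRUE` (see Examples) produces objects in this layout.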

ZETA

A list of covariance (relationship) matrices (K: $m \times m$) and their design matrices (Z: $n \times m$) for the random effects. The elements of each list must be named "Z" and "K". You can use more than one kernel matrix. For example,

ZETA = list(A = list(Z = Z.A, K = K.A), D = list(Z = Z.D, K = K.D))

Z.A, Z.D: Design matrix ($n \times m$) for the random effects. So, in many cases, you can use the identity matrix.

K.A, K.D: Different kernels that express relationships between lines. For example, K.A is an additive relationship matrix for the covariance between lines, and K.D is a dominance relationship matrix.
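The structure of `ZETA` can be sketched with toy objects as follows. The relationship matrix here is a simple cross-product of random marker scores, used only as a stand-in for a real kernel (for example, one from `calcGRM`), and the line names are hypothetical.

```r
set.seed(2)
n <- 5
line.names <- paste0("line", seq_len(n))

# Toy marker scores (n lines x 20 markers), coded -1/0/1
M <- matrix(sample(c(-1, 0, 1), n * 20, replace = TRUE), nrow = n,
            dimnames = list(line.names, NULL))

# Stand-in additive kernel (m x m, here m = n) and identity design matrix (n x m)
K.A <- tcrossprod(M) / ncol(M)
Z.A <- diag(n)
dimnames(Z.A) <- list(line.names, line.names)

# Each kernel is a list with elements named "Z" and "K"
ZETA <- list(A = list(Z = Z.A, K = K.A))
str(ZETA)
```

The identity design matrix applies when every phenotyped line appears exactly once and in the same order as the rows of the kernel.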

package.MM

The name of the package used to solve the mixed-effects model. Only the following three packages are offered: "RAINBOWR", "MM4LMM" and "gaston". The default is "gaston". See EM3.general for more details.

covariate

An $n \times 1$ vector or an $n \times p _ 1$ matrix. You can supply continuous values, such as other traits or genotype scores for specific markers. This argument is treated as one of the fixed effects.

covariate.factor

An $n \times p _ 2$ data frame. Each column should be a factor vector. RGWAS converts this argument into a model matrix, which is then included in the model as fixed effects.
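The conversion of factors into a model matrix can be illustrated with base R's `model.matrix`. This is only a sketch of the general idea with hypothetical factors (`site`, `year`); the exact call and contrasts used inside RAINBOWR may differ.

```r
# Two factor covariates for n = 6 lines
covariate.factor <- data.frame(
  site = factor(c("A", "A", "B", "B", "C", "C")),
  year = factor(c("2019", "2020", "2019", "2020", "2019", "2020"))
)

# Expand the factors into dummy (indicator) columns with treatment contrasts
fixed.mat <- model.matrix(~ ., data = covariate.factor)
colnames(fixed.mat)
```

Each factor with $k$ levels contributes $k - 1$ columns under the default treatment contrasts, in addition to the intercept column.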

structure.matrix

You can use a structure matrix calculated by structure analysis when there is population structure. You should not use this argument with n.PC > 0.

interaction.with.SNPs

An $m \times q$ matrix. The interaction between each SNP and this matrix will also be tested. For example, principal components of the genomic relationship matrix can be used as this matrix to test the interaction between SNPs and the genetic background. Alternatively, you can test the interaction with environmental factors by inputting omics data that represent the environment (a test including GxE effects).

interaction.mat.method

Method to compute `interaction.with.SNPs` when `interaction.with.SNPs` is NULL. We offer the following four different methods:

"PCA": Principal component analysis of the genomic relationship matrix (`K` in `ZETA`) using the `prcomp` function

"LDA": Linear discriminant analysis with the genomic relationship matrix (`K` in `ZETA`) as independent variables and some group information (`interaction.group`) as dependent variables, using the `lda` function

"GROUP": Dummy variables for some group information (`interaction.group`)

"DAPC": Perform LDA on the principal components from PCA of the genomic relationship matrix (`K` in `ZETA`) using the `dapc` function in the `adegenet` package. See Jombart et al., 2010 and dapc for more details.

n.interaction.element

Number of elements (variables) that are included in the model as interaction term for `interaction.with.SNPs`. If `interaction.with.SNPs = NULL` and `n.interaction.element = 0`, then the standard SNP-based GWAS will be performed by `RGWAS.normal` function.
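To make the "PCA" and "GROUP" options more concrete, here is a hedged base R sketch of how an interaction matrix with `n.interaction.element` columns could be derived from a kernel, and how group labels become dummy variables. The kernel is a toy cross-product matrix, and any centering, scaling, or column selection that RAINBOWR applies internally is not reproduced here.

```r
set.seed(3)
n <- 8
M <- matrix(sample(c(-1, 0, 1), n * 50, replace = TRUE), nrow = n)
K <- tcrossprod(M) / ncol(M)  # toy stand-in for `K` in `ZETA`

# "PCA": keep the first n.interaction.element principal components of K
n.interaction.element <- 2
pca.res <- prcomp(K)
interaction.pca <- pca.res$x[, seq_len(n.interaction.element), drop = FALSE]

# "GROUP": one indicator column per group level
interaction.group <- factor(rep(c("g1", "g2"), each = 4))
interaction.dummy <- model.matrix(~ interaction.group - 1)

dim(interaction.pca)
dim(interaction.dummy)
```

Either matrix has one row per line, so it can be multiplied element-wise with a SNP score vector to form the interaction columns.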

interaction.group

When you use "LDA", "GROUP", or "DAPC", information on groups (e.g., subgroups of the population) is required. You can set a vector of group names (or clustering ids) for each genotype as this argument. This vector should be a factor.

n.interaction.group

When `interaction.group = NULL`, `interaction.group` will be automatically determined by using k-medoids method (`pam` function in `cluster` package). You should specify the number of groups by this argument to decide `interaction.group`.

interaction.group.method

The method used for clustering when `interaction.group = NULL`. We offer the following two methods: "find.clusters" and "pam". "find.clusters" uses the `adegenet::find.clusters` function to conduct successive K-means clustering, and "pam" uses the `cluster::pam` function to conduct k-medoids clustering. See find.clusters and pam for more details.

n.PC.dapc

Number of principal components to be used for `adegenet::find.clusters` or `adegenet::dapc` functions.

test.method.interaction

Method for testing SNPs and the interactions between SNPs and the genetic background. We offer the following three methods:

"simultaneous": All effects (including SNP effects) are tested simultaneously.

"snpSeparate": SNP effects are tested as one effect, and the other interaction effects are tested simultaneously.

"oneByOne": All effects are tested separately, one by one.

n.PC

Number of principal components to include as fixed effects. Default is 0 (equals K model).

min.MAF

Specifies the minimum minor allele frequency (MAF). If a marker has a MAF less than min.MAF, it is assigned a zero score.
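For scores coded as [-1, 0, 1], the MAF of a marker can be computed from the mean score. The sketch below uses a hypothetical toy matrix `geno.score` (markers in rows) and only flags which markers pass the filter; it does not reproduce how RGWAS assigns the zero score.

```r
# Three toy markers scored on eight lines
geno.score <- rbind(
  snp1 = c(-1, -1, 0, 1, 1, 1, -1, 0),
  snp2 = c(1, 1, 1, 1, 1, 1, 1, -1),
  snp3 = c(1, 1, 1, 1, 1, 1, 1, 1)
)

# Frequency of the allele coded as 1: a mean score of -1 maps to 0, and 1 maps to 1
allele.freq <- (rowMeans(geno.score) + 1) / 2

# The minor allele frequency is the smaller of the two allele frequencies
maf <- pmin(allele.freq, 1 - allele.freq)

min.MAF <- 0.02
keep <- maf >= min.MAF  # markers failing this filter would get a zero score
data.frame(maf, keep)
```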

P3D

When P3D = TRUE, variance components are estimated by REML only once, without any markers in the model. When P3D = FALSE, variance components are estimated by REML for each marker separately.

n.core

Setting n.core > 1 will enable parallel execution on a machine with multiple cores. This argument is not valid when `parallel.method = "furrr"`.

parallel.method

Method for parallel computation. We offer three methods, "mclapply", "furrr", and "foreach".

When `parallel.method = "mclapply"`, we utilize pbmclapply function in the `pbmcapply` package with `count = TRUE` and mclapply function in the `parallel` package with `count = FALSE`.

When `parallel.method = "furrr"`, we utilize future_map function in the `furrr` package. With `count = TRUE`, we also utilize progressor function in the `progressr` package to show the progress bar, so please install the `progressr` package from github (https://github.com/HenrikBengtsson/progressr). With `parallel.method = "furrr"`, you can perform multi-thread parallelization with shared memory, which saves memory but is considerably slower than `parallel.method = "mclapply"`.

When `parallel.method = "foreach"`, we utilize foreach function in the `foreach` package with the utilization of makeCluster function in `parallel` package, and registerDoParallel function in `doParallel` package. With `count = TRUE`, we also utilize setTxtProgressBar and txtProgressBar functions in the `utils` package to show the progress bar.

We recommend `parallel.method = "mclapply"`, but this parallelization method is not supported on Windows. If you are a Windows user, we recommend `parallel.method = "foreach"` instead.
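The recommendation above can be turned into a small platform check before calling the function. This is only a convenience sketch; leaving one core free is an arbitrary choice made here, not a RAINBOWR default.

```r
# Pick the parallel backend according to the operating system
parallel.method <- if (.Platform$OS.type == "windows") "foreach" else "mclapply"

# Use all but one of the detected cores, falling back to 1 if detection fails
n.detected <- parallel::detectCores()
n.core <- if (is.na(n.detected)) 1L else max(1L, n.detected - 1L)

parallel.method
n.core
```

Both values can then be passed to the `parallel.method` and `n.core` arguments.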

sig.level

Significance level for the threshold. The default is 0.05.

method.thres

Method for determining the threshold of significance. "BH" and "Bonferroni" are offered.
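The difference between the two corrections can be seen on a handful of made-up p-values using base R's `p.adjust`. This sketch shows only the general principle; how RGWAS converts the BH criterion into a -log10(p) threshold line is not reproduced here.

```r
# Hypothetical p-values for eight markers
p.values <- c(1e-8, 2e-6, 0.0004, 0.02, 0.2, 0.5, 0.7, 0.9)
sig.level <- 0.05

# Bonferroni: a fixed threshold on the -log10(p) scale
thres.bonf <- -log10(sig.level / length(p.values))

# Markers declared significant under each correction
signif.bh <- p.adjust(p.values, method = "BH") <= sig.level
signif.bonf <- p.adjust(p.values, method = "bonferroni") <= sig.level

data.frame(p.values, signif.bh, signif.bonf)
```

BH controls the false discovery rate and is less conservative than Bonferroni, so it never declares fewer markers significant at the same `sig.level`.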

plot.qq

If TRUE, draw qq plot.

plot.Manhattan

If TRUE, draw manhattan plot.

plot.method

If this argument = 1, the default manhattan plot will be drawn. If this argument = 2, a manhattan plot with an axis based on position (bp) will be drawn, and the color changes for every chromosome.

plot.col1

This argument determines the colors of the manhattan plot. Supply a color vector of length 2: plot.col1[1] is used for odd chromosomes and plot.col1[2] for even chromosomes.

plot.col2

Color of the manhattan plot. The color changes with each chromosome, starting from plot.col2 + 1 (so plot.col2 = 1 means the colors start from red).

plot.type

This argument determines the type of the manhattan plot. See the help page of "plot".

plot.pch

This argument determines the shape of the dot of the manhattan plot. See the help page of "plot".

saveName

When drawing any plot, you can save it in png format. Set saveName to the file name you want to save under. When saveName = NULL, the plot is not saved.

main.qq

The title of qq plot. If this argument is NULL, trait name is set as the title.

main.man

The title of manhattan plot. If this argument is NULL, trait name is set as the title.

plot.add.last

If saveName is not NULL and this argument is TRUE, then you can add lines or dots to manhattan plots. However, you should also write "dev.off()" after adding something.

return.EMM.res

When return.EMM.res = TRUE, the results of equation of mixed models are included in the result of RGWAS.

optimizer

The function used in the optimization process. We offer "optim", "optimx", and "nlminb" functions. This argument is only valid when `package.MM = 'RAINBOWR'`.

thres

If thres = TRUE, the threshold of the manhattan plot is included in the result of RGWAS. When return.EMM.res or thres is TRUE, the results will be "list" class.

skip.check

By default, RAINBOWR checks the type of the input data and converts it into the correct format. This takes some time, so if your input data are already in the correct format, you can skip this procedure by setting `skip.check = TRUE`.

verbose

If this argument is TRUE, messages for the current steps will be shown.

verbose2

If this argument is TRUE, a welcome message will be shown.

count

When count is TRUE, the progress of RGWAS is displayed as a percentage.

time

When time is TRUE, the time taken to perform RGWAS is displayed.

Details

The p-value for each marker is calculated by an F-test based on the following F-value (Kennedy et al., 1992).

$$F = \frac { ( L' \hat { b } )' [ L' ( X' H ^ { - 1 } X ) ^ { - 1 } L ] ^ { - 1 } ( L' \hat { b } ) } { f \hat { \sigma }_ { u } ^ { 2 } },$$

where $b$ is the vector of coefficients of the fixed effects, which combines $\beta$, $\alpha _ {i}$, and $v$, and $L$ is a matrix indicating which effects in $b$ are tested. In this formula, $X$ denotes the design matrix of all fixed effects corresponding to $b$. $H$ is the estimated variance-covariance matrix of the phenotypic values divided by $\sigma _ { u } ^ { 2 }$, and is calculated as $H = Z K Z' + \hat{\lambda} I$, where $\hat{\lambda}$ is the maximum likelihood estimator of the ratio of the residual variance to the additive genetic variance. $\hat{b}$ is the maximum likelihood estimator of $b$ and is calculated as $\hat { b } = ( X' H ^ { - 1 } X ) ^ { - 1 } X' H ^ { - 1 } y$. $f$ is the number of fixed effects to be tested, and $\hat { \sigma }_ { u } ^ { 2 }$ is estimated by the following formula.

$$\hat { \sigma }_ { u } ^ { 2 } = \frac { ( y - X \hat { b } )' H ^ { - 1 } ( y - X \hat { b } ) } { n - p },$$

where $n$ is the sample size and $p$ is the total number of fixed effects. Each p-value is calculated using the fact that the above F-value follows the F distribution with ($f$, $n - p$) degrees of freedom.
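The formulas above can be traced step by step in base R. The sketch below uses simulated data, a toy kernel, and an assumed value of $\hat{\lambda}$ (in RGWAS it is estimated by the mixed-model solver); the helper `f.test` and the choice of $L$ for each `test.method.interaction` option are illustrative interpretations, not the package's internal code.

```r
set.seed(4)
n <- 40
M <- matrix(sample(c(-1, 0, 1), n * 100, replace = TRUE), nrow = n)
K <- tcrossprod(M) / ncol(M)  # toy kernel
Z <- diag(n)
lambda.hat <- 1               # assumed here; normally estimated

# Fixed effects: intercept, one SNP, and its interactions with two PCs of K
snp <- M[, 1]
pc <- prcomp(K)$x[, 1:2]
X <- cbind(Intercept = 1, SNP = snp,
           SNPxPC1 = snp * pc[, 1], SNPxPC2 = snp * pc[, 2])
y <- 0.5 * snp + rnorm(n)

# H = Z K Z' + lambda I, then the generalized least squares estimate of b
H.inv <- solve(Z %*% K %*% t(Z) + lambda.hat * diag(n))
XtHinvX.inv <- solve(t(X) %*% H.inv %*% X)
b.hat <- XtHinvX.inv %*% t(X) %*% H.inv %*% y
p <- ncol(X)
resid <- y - X %*% b.hat
sigma2.u <- drop(t(resid) %*% H.inv %*% resid) / (n - p)

# F-test for the effects named in `tested`; L selects them from b
f.test <- function(tested) {
  L <- diag(p)[, match(tested, colnames(X)), drop = FALSE]
  f <- ncol(L)
  Lb <- t(L) %*% b.hat
  F.value <- drop(t(Lb) %*% solve(t(L) %*% XtHinvX.inv %*% L) %*% Lb) /
    (f * sigma2.u)
  c(F = F.value, p = pf(F.value, f, n - p, lower.tail = FALSE))
}

# One possible reading of the three test.method.interaction options
res.simultaneous <- f.test(c("SNP", "SNPxPC1", "SNPxPC2"))
res.snpSeparate <- rbind(SNP = f.test("SNP"),
                         interaction = f.test(c("SNPxPC1", "SNPxPC2")))
res.oneByOne <- t(sapply(c("SNP", "SNPxPC1", "SNPxPC2"), f.test))
```

Under this reading, "simultaneous" yields one test per marker, "snpSeparate" two, and "oneByOne" one test per column of $S _ {i}$, which matches the list of per-effect results returned in `$D`.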

References

Kennedy, B.W., Quinton, M. and van Arendonk, J.A. (1992) Estimation of effects of single genes on quantitative traits. J Anim Sci. 70(7): 2000-2012.

Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci. 100(16): 9440-9445.

Yu, J. et al. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 38(2): 203-208.

Kang, H.M. et al. (2008) Efficient Control of Population Structure in Model Organism Association Mapping. Genetics. 178(3): 1709-1723.

Kang, H.M. et al. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 42(4): 348-354.

Zhang, Z. et al. (2010) Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 42(4): 355-360.

Endelman, J.B. (2011) Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Genome J. 4(3): 250.

Endelman, J.B. and Jannink, J.L. (2012) Shrinkage Estimation of the Realized Relationship Matrix. G3 Genes, Genomes, Genet. 2(11): 1405-1413.

Zhou, X. and Stephens, M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 44(7): 821-824.

Jombart, T., Devillard, S. and Balloux, F. (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 11(1): 94.

Examples


# \dontshow{
  ### Import RAINBOWR
  require(RAINBOWR)

  ### Load example datasets
  data("Rice_Zhao_etal")
  Rice_geno_score <- Rice_Zhao_etal$genoScore
  Rice_geno_map <- Rice_Zhao_etal$genoMap
  Rice_pheno <- Rice_Zhao_etal$pheno


  ### Select one trait for example
  trait.name <- "Flowering.time.at.Arkansas"
  y <- as.matrix(Rice_pheno[1:30, trait.name, drop = FALSE])
  # use first 30 accessions

  ### Remove SNPs whose MAF <= 0.05
  x.0 <- t(Rice_geno_score)
  MAF.cut.res <- MAF.cut(x.0 = x.0, map.0 = Rice_geno_map)
  x <- MAF.cut.res$x
  map <- MAF.cut.res$map


  ### Estimate genomic relationship matrix (GRM)
  K.A <- calcGRM(genoMat = x)


  ### Modify data
  modify.data.res <- modify.data(pheno.mat = y, geno.mat = x, map = map,
                                 return.ZETA = TRUE, return.GWAS.format = TRUE)
  pheno.GWAS <- modify.data.res$pheno.GWAS
  geno.GWAS <- modify.data.res$geno.GWAS
  ZETA <- modify.data.res$ZETA



  ### Perform single-SNP GWAS with interaction
  ### by testing all effects (including SNP effects) simultaneously
  normal.res.int <-
    RGWAS.normal.interaction(
      pheno = pheno.GWAS,
      geno = geno.GWAS,
      ZETA = ZETA,
      interaction.with.SNPs = NULL,
      interaction.mat.method = "PCA",
      n.interaction.element = 3,
      interaction.group = NULL,
      n.interaction.group = 3,
      interaction.group.method = "find.clusters",
      n.PC.dapc = 3,
      test.method.interaction = "simultaneous",
      n.PC = 3,
      P3D = TRUE,
      plot.qq = FALSE,
      plot.Manhattan = FALSE,
      verbose = FALSE,
      verbose2 = FALSE,
      count = FALSE,
      time = FALSE,
      package.MM = "gaston",
      parallel.method = "mclapply",
      skip.check = TRUE,
      n.core = 1
    )
# }


# \donttest{
  ### Import RAINBOWR
  require(RAINBOWR)

  ### Load example datasets
  data("Rice_Zhao_etal")
  Rice_geno_score <- Rice_Zhao_etal$genoScore
  Rice_geno_map <- Rice_Zhao_etal$genoMap
  Rice_pheno <- Rice_Zhao_etal$pheno

  ### View each dataset
  See(Rice_geno_score)
  See(Rice_geno_map)
  See(Rice_pheno)

  ### Select one trait for example
  trait.name <- "Flowering.time.at.Arkansas"
  y <- as.matrix(Rice_pheno[, trait.name, drop = FALSE])

  ### Remove SNPs whose MAF <= 0.05
  x.0 <- t(Rice_geno_score)
  MAF.cut.res <- MAF.cut(x.0 = x.0, map.0 = Rice_geno_map)
  x <- MAF.cut.res$x
  map <- MAF.cut.res$map


  ### Estimate genomic relationship matrix (GRM)
  K.A <- calcGRM(genoMat = x)


  ### Modify data
  modify.data.res <-
    modify.data(
      pheno.mat = y,
      geno.mat = x,
      map = map,
      return.ZETA = TRUE,
      return.GWAS.format = TRUE
    )
  pheno.GWAS <- modify.data.res$pheno.GWAS
  geno.GWAS <- modify.data.res$geno.GWAS
  ZETA <- modify.data.res$ZETA


  ### View each data for RAINBOWR
  See(pheno.GWAS)
  See(geno.GWAS)
  str(ZETA)



  ### Perform single-SNP GWAS with interaction
  ### by testing all effects (including SNP effects) simultaneously
  normal.res.int <-
    RGWAS.normal.interaction(
      pheno = pheno.GWAS,
      geno = geno.GWAS,
      ZETA = ZETA,
      interaction.with.SNPs = NULL,
      interaction.mat.method = "PCA",
      n.interaction.element = 3,
      interaction.group = NULL,
      n.interaction.group = 3,
      interaction.group.method = "find.clusters",
      n.PC.dapc = 3,
      test.method.interaction = "simultaneous",
      n.PC = 3,
      P3D = TRUE,
      plot.qq = TRUE,
      plot.Manhattan = TRUE,
      verbose = TRUE,
      verbose2 = FALSE,
      count = TRUE,
      time = TRUE,
      package.MM = "gaston",
      parallel.method = "mclapply",
      skip.check = TRUE,
      n.core = 2
    )
  See(normal.res.int$D[[1]])  ### Column 4 contains -log10(p) values
                              ### for all effects (including SNP effects)
# }
