FFBSKAT: Fast Family-Based SKAT, SKAT-O

Description

A fast kernel-based regional association analysis in related or population samples

Usage

FFBSKAT(formula, phenodata, genodata, kin = NULL, nullmod,
regions = NULL, sliding.window = c(20, 10), mode = "add",
ncores = 1, return.time = FALSE, kernel = "linear.weighted",
beta.par = c(1, 25), weights = NULL, method = "kuonen",
acc = 1e-8, lim = 1e+6, return.variance.explained = FALSE,
reml = TRUE, flip.genotypes = FALSE, impute.method = 'mean',
rho = FALSE, write.file = FALSE, ...)

Arguments

formula

referring to the column(s) in phenodata to be analyzed as outcome and, if needed, covariates.

phenodata

a data frame containing columns mentioned in formula: trait to analyze and, if needed, covariates. Individuals not measured for trait or covariates will be omitted.

genodata

an object with genotypes to analyze. Several formats are allowed: - a data frame or matrix (with individuals in the rows and genetic variants in the columns) containing genotypes coded as AA = 0, Aa = 1 and aa = 2, where a is a minor allele. - for PLINK binary data format, a character string indicating a *.bed file name (*.bim and *.fam files should have the same prefix). This will make use of read.plink() function. - for VCF format, a character string indicating a *vcf.gz file name. This will require seqminer R-package to be installed. Its readVCFToMatrixByGene() function will be used to read VCF file gene-wise. The function also requires a geneFile, a text file listing all genes in refFlat format (see Examples below). VCF file should be bgzipped and indexed by Tabix. - an object of gwaa.data or snp.data class (this will require GenABEL R-package to be installed).

kin

a square symmetric matrix giving the pairwise kinship coefficients between analyzed individuals. Under default kin = NULL all individuals will be considered as unrelated.

nullmod

an object containing parameter estimates under the null model. Setting nullmod allows to avoid re-estimation of the null model that does not depend on genotypes and can be calculated once for a trait. If not set, the null model parameters will be estimated within the function. The nullmod object in proper format can be obtained by null.model() function or any analysis function in FREGAT.

regions

an object assigning regions to be analyzed. This can be: - a vector of length equal to the number of genetic variants assigning the region for each variant (see Examples). - a data frame / matrix with names of genetic variants in the first column and names of regions in the second column (this format allows overlapping regions). - for VCF format, a character vector with names of genes to analyze. If NULL, sliding.window parameters will be used.

sliding.window

the sliding window size and step. Has no effect if regions is defined.

mode

the mode of inheritance: "add", "dom" or "rec" for additive, dominant or recessive mode, respectively. For dominant (recessive) mode genotypes will be recoded as AA = 0, Aa = 1 and aa = 1 (AA = 0, Aa = 0 and aa = 1), where a is a minor allele. Default mode is additive.

ncores

number of CPUs for parallel calculations. Default = 1.

return.time

a logical value indicating whether the running time should be returned.

kernel

one of "linear.weighted" (default), "quadratic", "IBS", "IBS.weighted", "2wayIX". See Details for "linear.weighted" kernel description and [Wu, et al., 2011] for other kernel types. "2wayIX" kernel considers SNP-SNP interaction terms along with main effects. For "linear.weighted" and "IBS.weighted" kernels, weights can be varied by defining weights or beta.par.

beta.par

two positive numeric shape parameters in the beta distribution to assign weights for each SNP in weighted kernels (see Details). Default = c(1, 25) is recommended for analysis of rare variants. Has no effect for unweighted kernels or if weights are defined.

weights

a numeric vector or a function of minor allele frequency (MAF) to assign weights for each genetic variant in the weighted kernels. Has no effect if one of unweighted kernels was chosen. If NULL, the weights will be calculated using the beta distribution (see Details).

method

either "kuonen" or "davies". Method for computing the P value (see Details). Default = "kuonen".

acc

accuracy parameter for "davies" method.

lim

limit parameter for "davies" method.

return.variance.explained

a logical value indicating whether the (marginal) variance explained by each region should be returned. Default = FALSE for faster performance.

reml

a logical value indicating whether the restricted maximum likelihood should be used to estimate the variance explained by each region. Default = TRUE for faster performance. Has no effect if return.variance.explained = FALSE.

flip.genotypes

a logical value indicating whether the genotypes of some genetic variants should be flipped (relabeled) to ensure that all minor allele frequencies (MAFs) < 0.5. Default = FALSE, with warning of any MAF > 0.5.

impute.method

a method for imputation of missing genotypes. It can be either "mean" (default) or "blue". If "mean" the genotypes will be imputed by the simple mean values. If "blue" the best linear unbiased estimates (BLUEs) of mean genotypes will be calculated taking into account the relationships between individuals [McPeek, et al., 2004, DOI: 10.1111/j.0006-341X.2004.00180.x] and used for imputation.

rho

If TRUE the "optimal" test (SKAT-O) is performed [Lee, et al., 2012]. rho can be a vector of grid values from 0 to 1. The default grid is (0 : 10) / 10, can be set to c(0, 0.1^2, 0.2^2, 0.3^2, 0.4^2, 0.5^2, 0.5, 1) to perform "optimal.adj" test.

write.file

output file name to write results as they come (sequential mode only).

...

other arguments that could be passed to null(), read.plink() and readVCFToMatrixByGene().

Value

A list with values:

results

a data frame containing P values, numbers of variants and polymorphic variants for each of analyzed regions. If return.variance.explained = TRUE it contains also the column with marginal amounts of variance explained by each region. If reml = FALSE the new estimates of heritability (h2) and total variance with corresponding total log-likelihood are also returned.

nullmod

an object containing the estimates of the null model parameters: heritability (h2), total variance (total.var), estimates of fixed effects of covariates (alpha), the gradient (df), and the total log-likelihood (logLH).

sample.size

the sample size after omitting NAs.

time

If return.time = TRUE a list with running times for null model, regional analysis and total analysis is returned. See proc.time() for output format.

Details

By default, FFBSKAT uses the linear weighted kernel function to set the inter-individual similarity matrix \(K = GWWG^T\) for SKAT and \(K = GW(I\rho + (1 - \rho)ee^T)WG^T\) for SKAT-O, where \(\mathit{G}\) is the \(\mathit{n\times p}\) genotype matrix for \(\mathit{n}\) individuals and \(\mathit{p}\) genetic variants in the region, \(\mathit{W}\) is the \(\mathit{p\times p}\) diagonal weight matrix, \(I\) is the \(\mathit{p\times p}\) identity matrix, \(\rho\) is pairwise correlation coefficient between genetic effects (which will be adaptively selected from given rho), and \(e\) is the \(\mathit{p\times 1}\) vector of ones. Given the shape parameters of the beta function, beta.par = c(a, b), the weights are defined using probability density function of the beta distribution:

\(W_{i}=(B(a,b))^{^{-1}}MAF_{i}^{a-1}(1-MAF_{i})^{b-1} \),

where \(MAF_{i}\) is a minor allelic frequency for the \(i^{th}\) genetic variant in the region, which is estimated from genotypes, and \(B(a,b)\) is the beta function. This way of defining weights is the same as in original SKAT (see [Wu, et al., 2011] for details). beta.par = c(1, 1) corresponds to the unweighted SKAT. The formula:

\(Q=0.5\tilde{y}^{T}\Omega^{-1}K\Omega^{-1}\tilde{y}\)

is used to calculate score statistic, where \(\tilde{y}\) and \(\Omega\) are environmental residuals and covariance matrix obtained under the null hypothesis, respectively. Depending on the method option chosen, either Kuonen or Davies method is used to calculate P values from the score statistic Q. Both an Applied Statistics algorithm that inverts the characteristic function of the mixture chisq [Davies, 1980] and a saddlepoint approximation [Kuonen, 1999] are nearly exact, with the latter usually being a bit faster. For other kernel types, see [Wu, et al., 2011].

References

Svishcheva G.R., Belonogova N.M. and Axenovich T.I. (2014) FFBSKAT: Fast Family-Based Sequence Kernel Association Test. PLoS ONE 9(6): e99407. doi:10.1371/journal.pone.0099407 Davies R.B. (1980) Algorithm AS 155: The Distribution of a Linear Combination of chi-2 Random Variables, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 29, N 3, P. 323-333. Kuonen D. (1999) Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika, Vol. 86, No. 4, P. 929-935. Wu M.C., et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., Vol. 89, P. 82-93. Lee S., et al. (2012) Optimal unified approach for rare variant association testing with application to small sample case-control whole-exome sequencing studies. American Journal of Human Genetics, 91, 224-237.

Examples

Run this code

# NOT RUN {
data(example.data)  

## Run FFBSKAT with sliding window (default):
out <- FFBSKAT(trait ~ age + sex, phenodata, genodata, kin)

## Run FFBSKAT with regions defined in snpdata$gene and with
## null model obtained in first run:
out <- FFBSKAT(trait ~ age + sex, phenodata, genodata, kin,
out$nullmod, regions = snpdata$gene)

## Run FFBSKAT parallelized on two cores (this will require
## 'foreach' and 'doParallel' R-packages installed and
## cores available):
out <- FFBSKAT(trait ~ age + sex, phenodata, genodata, kin,
	out$nullmod, ncores = 2)

## Run FFBSKAT with genotypes in VCF format:
VCFfileName <- system.file(
	"testfiles/1000g.phase1.20110521.CFH.var.anno.vcf.gz",
	package = "FREGAT")
geneFile <- system.file("testfiles/refFlat_hg19_6col.txt.gz",
	package = "FREGAT")
phe <- data.frame(trait = rnorm(85))
out <- FFBSKAT(trait, phe, VCFfileName, geneFile = geneFile,
	reg = "CFH", annoType = "Nonsynonymous",
	flip.genotypes = TRUE)

## Run FFBSKAT with genotypes in PLINK binary data format:
bedFile <- system.file("testfiles/sample.bed",
	package = "FREGAT")
phe <- data.frame(trait = rnorm(120))
out <- FFBSKAT(trait, phe, bedFile)

# }

Run the code above in your browser using DataLab