sumSTAAR: variant-Set Test for Association using Annotation infoRmation on summary statistics

Description

STAAR procedure developed by Li et al. (2020) adapted to use on summary statistics, with extentions

Usage

sumSTAAR(score.file, gene.file, genes = 'all', cor.path = 'cor/',
tests = c('BT', 'SKAT', 'ACAT'), beta.par.matrix = rbind(c(1, 1), c(1, 25)), 
prob.causal = 'all', phred = TRUE, n = NA, mac.threshold = NA,
approximation = TRUE, write.file = FALSE, staar.output = TRUE, quiet = FALSE)

Arguments

score.file

name of data file generated by prep.score.files().

gene.file

can be a name of a custom text file listing genes in refFlat format. Values "hg19" and "hg38" can be set to use default gene files containing protein-coding genes with start and stop positions from corresponding build. Mind that the same build should be used in score.file and gene.file.

genes

character vector of gene names to be analyzed. Can be "chr1", "chr2" etc. to analyze all genes on a corresponding chromosome. If not set, function will attempt to analyze all genes listed in gene.file.

cor.path

path to a folder with correlation matrix files (one file per each gene to be analyzed). Correlation matrices in text format are allowed, though ".RData" is preferable as computationally efficient. Each file should contain a square matrix with correlation coefficients (r) between genetic variants of a gene. An example of correlation file format: "snpname1" "snpname2" "snpname3" ... "snpname1" 1 0.018 -0.003 ... "snpname2" 0.018 1 0.081 ... "snpname3" -0.003 0.081 1 ... ... If genotypes are available, matrices can be generated as follows: cor.matrix <- cor(g) save(cor.matrix, file = paste0(geneName, ".RData")) where g is a genotype matrix (nsample x nvariants) for a given gene with genotypes coded as 0, 1, 2 (coding should be exactly the same that was used to generate GWAS statistics). If genotypes are not available, correlations can be approximated through reference samples. Reference matrices from 1KG European sample can be downloaded at http://mga.bionet.nsc.ru/sumFREGAT/. Names of correlation files should be constructed as "geneName.RData" (e.g. "ABCG1.RData", "ADAMTS1.RData", etc.) for ".RData" format or "geneName.txt" for text format. Example corfiles can be found as: system.file("testfiles/CFH.cor", package = "sumFREGAT") system.file("testfiles/CFH.RData", package = "sumFREGAT")

tests

a character vector with gene-based methods to be applied. By default, three methods are used: 'BT', 'SKAT', and 'ACAT'. Other weighted tests ('SKATO', 'PCA', 'FLM') can be included. Tests that do not imply weighting ('simpleM', 'minP', 'sumchi'), if listed, will be calculated once and combined along with other tests into the total sumSTAAR p-value. ACAT performs with gen.var.weights = "af" for consistency with STAAR.

beta.par.matrix

an (n x 2) matrix with rows corresponding to n beta.par values used sequentially. Default value rbind(c(1, 1), c(1, 25)) means that beta.par = c(1, 1) and beta.par = c(1, 25) will be applied. beta.par are two positive numeric shape parameters in the beta distribution to assign weights for each genetic variant as a function of minor allele frequency (MAF).

prob.causal

a character vector to define a set of annotations to be used. Annotations should be initially passed to prep.score.files (see prep.score.files() function) as input file columns "PROB", "PROB1", "PROB2", "PROB3" etc. The same naming should be used for prob.causal argument. By default, all "PROB" columns will be used. Annotations are expected to be in PHRED scale, set phred = FALSE to use simple probabilities.

phred

a logical value indicating whether probabilities are in PHRED scale. If TRUE (default for sumSTAAR()), PHRED-to-probability transformation will be performed for values indicated in prob.causal.

size of the sample on which summary statistics were obtained. Should be assigned if 'PCA' or 'FLM' are included in tests.

mac.threshold

an integer number. In ACAT, scores with MAC <= 10 will be combined using Burden test. MACs are calculated from MAFs, n must be set to apply mac.threshold. The original STAAR procedure performs with mac.threshold = 10.

approximation

logical value indicating whether approximation should be used for SKAT, SKATO, PCA and FLM.

write.file

output file name. If specified, output for all tests (as it proceeds) will be written to corresponding files.

staar.output

logical value indicating whether extensive output format should be used (see Details).

quiet

quiet = TRUE suppresses excessive output from reading vcf file genewise.

Value

With staar.output = FALSE returns a data table with a single P value for each test (combinations of differently weighted and unweighted iterations of the same test) and a total sumSTAAR P value. ACAT-O method is used to combine P values (Liu, Y. et al., 2019).

If staar.output = TRUE the function returns a data frame of size (n.genes x (n.tests x n.beta.pars x (n.annotations + 1) + n.tests + 2)) containing P values for combined tests and all individual tests. The output is analogous to that of original STAAR procedure. For example, by default, the output will contain columns: gene # gene symbol BT.1.1.PROB0 # P value for burden test with beta.par = c(1, 1) and no annotations BT.1.1.PROB1 # P value for burden test with beta.par = c(1, 1) and weighted by first annotation BT.1.1.PROB2 # P value for burden test with beta.par = c(1, 1) and weighted by second annotation ... BT.1.1.PROB10 # P value for burden test with beta.par = c(1, 1) and weighted by tenth annotation BT.1.1.STAAR # combined P value of all burden tests with beta.par = c(1, 1) BT.1.25.PROB0 # P value for burden test with beta.par = c(1, 25) and no annotations BT.1.25.PROB1 # P value for burden test with beta.par = c(1, 25) and weighted by first annotation ... BT.1.25.PROB10 # P value for burden test with beta.par = c(1, 25) and weighted by tenth annotation BT.1.25.STAAR # combined P value of all burden tests with beta.par = c(1, 25) SKAT.1.1.PROB0 # P value for SKAT with beta.par = c(1, 1) and no annotations SKAT.1.1.PROB1 # P value for SKAT with beta.par = c(1, 1) and weighted by first annotation ... SKAT.1.1.PROB10 # P value for SKAT with beta.par = c(1, 1) and weighted by tenth annotation SKAT.1.1.STAAR # combined P value of all SKATs with beta.par = c(1, 1) SKAT.1.25.PROB0 # P value for SKAT with beta.par = c(1, 25) and no annotations ... SKAT.1.25.PROB10 # P value for SKAT with beta.par = c(1, 25) and weighted by tenth annotation SKAT.1.25.STAAR # combined P value of all SKATs with beta.par = c(1, 25) ACAT.1.1.PROB0 # P value for ACAT with beta.par = c(1, 1) and no annotations ACAT.1.1.PROB1 # P value for ACAT with beta.par = c(1, 1) and weighted by first annotation ... ACAT.1.1.PROB10 # P value for ACAT with beta.par = c(1, 1) and weighted by tenth annotation ACAT.1.1.STAAR # combined P value of all ACATs with beta.par = c(1, 1) ACAT.1.25.PROB0 # P value for ACAT with beta.par = c(1, 25) and no annotations ... ACAT.1.25.PROB10 # P value for ACAT with beta.par = c(1, 25) and weighted by tenth annotation ACAT.1.25.STAAR # combined P value of all ACATs with beta.par = c(1, 25) sumSTAAR.ACAT_O # combined P value of all 'PROB0' gene-based tests (without weighting by annotations) sumSTAAR.STAAR_O # combined P value of all gene-based tests

Details

The STAAR procedure has been recently proposed by Li et al. (2020) and described in detail therein. It calculates a set of P values using a range of gene-based tests, beta distribution weights parameters, and weighting by each of 10 functional annotations. The P values are then combined using Cauchy method (see ACATO() function and Liu, Y. et al., (2019)).

References

Belonogova et al. (2022) SumSTAAR: A flexible framework for gene-based association studies using GWAS summary statistics. PLOS Comp Biol. https://doi.org/10.1371/journal.pcbi.1010172 Li X., et al. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nature Genetics. DOI: 10.1038/s41588-020-0676-4. Liu Y. et al. (2019) ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 104, 410-421.

Examples

Run this code

# NOT RUN {
cor.path <- system.file("testfiles/", package = "sumFREGAT")
score.file <- system.file("testfiles/CFH.prob.vcf.gz",
	package = "sumFREGAT")
sumSTAAR(score.file, prob.causal = "PROB", gene.file = "hg19",
	genes = "CFH", cor.path, quiet = TRUE)

# }
# NOT RUN {
score.file <- system.file("testfiles/CFH.prob.phred.vcf.gz",
	package = "sumFREGAT")
res <- sumSTAAR(score.file, gene.file = "hg19", genes = "CFH",
	cor.path, quiet = TRUE)
# }

Run the code above in your browser using DataLab