STAAR procedure developed by Li et al. (2020) adapted to use on summary statistics, with extentions
sumSTAAR(score.file, gene.file, genes = 'all', cor.path = 'cor/',
tests = c('BT', 'SKAT', 'ACAT'), beta.par.matrix = rbind(c(1, 1), c(1, 25)),
prob.causal = 'all', phred = TRUE, n = NA, mac.threshold = NA,
approximation = TRUE, write.file = FALSE, staar.output = TRUE, quiet = FALSE)
name of data file generated by prep.score.files()
.
can be a name of a custom text file listing genes in refFlat format. Values "hg19" and "hg38"
can be set to use default gene files containing protein-coding genes with start and stop positions from corresponding build.
Mind that the same build should be used in score.file
and gene.file
.
character vector of gene names to be analyzed. Can be "chr1", "chr2" etc. to analyze
all genes on a corresponding chromosome. If not set, function will
attempt to analyze all genes listed in gene.file
.
path to a folder with correlation matrix files (one file per each gene to be analyzed).
Correlation matrices in text format are allowed, though ".RData" is preferable as computationally efficient.
Each file should contain a square matrix with correlation coefficients (r) between genetic variants
of a gene. An example of correlation file format:
"snpname1" "snpname2" "snpname3" ...
"snpname1" 1 0.018 -0.003 ...
"snpname2" 0.018 1 0.081 ...
"snpname3" -0.003 0.081 1 ...
...
If genotypes are available, matrices can be generated as follows:
cor.matrix <- cor(g)
save(cor.matrix, file = paste0(geneName, ".RData"))
where g
is a genotype matrix (nsample x nvariants) for a given gene with genotypes coded as 0, 1, 2
(coding should be exactly the same that was used to generate GWAS statistics).
If genotypes are not available, correlations can be approximated through reference samples.
Reference matrices from 1KG European sample can be downloaded at http://mga.bionet.nsc.ru/sumFREGAT/.
Names of correlation files should be constructed as "geneName.RData" (e.g. "ABCG1.RData", "ADAMTS1.RData", etc.)
for ".RData" format or "geneName.txt" for text format.
Example corfiles can be found as:
system.file("testfiles/CFH.cor", package = "sumFREGAT")
system.file("testfiles/CFH.RData", package = "sumFREGAT")
a character vector with gene-based methods to be applied. By default, three methods are used: 'BT', 'SKAT',
and 'ACAT'. Other weighted tests ('SKATO', 'PCA', 'FLM') can be included. Tests that do not imply
weighting ('simpleM', 'minP', 'sumchi'), if listed, will be calculated once and combined along with other tests
into the total sumSTAAR p-value. ACAT performs with gen.var.weights = "af"
for consistency with STAAR.
an (n x 2) matrix with rows corresponding to n beta.par
values used sequentially. Default value
rbind(c(1, 1), c(1, 25))
means that beta.par = c(1, 1)
and beta.par = c(1, 25)
will be applied. beta.par
are two positive numeric shape parameters in the beta distribution to assign weights
for each genetic variant as a function of minor allele frequency (MAF).
a character vector to define a set of annotations to be used. Annotations should be initially passed to
prep.score.files
(see prep.score.files()
function) as input file columns "PROB", "PROB1", "PROB2", "PROB3" etc.
The same naming should be used for prob.causal
argument. By default, all "PROB" columns will be used. Annotations
are expected to be in PHRED scale, set phred = FALSE
to use simple probabilities.
a logical value indicating whether probabilities are in PHRED scale. If TRUE
(default for sumSTAAR()
), PHRED-to-probability
transformation will be performed for values indicated in prob.causal
.
size of the sample on which summary statistics were obtained. Should be assigned if 'PCA' or 'FLM'
are included in tests
.
an integer number. In ACAT, scores with MAC <= 10 will be combined using Burden test. MACs are calculated
from MAFs, n
must be set to apply mac.threshold
. The original STAAR procedure performs with mac.threshold = 10
.
logical value indicating whether approximation should be used for SKAT, SKATO, PCA and FLM.
output file name. If specified, output for all tests (as it proceeds) will be written to corresponding files.
logical value indicating whether extensive output format should be used (see Details).
quiet = TRUE
suppresses excessive output from reading vcf file genewise.
With staar.output = FALSE
returns a data table with a single P value for each test (combinations of differently weighted and unweighted
iterations of the same test) and a total sumSTAAR P value. ACAT-O method is used to combine P values (Liu, Y. et al., 2019).
If staar.output = TRUE
the function returns a data frame of size (n.genes x (n.tests x n.beta.pars x (n.annotations + 1) + n.tests + 2)) containing P values
for combined tests and all individual tests. The output is analogous to that of original STAAR procedure. For example,
by default, the output will contain columns:
gene # gene symbol
BT.1.1.PROB0 # P value for burden test with beta.par = c(1, 1) and no annotations
BT.1.1.PROB1 # P value for burden test with beta.par = c(1, 1) and weighted by first annotation
BT.1.1.PROB2 # P value for burden test with beta.par = c(1, 1) and weighted by second annotation
...
BT.1.1.PROB10 # P value for burden test with beta.par = c(1, 1) and weighted by tenth annotation
BT.1.1.STAAR # combined P value of all burden tests with beta.par = c(1, 1)
BT.1.25.PROB0 # P value for burden test with beta.par = c(1, 25) and no annotations
BT.1.25.PROB1 # P value for burden test with beta.par = c(1, 25) and weighted by first annotation
...
BT.1.25.PROB10 # P value for burden test with beta.par = c(1, 25) and weighted by tenth annotation
BT.1.25.STAAR # combined P value of all burden tests with beta.par = c(1, 25)
SKAT.1.1.PROB0 # P value for SKAT with beta.par = c(1, 1) and no annotations
SKAT.1.1.PROB1 # P value for SKAT with beta.par = c(1, 1) and weighted by first annotation
...
SKAT.1.1.PROB10 # P value for SKAT with beta.par = c(1, 1) and weighted by tenth annotation
SKAT.1.1.STAAR # combined P value of all SKATs with beta.par = c(1, 1)
SKAT.1.25.PROB0 # P value for SKAT with beta.par = c(1, 25) and no annotations
...
SKAT.1.25.PROB10 # P value for SKAT with beta.par = c(1, 25) and weighted by tenth annotation
SKAT.1.25.STAAR # combined P value of all SKATs with beta.par = c(1, 25)
ACAT.1.1.PROB0 # P value for ACAT with beta.par = c(1, 1) and no annotations
ACAT.1.1.PROB1 # P value for ACAT with beta.par = c(1, 1) and weighted by first annotation
...
ACAT.1.1.PROB10 # P value for ACAT with beta.par = c(1, 1) and weighted by tenth annotation
ACAT.1.1.STAAR # combined P value of all ACATs with beta.par = c(1, 1)
ACAT.1.25.PROB0 # P value for ACAT with beta.par = c(1, 25) and no annotations
...
ACAT.1.25.PROB10 # P value for ACAT with beta.par = c(1, 25) and weighted by tenth annotation
ACAT.1.25.STAAR # combined P value of all ACATs with beta.par = c(1, 25)
sumSTAAR.ACAT_O # combined P value of all 'PROB0' gene-based tests (without weighting by annotations)
sumSTAAR.STAAR_O # combined P value of all gene-based tests
The STAAR procedure has been recently proposed by Li et al. (2020) and described in detail therein. It calculates
a set of P values using a range of gene-based tests, beta distribution weights parameters, and weighting by each of
10 functional annotations. The P values are then combined using Cauchy method (see ACATO()
function and Liu, Y. et al., (2019)).
Belonogova et al. (2022) SumSTAAR: A flexible framework for gene-based association studies using GWAS summary statistics. PLOS Comp Biol. https://doi.org/10.1371/journal.pcbi.1010172 Li X., et al. (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nature Genetics. DOI: 10.1038/s41588-020-0676-4. Liu Y. et al. (2019) ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 104, 410-421.
# NOT RUN {
cor.path <- system.file("testfiles/", package = "sumFREGAT")
score.file <- system.file("testfiles/CFH.prob.vcf.gz",
package = "sumFREGAT")
sumSTAAR(score.file, prob.causal = "PROB", gene.file = "hg19",
genes = "CFH", cor.path, quiet = TRUE)
# }
# NOT RUN {
score.file <- system.file("testfiles/CFH.prob.phred.vcf.gz",
package = "sumFREGAT")
res <- sumSTAAR(score.file, gene.file = "hg19", genes = "CFH",
cor.path, quiet = TRUE)
# }
Run the code above in your browser using DataLab