Learn R Programming

extraBinomial (version 2.1)

exbio: Extra-binomial approach for pooled sequencing data

Description

This funtion tests for differences in minor allele frequency between groups and is based on extra-binomial variation model for pooled sequencing data.

Usage

exbio(R, R.alt, cc, n, tol = 0.001, a.start = 1, b.start = 1, max.it = 1000, digits = NULL, model.maf = TRUE)

Arguments

R
A matrix with rows indexed by SNPs and columns by pools. The entries are counts of allele 1.
R.alt
A similarly formatted matrix containing the counts of allele 2.
cc
A case/control indicator vector with length = number of pools containing 0s (control pool) and 1s (case pool).
n
Number of chromosomes (twice the number of subjects) in each pooled sample.

tol
Maximum difference between coefficient values in successive glm before we can stop, the default=0.001.

a.start
An intial value for the parameter a in linear regression, the default=1.

b.start
An intial value for the parameter b in linear regression, the default=1.
max.it
Maximum iterations, the default=1000.

digits
How many significant digits are to be used for allele frequency and p-value. The default, 'NULL', uses 'getOption(digits)'.

model.maf
A logical value indicating whether to allow the modelled error structure to depend on allele frequency (the default) or just read depth. The default=TRUE.

Value

A list containing the following components:
result
a data.frame with three columns: the first shows the minor allele frequency of controls; the second shows the minor allele freqeuncy of cases; the third shows the p-value. Each row stands for a SNP.
parameters
a character vector indicating the values of the parameters a and b (and b2, b3 if model.maf=TRUE) in the linear regression and and the times of iteration.

Details

R and R.alt contain the read counts for the major allele and the alternative allele respectively and are required to have the same dimension.

The extra-binomial model defined: E(R/N)=p, Var(R/N)=p(1-p)(a/n+b/N) when N=R+R.alt

We denote: W=1/(a/n+b/N), which may be interpreted as the adjusted depth of pool j for SNP i. Given the expected quantities: E(r2)=1/W=a/n+b/N, the parameters a and b can be estimated by linear regression of r2 on 1/N, giving a/n as the intercept and b as the slope. If model.maf=TRUE, W=1/(a/n+b/N+b2*p+b3*p^2) and two additional parameters (b2 and b3) are estimated. This regression is carried out using generalized linear model (GLM) by first adopting Gaussian errors to estimate a relatively good start value of a and b, and then using these start values to do GLM with gamma errors and identity link because both a and b are positive. Since the estimated allele frequency p depends on a and b, the calculations are carried out iteratively.

A chi-square test is performed on a 2*2 table using the weighted allele counts to calculate the p-value.

References

Yang et al. "Extra-binomial variation approach for analysis of pooled DNA sequencing data", under review.

Examples

Run this code
R<-matrix(c(1409,1530,1490,1630,924,998,1000,1012),nrow=2,ncol=4,byrow=TRUE)
R.alt<-matrix(c(170,210,192,209,13,14,30,38),nrow=2,ncol=4,byrow=TRUE)
cc<-c(0,0,1,1)
n=96
exbio(R, R.alt, cc, n, max.it = 100, digits=3)
##=> p.value = 9.91e-01 for SNP1 and 4.01e-11 for SNP2,
##so association for SNP2 is established, but not for SNP1.

Run the code above in your browser using DataLab