enrichedRegions: Find significantly enriched regions in sequencing experiments.

Description

Find regions with a significant accumulation of reads in a sequencing experiment.

Usage

enrichedRegions(sample1, sample2, regions, minReads=10, mappedreads,
pvalFilter=0.05, exact=FALSE, p.adjust.method='none', twoTailed=FALSE,
mc.cores=1)

Arguments

sample1

Either the start and end of the sequences in sample 1 (IRangesList, RangedData or IRanges object), or a RangedDataList with the sequences for all samples (sample2 must be left missing in this case).

sample2

Same for sample 2. Can be left missing.

regions

If specified, the analysis is restricted to the regions indicated in regions. If not specified, the regions are automatically defined using the argument minReads.

minReads

This argument is only used when regions is not specified. The regions to be tested for enrichment are those with coverage greater than or equal to minReads. If sample1 is a RangedDataList, the overall coverage adding all samples is used. Otherwise, if twoTailed is FALSE, only the reads in sample 1 are counted. If twoTailed is TRUE, the sum of the reads in samples 1 and 2 is counted.

mappedreads

Number of mapped reads for the sample. Must be of class integer. It is used to compute RPKM.

pvalFilter

Only regions with P-value below pvalFilter are reported as being enriched.

exact

If set to TRUE, an exact test is used whenever some expected cell counts are 5 or less (chi-square test based on permutations if sample1 is a RangedDataList object, Fisher's exact test otherwise), i.e. when the asymptotic chi-square/likelihood-ratio test calculations break down. Ignored if sample2 is missing, as in this case calculations are always exact.

p.adjust.method

P-value adjustment method, passed on to p.adjust.

twoTailed

If set to FALSE, only regions with a higher concentration of reads in sample 1 than in sample 2 are reported. If set to TRUE, regions with higher concentration of sample 2 reads are also reported. Ignored if sample2 is missing.

mc.cores

If mc.cores is greater than 1, computations are performed in parallel for each element in the IRangesList objects. Whenever possible the mclapply function is used, so exactly mc.cores cores are used. For some signatures mclapply cannot be used, in which case the parallel function from package parallel is used. Note: the latter option launches as many parallel processes as there are elements in x, which can place strong demands on the processor and memory.
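
The effect of p.adjust.method in combination with pvalFilter can be illustrated with base R alone. The sketch below uses made-up P-values and assumes that the filter is applied to the adjusted P-values, which this page does not state explicitly.

```r
# Hypothetical raw P-values for five candidate regions (illustration only)
pvals <- c(0.001, 0.008, 0.02, 0.041, 0.2)
pvalFilter <- 0.05

# method = "none" (the default in enrichedRegions) leaves P-values unchanged
raw <- p.adjust(pvals, method = "none")

# Benjamini-Hochberg adjustment, one of the methods accepted by p.adjust
bh <- p.adjust(pvals, method = "BH")

# Assumption: a region is kept when its (adjusted) P-value is below pvalFilter
n_raw <- sum(raw < pvalFilter)
n_bh <- sum(bh < pvalFilter)
```

With these values the adjustment removes one region that would pass the filter on its raw P-value.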

Value

Object of class RangedData indicating the significantly enriched regions, the number of reads in each sample for those regions, the fold changes (adjusted considering the overall number of sequences in each sample) and the chi-square test P-values.

Methods

signature(sample1 = "missing", sample2 = "missing", regions = "RangedData"): ranges(regions) indicates the chromosome, start and end of genomic regions, while values(regions) should indicate the observed number of reads for each group in each region. enrichedRegions tests the null hypothesis that the proportion of reads in the region is equal across all groups via a likelihood-ratio test (or permutation-based chi-square for regions where the expected counts are below 5 for some group).
signature(sample1 = "RangedDataList", sample2 = "missing", regions = "missing"): Each element in sample1 contains the read start/end of an individual sample. enrichedRegions identifies regions with high concentration of reads (across all samples) and then compares the counts across groups using a likelihood-ratio test (or permutation-based chi-square for regions where the expected counts are below 5 for some group).
signature(sample1 = "RangedData", sample2 = "RangedData", regions = "missing"): space(sample1) indicates the chromosome, start(sample1) and end(sample1) the start/end position of the reads. Similarly for sample2. enrichedRegions identifies regions with high concentration of reads (across all samples) and then compares the counts across groups using a likelihood-ratio test (or permutation-based chi-square for regions where the expected counts are below 5 for some group).
signature(sample1 = "RangedData", sample2 = "missing", regions = "missing"): space(sample1) indicates the chromosome, start(sample1) and end(sample1) the start/end position of the reads. enrichedRegions uses an exact binomial test to assess whether an unusually high proportion of reads has been observed in the region.

Details

The calculations depend on whether sample2 is missing or not.

Non-missing sample2 case. First, regions with coverage above minReads are selected. Second, the number of reads falling in the selected regions is computed for sample 1 and sample 2. Third, the counts are compared via a chi-square test (with Yates continuity correction), which takes into account the total number of sequences in each sample. Finally, statistically significant regions are selected and returned in RangedData or RangedDataList objects.
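
For a single region, the two-sample comparison can be approximated with base R. The sketch below uses hypothetical counts, builds a 2x2 table of reads inside and outside the region, and switches to Fisher's exact test when some expected count is 5 or less, following the description of the exact argument. The helper testRegion and its fold-change formula are assumptions made for illustration and are not taken from the package source.

```r
# Compare one region between two samples (helper and inputs are hypothetical)
testRegion <- function(in1, in2, n1, n2, exact = FALSE) {
  # Rows are samples, columns are reads inside / outside the region
  tab <- matrix(c(in1, n1 - in1, in2, n2 - in2), nrow = 2, byrow = TRUE,
                dimnames = list(c("sample1", "sample2"), c("inside", "outside")))
  # Expected cell counts under equal proportions in both samples
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  useExact <- exact && any(expected <= 5)
  pval <- if (useExact) {
    fisher.test(tab)$p.value
  } else {
    # correct = TRUE applies the Yates continuity correction
    chisq.test(tab, correct = TRUE)$p.value
  }
  # Assumed fold change: ratio of read proportions, adjusting for depth
  list(pvalue = pval, exactUsed = useExact,
       foldChange = (in1 / n1) / (in2 / n2))
}

big <- testRegion(in1 = 120, in2 = 60, n1 = 1000, n2 = 1000)
small <- testRegion(in1 = 7, in2 = 1, n1 = 1000, n2 = 1000, exact = TRUE)
```

Here big uses the chi-square test because all expected counts are large, while small has expected counts of 4 inside the region and therefore uses the exact test.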

Missing sample2 case. First, regions with coverage above minReads are selected. Second, the number of reads in sample 1 falling in the selected regions is computed. Third, the proportion of reads in each region is tested for enrichment via a one-tailed exact binomial test.
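
A one-tailed exact binomial test of this kind can be sketched with binom.test from base R. The counts are hypothetical, and the null proportion p0 is an assumed value chosen for illustration. This page does not say how the package derives the null proportion.

```r
# Hypothetical inputs (illustration only)
n <- 1000L    # total number of reads in sample 1
x <- 40L      # reads falling in the region under test
p0 <- 0.02    # ASSUMED proportion of reads expected in the region under the null

# One-tailed test: is the observed proportion larger than p0?
res <- binom.test(x, n, p = p0, alternative = "greater")
pval <- res$p.value

# The same P-value from the binomial upper tail, P(X >= x)
pvalTail <- pbinom(x - 1, n, p0, lower.tail = FALSE)
```

With 40 reads observed where 20 would be expected, the P-value falls well below the default pvalFilter of 0.05.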

Examples

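library(htSeqTools)  # provides enrichedRegions

set.seed(1)
# Sample 1: 1000 reads of width 39 with start positions centred around 500
st <- round(rnorm(1000,500,100))
strand <- rep(c('+','-'),each=500)
space <- rep('chr1',length(st))
sample1 <- RangedData(IRanges(st,st+38),strand=strand,space=space)
# Sample 2: same layout, with start positions centred around 1000
st <- round(rnorm(1000,1000,100))
sample2 <- RangedData(IRanges(st,st+38),strand=strand,space=space)
# Report regions enriched in either sample
enrichedRegions(sample1,sample2,twoTailed=TRUE)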
