enrichedChrRegions: Find chromosomal regions with a high concentration of hits.

Description

This function looks for chromosomal regions with a large accumulation of hits, e.g. significant peaks in a ChIP-seq experiment or differentially expressed genes in an RNA-seq or microarray experiment. Regions are found by computing the number of hits in a moving window and selecting regions based on an FDR cutoff.

Usage

enrichedChrRegions(hits1, hits2, chrLength, windowSize=10^4-1, fdr=0.05, nSims=10, mc.cores=1)

Arguments

hits1

Object containing hits (chromosome, start, and end). Can be a GRanges or RangedData object.

hits2

Optionally, another object containing hits. If specified, regions will be defined by comparing hits1 vs hits2.

chrLength

Named vector indicating the length of each chromosome in base pairs

windowSize

Size of the window used to smooth the hit count (see details)

fdr

Desired FDR level (see details)

nSims

Number of simulations to be used to estimate the FDR

mc.cores

Number of processors to be used in parallel computations (passed on to mclapply)

Value

Object of class GRanges (if input is GRanges) or RangedData (if input is RangedData) containing regions with smoothed hit count above the specified FDR level.

Methods

signature(hits1 = "GRanges", hits2 = "missing"), signature(hits1 = "RangedData", hits2 = "missing"): Look for chromosomal zones with a large number of hits reported in hits1.
signature(hits1 = "GRanges", hits2 = "GRanges"), signature(hits1 = "RangedData", hits2 = "RangedData"): Look for chromosomal zones with a different density of hits in hits1 vs hits2.

Details

A smoothed number of hits is computed by counting the number of hits in a moving window of size windowSize. Notice that only the mid-point of each hit in hits1 (and hits2 if specified) is used. That is, hits are not treated as intervals but as being located at a single base pair.
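The mid-point counting described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the helper name smoothedHitCount is hypothetical, and the sketch assumes a centered window and an odd windowSize.

```r
# Hypothetical helper, not part of the package: count hit mid-points in a
# centered moving window of `windowSize` base pairs along one chromosome.
# Assumes an odd windowSize and a window centered on each position.
smoothedHitCount <- function(start, end, chrLen, windowSize) {
  mid <- floor((start + end) / 2)            # each hit is reduced to its mid-point
  perBase <- tabulate(mid, nbins = chrLen)   # number of hits located at each base pair
  half <- windowSize %/% 2
  cs <- c(0, cumsum(perBase))                # cs[i + 1] = hits at positions 1..i
  pos <- seq_len(chrLen)
  lo <- pmax(pos - half, 1)                  # window is truncated at chromosome ends
  hi <- pmin(pos + half, chrLen)
  cs[hi + 1] - cs[lo]                        # hits in [lo, hi] for every position
}

# Toy input: three hits of 39 bp each on a 1000 bp chromosome
sm <- smoothedHitCount(start = c(100, 120, 900), end = c(138, 158, 938),
                       chrLen = 1000, windowSize = 99)
```

The cumulative sum keeps the computation linear in the chromosome length, which matters when the window spans thousands of base pairs.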

If hits2 is missing, regions with a large smoothed number of hits are selected. To assess statistical significance, we generate hits (also 1 base pair long) randomly distributed along the genome and compute the smoothed number of hits. The number of simulated hits is set equal to nrow(hits1). The process is repeated nSims times, resulting in nSims independent simulations. To estimate the FDR, several thresholds to define enriched chromosomal regions are considered. For each threshold t, we count the number of regions with score >=t in the observed data and in the simulations. The FDR is then estimated as the average number of regions with score >=t across simulations divided by the number of regions with score >=t in the observed data.
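The simulation-based FDR estimate for the one-sample case can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: the helpers windowCount and countRegions are hypothetical, simulated hits are drawn uniformly along a single chromosome, a region is taken to be a maximal run of consecutive positions with score >=t, and the estimate is capped at 1.

```r
# Hypothetical helpers for illustration only; not the package's internals.
# Smoothed hit count in a centered window of 2 * half + 1 base pairs.
windowCount <- function(mid, chrLen, half) {
  cs <- c(0, cumsum(tabulate(mid, nbins = chrLen)))
  pos <- seq_len(chrLen)
  cs[pmin(pos + half, chrLen) + 1] - cs[pmax(pos - half, 1)]
}

# Number of maximal runs of consecutive positions with score >= t
countRegions <- function(score, t) {
  r <- rle(score >= t)
  sum(r$values)
}

set.seed(1)
chrLen <- 10000; half <- 49; nSims <- 10

# Toy observed data: 80 background hits plus a cluster of 20 hits near 5000
obsMid <- c(sample(chrLen, 80), round(rnorm(20, 5000, 20)))
obs <- windowCount(obsMid, chrLen, half)

# nSims simulations, each with as many hits as the observed data
# (uniform placement is an assumption of this sketch)
sims <- replicate(
  nSims,
  windowCount(sample(chrLen, length(obsMid), replace = TRUE), chrLen, half),
  simplify = FALSE
)

# Estimated FDR at each candidate threshold
thresholds <- seq_len(max(obs))
fdrHat <- sapply(thresholds, function(t) {
  nObs <- countRegions(obs, t)                      # at least 1 for t <= max(obs)
  nSim <- mean(sapply(sims, countRegions, t = t))   # average over simulations
  min(1, nSim / nObs)                               # cap at 1 (assumption)
})
```

With a small nSims the estimate is coarse; increasing nSims reduces the simulation noise at the cost of run time.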

If hits2 is not missing, the difference in smoothed proportion of hits (i.e. the number of hits in the window divided by the overall number of hits) between the two groups is used as a test statistic. To assess statistical significance, we randomly scramble hits between sample 1 and sample 2 (maintaining the original number of hits in each sample) and re-compute the test statistic. The FDR for a given threshold t is estimated as the number of bases in the simulated data with test statistic >t divided by the number of bases in the observed data with test statistic >t.
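The two-sample permutation scheme can be sketched along the same lines. Again this is only an illustration: windowCount, propDiff and fdrAt are hypothetical names, a single permutation is used, and returning NA when no observed base exceeds t is a choice made for this sketch.

```r
# Hypothetical helpers for illustration only; not the package's internals.
windowCount <- function(mid, chrLen, half) {
  cs <- c(0, cumsum(tabulate(mid, nbins = chrLen)))
  pos <- seq_len(chrLen)
  cs[pmin(pos + half, chrLen) + 1] - cs[pmax(pos - half, 1)]
}

# Test statistic: difference in smoothed proportion of hits, sample 1 minus sample 2
propDiff <- function(mid1, mid2, chrLen, half) {
  windowCount(mid1, chrLen, half) / length(mid1) -
    windowCount(mid2, chrLen, half) / length(mid2)
}

set.seed(2)
chrLen <- 10000; half <- 49

# Toy data: sample 1 has a cluster near position 3000, sample 2 is background only
mid1 <- c(sample(chrLen, 150), round(rnorm(50, 3000, 30)))
mid2 <- sample(chrLen, 200)
obs <- propDiff(mid1, mid2, chrLen, half)

# Scramble hits between the two samples, keeping the original sample sizes
pooled <- c(mid1, mid2)
idx <- sample(length(pooled), length(mid1))
perm <- propDiff(pooled[idx], pooled[-idx], chrLen, half)

# Estimated FDR at threshold t: bases above t in permuted vs observed data
fdrAt <- function(t) {
  nObs <- sum(obs > t)
  if (nObs == 0) return(NA_real_)      # undefined when nothing is called (assumption)
  min(1, sum(perm > t) / nObs)         # cap at 1 (assumption)
}
fdrAt(0.05)
```

Permuting sample labels preserves the pooled hit positions, so any local excess shared by both samples does not count as a difference.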

The lowest t with estimated FDR below fdr is used to define enriched chromosomal regions.
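As a small worked illustration of this last step, the sketch below picks the lowest threshold whose estimated FDR is below fdr and converts the positions above it into start/end regions. All values are made up for illustration, and the variable names are hypothetical.

```r
# Made-up thresholds and FDR estimates, for illustration only
thresholds <- 1:6
fdrHat <- c(0.90, 0.55, 0.20, 0.04, 0.01, 0.00)
fdr <- 0.05

# Lowest threshold with estimated FDR below the target level
below <- which(fdrHat < fdr)
tSel <- if (length(below) > 0) thresholds[min(below)] else NA

# Made-up smoothed scores along a 10 bp stretch
score <- c(1, 2, 4, 5, 4, 1, 0, 6, 6, 2)

# Maximal runs of positions with score >= tSel become the enriched regions
r <- rle(score >= tSel)
ends <- cumsum(r$lengths)
starts <- ends - r$lengths + 1
regions <- data.frame(start = starts[r$values], end = ends[r$values])
regions
```

Choosing the lowest qualifying threshold gives the largest set of regions that still meets the requested FDR level.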

Examples


library(htSeqTools)  # assumed package providing enrichedChrRegions; also attaches GenomicRanges
set.seed(1)
st <- round(rnorm(100,500,100))
st[st>10000] <- 10000
strand <- rep(c('+','-'),each=50)
hits1 <- GRanges('chr1', IRanges(st,st+38),strand=strand)
chrLength <- c(chr1=10000)
enrichedChrRegions(hits1,chrLength=chrLength, windowSize=99, nSims=1)
