qCount: Quantify alignments

Description

Quantify alignments from sequencing data.

Usage

qCount(proj,  query, reportLevel=c(NULL, "gene", "exon","promoter","junction"), selectReadPosition=c("start", "end"), shift=0L, orientation=c("any", "same", "opposite"), useRead=c("any", "first", "last"), auxiliaryName=NULL, mask=NULL, collapseBySample=TRUE, includeSpliced=TRUE, includeSecondary=TRUE, mapqMin=0L, mapqMax=255L, absIsizeMin=NULL, absIsizeMax=NULL, maxInsertSize=500L, clObj=NULL)

Arguments

proj

a qProject object representing a sequencing experiment as returned by qAlign

query

an object of type GRanges, GRangesList or TxDb with the regions to be quantified. The type of query will determine the mode of quantification (see ‘Details’). For reportLevel="junction", query is ignored and can also be NULL.

reportLevel

level of quantification (query of type TxDb or NULL), one of

gene (default): one value per gene
exon: one value per exon
promoter: one value per promoter
junction: one count per detected exon-exon junction (query will be ignored in this case)

selectReadPosition

defines the part of the alignment that has to be contained within a query region to produce an overlap (see Details). Possible values are:

start (default): start of the alignment
end: end of the alignment

shift

controls the shifting alignments towards their 3'-end before quantification. shift can be one of:

an “integer” vector of the same length as the number of alignment files
a single “integer” value
the character string "halfInsert" (only available for paired-end experiments)

The default of 0 will not shift any alignments.

orientation

sets the required orientation of the alignments relative to the query region in order to be counted, one of:

any (default): count alignment on the same and opposite strand
same : count only alignment on the same strand
opposite : count only alignment on the opposite strand

useRead

For paired-end experiments, selects the read mate whose alignments should be counted, one of:

any (default): count all alignments
first : count only alignments from the first read
last : count only alignments from the last read

auxiliaryName

which bam files to use in an experiments with auxiliary alignments (see Details).

mask

If not NULL, a GRanges object with reference regions to be masked, i.e. excluded from the quantification, such as unmappable or highly repetitive regions (see Details).

collapseBySample

if TRUE (the default), sum alignment counts from bam files with the same sample name.

includeSpliced

if TRUE (the default), include spliced alignments when counting. A spliced alignment is defined as an alignment with a gap in the read of at least 60 bases.

includeSecondary

if TRUE (the default), include alignments with the secondary bit (0x0100) set in the FLAG when counting.

mapqMin

minimal mapping quality of alignments to be included when counting (mapping quality must be greater than or equal to mapqMin). Valid values are between 0 and 255. The default (0) will include all alignments.

mapqMax

maximal mapping quality of alignments to be included when counting (mapping quality must be less than or equal to mapqMax). Valid values are between 0 and 255. The default (255) will include all alignments.

absIsizeMin

For paired-end experiments, minimal absolute insert size (TLEN field in SAM Spec v1.4) of alignments to be included when counting. Valid values are greater than 0 or NULL (default), which will not apply any minimum insert size filtering.

absIsizeMax

For paired-end experiments, maximal absolute insert size (TLEN field in SAM Spec v1.4) of alignments to be included when counting. Valid values are greater than 0 or NULL (default), which will not apply any maximum insert size filtering.

maxInsertSize

Maximal fragment size of the paired-end experiment. This parameter is used if shift="halfInsert" and will ensure that query regions are made wide enough to emcompass all alignment pairs whose mid falls into the query region. The default value is 500 bases.

clObj

a cluster object to be used for parallel processing (see ‘Details’).

Value

A matrix with effective query regions width in the first column, and alignment counts in subsequent columns, or a GRanges object if reportLevel="junction".The effective query region width returned as first column in the matrix is calculated by the number of unique, non-masked bases in the reference sequence that contributed to the count of this query name (irrespective if the bases were covered by alignments or not). An effective width of zero indicates that the region was fully masked and will have zero counts in all samples.The alignment counts in the matrix are contained from column two onwards. For projects with allele-specific quantification, i.e. if a file with single nucleotide polymorphisms was supplied to the snpFile argument of qAlign, there will be three columns per bam file (number of alignments for Reference, Unknown and Alternative genotypes, with suffixed _R, _U and _A). Otherwise there is a single columns per bam file.If collapseBySample=TRUE, groups of bam files with identical sample name are combined by summing their alignment counts.For reportLevel="junction", the return value is a GRanges object. The start and end coordinates correspond to the first and last base in each detected intron. Plus- and minus-strand alignments are quantified separately, so that in an unstranded RNA-seq experiment, the same intron may be represented twice; once for each strand. The counts for each sample are contained in the mcols of the GRanges object.

Details

qCount is used to count alignments in each sample from a qProject object. The features to be quantified, together with the mode of quantification, are specified by the query argument, which is one of:

GRanges: Overlapping alignments are counted separately for each coordinate region. If multiple regions have identical names, their counts will be summed, counting each alignment only once even if it overlaps more than one of these regions. Alignments may be counted more than once if they overlap multiple regions that have different names. This mode is for example used to quantify ChIP-seq alignments in promoter regions, or gene expression levels in an RNA-seq experiment (using a query with exon regions named by gene).
GRangesList: Alignments are counted and summed for each list element in query if they overlap with any of the regions contained in the list element. The order of the list elements defines a hierarchy for quantification: Alignment will only be counted for the first element (the one with the lowest index in query) that they overlap, but not for any potential further list elements containing overlapping regions. This mode can be used to hierarchically and uniquely count (assign) each alignment to a one of several groups of regions (the elements in query), for example to estimate the fractions of different classes of RNA in an RNA-seq experiment (rRNA, tRNA, snRNA, snoRNA, mRNA, etc.)
TxDb: Used to extract regions from annotation and report alignment counts depending on the value of reportLevel. If reportLevel="exon", alignments overlapping each exon in query are counted. If reportLevel="gene", alignment counts for all exons of a gene will be summed, counting each alignment only once even if it overlaps multiple annotated exons of a gene. These are useful to calculate exon or gene expression levels in RNA-seq experiments based on the annotation in a TxDb object. If reportLevel="promoter", the promoters function from package GenomicFeatures is used with default arguments to extract promoter regions around transcript start sites, e.g. to quantify alignments inf a ChIP-seq experiment.
any of the above or NULL for reportLevel="junction": The query argument is ignored if reportLevel is set to "junction", and qCount will count the number of alignments supporting each exon-exon junction detected in any of the samples in proj. The arguments selectReadPosition, shift, orientation, useRead and mask will have no effect in this quantification mode.

The additional arguments allow to fine-tune the quantification:

selectReadPosition defines the part of the alignment that has to be contained within a query region for an overlap. The values start (default) and end refer to the biological start (5'-end) and end (3'-end) of the alignment. For example, the start of an alignment on the plus strand is its leftmost (lowest) base, and the end of an alignment on the minus strand is also the leftmost base.

shift allows on-the-fly shifting of alignments towards their 3'-end prior to overlap determination and counting. This can be helpful to increase resolution of ChIP-seq experiments by moving alignments by half the immuno-precipitated fragment size towards the middle of fragments. shift is either an “integer” vector with one value per alignment file in proj, or a single “integer” value, in which case all alignment files will be shifted by the same value. For paired-end experiments, it can be alternatively set to "halfInsert", which will estimate the true fragment size from the distance between aligned read pairs and shift the alignments accordingly. orientation controls the interpretation of alignment strand when counting, relative to the strand of the query region. any will count all overlapping alignments, irrespective of the alignment strand (e.g. used in an unstranded RNA-seq experiment). same will only count the alignments on the same strand as the query region (e.g. in a stranded RNA-seq experiment), and opposite will only count the alignments on the opposite strand from the query region (e.g. to quantify anti-sense transcription in a stranded RNA-seq experiment). includeSpliced and includeSecondary can be used to include or exclude spliced or secondary alignments, respectively. mapqMin and mapqMax allow to select alignments based on their mapping qualities. mapqMin and mapqMax can take integer values between 0 and 255 and equal to $-10 log10 Pr(mapping position is wrong)$, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

In paired-end experiments, useRead allows to quantify either all alignments (useRead="any"), or only the first (useRead="first") or last (useRead="last") read from a read pair or read group. Note that for useRead="any" (the default), an alignment pair that is fully contained within a query region will contribute two counts to the value of that region. absIsizeMin and absIsizeMax can be used to select alignments based on their insert size (TLEN field in SAM Spec v1.4).

auxiliaryName selects the reference sequence for which alignments should be quantified. NULL (the default) will select alignments against the genome. If set to a character string that matches one of the auxiliary target names (as specified in the auxiliaryFile argument of qAlign), the corresponding alignments will be counted.

mask can be used to specify a GRanges object with regions in the reference sequence to be excluded from quantification. The regions will be considered unstranded (strand="*"). Alignments that overlap with a region in mask will not be counted. Masking may reduce the effective width of query regions reported by qCount, even down to zero for regions that are fully contained in mask.

If clObj is set to an object that inherits from class cluster, for example an object returned by makeCluster from package parallel, the quantification task is split into multiple chunks and processed in parallel using clusterMap. Currently, not all tasks will be efficiently parallelized: For example, a single query region and a single (group of) bam files will not be split into multiple chunks.

Examples

Run this code

library(GenomicRanges)
library(Biostrings)
library(Rsamtools)

# copy example data to current working directory
file.copy(system.file(package="QuasR", "extdata"), ".", recursive=TRUE)


# load genome sequence
genomeFile <- "extdata/hg19sub.fa"
gseq <- readDNAStringSet(genomeFile)
chrRegions <- GRanges(names(gseq), IRanges(start=1,width=width(gseq),names=names(gseq)))


# create alignments (paired-end experiment)
sampleFile <- "extdata/samples_rna_paired.txt"
proj <- qAlign(sampleFile, genomeFile, splicedAlignment=TRUE)


# count reads using a "GRanges" query
qCount(proj, query=chrRegions)
qCount(proj, query=chrRegions, useRead="first")


# hierarchical counting using a "GRangesList" query
library(rtracklayer)
annotationFile <- "extdata/hg19sub_annotation.gtf"
gtfRegions <- import.gff(annotationFile, format="gtf", feature.type="exon")
names(gtfRegions) <- mcols(gtfRegions)$source
gtfRegionList <- split(gtfRegions, names(gtfRegions))
names(gtfRegionList)

res3 <- qCount(proj, gtfRegionList)
res3


# gene expression levels using a "TxDb" query
library("GenomicFeatures")
genomeRegion <- scanFaIndex(genomeFile)
chrominfo <- data.frame(chrom=as.character(seqnames(genomeRegion)),
                        length=end(genomeRegion),
                        is_circular=rep(FALSE, length(genomeRegion)))
txdb <- makeTxDbFromGFF(annotationFile, 
                        format="gtf", 
                        chrominfo=chrominfo,
                        dataSource="Ensembl modified",
                        organism="Homo sapiens")

res4 <- qCount(proj, txdb, reportLevel="gene")
res4

# exon-exon junctions
res5 <- qCount(proj, NULL, reportLevel="junction")
res5