cigar-utils: CIGAR utility functions

Description

Utility functions for low-level CIGAR manipulation.

Usage

## -=-= Supported CIGAR operations =-=-
CIGAR_OPS
## -=-= Transform CIGARs into other useful representations =-=-
explodeCigarOps(cigar, ops=CIGAR_OPS)
explodeCigarOpLengths(cigar, ops=CIGAR_OPS)
cigarToRleList(cigar)
## -=-= Summarize CIGARs =-=-
cigarOpTable(cigar)
## -=-= From CIGARs to ranges =-=-
cigarRangesAlongReferenceSpace(cigar, flag=NULL, N.regions.removed=FALSE, pos=1L, f=NULL, ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE, with.ops=FALSE)
cigarRangesAlongQuerySpace(cigar, flag=NULL, before.hard.clipping=FALSE, after.soft.clipping=FALSE, ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE, with.ops=FALSE)
cigarRangesAlongPairwiseSpace(cigar, flag=NULL, N.regions.removed=FALSE, dense=FALSE, ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE, with.ops=FALSE)
extractAlignmentRangesOnReference(cigar, pos=1L, drop.D.ranges=FALSE, f=NULL)
## -=-= From CIGARs to sequence lengths =-=-
cigarWidthAlongReferenceSpace(cigar, flag=NULL, N.regions.removed=FALSE)
cigarWidthAlongQuerySpace(cigar, flag=NULL, before.hard.clipping=FALSE, after.soft.clipping=FALSE)
cigarWidthAlongPairwiseSpace(cigar, flag=NULL, N.regions.removed=FALSE, dense=FALSE)
## -=-= Narrow CIGARs =-=-
cigarNarrow(cigar, start=NA, end=NA, width=NA)
cigarQNarrow(cigar, start=NA, end=NA, width=NA)
## -=-= Translate coordinates between query and reference spaces =-=-
queryLoc2refLoc(qloc, cigar, pos=1L)
queryLocs2refLocs(qlocs, cigar, pos=1L, flag=NULL)

Arguments

cigar

A character vector or factor containing the extended CIGAR strings. It can be of arbitrary length except for queryLoc2refLoc which only accepts a single CIGAR (as a character vector or factor of length 1).

ops

Character vector containing the extended CIGAR operations to actually consider. Zero-length operations or operations not listed ops are ignored.

flag

NULL or an integer vector containing the SAM flag for each read.

According to the SAM Spec v1.4, flag bit 0x4 is the only reliable place to tell whether a segment (or read) is mapped (bit is 0) or not (bit is 1). If flag is supplied, then cigarRangesAlongReferenceSpace, cigarRangesAlongQuerySpace, cigarRangesAlongPairwiseSpace, and extractAlignmentRangesOnReference don't produce any range for unmapped reads i.e. they treat them as if their CIGAR was empty (independently of what their CIGAR is). If flag is supplied, then cigarWidthAlongReferenceSpace, cigarWidthAlongQuerySpace, and cigarWidthAlongPairwiseSpace return NAs for unmapped reads.

N.regions.removed

TRUE or FALSE. If TRUE, then cigarRangesAlongReferenceSpace and cigarWidthAlongReferenceSpace report ranges/widths with respect to the "reference" space from which the N regions have been removed, and cigarRangesAlongPairwiseSpace and cigarWidthAlongPairwiseSpace report them with respect to the "pairwise" space from which the N regions have been removed.

pos

An integer vector containing the 1-based leftmost position/coordinate for each (eventually clipped) read sequence. Must have length 1 (in which case it's recycled to the length of cigar), or the same length as cigar.

NULL or a factor of length cigar. If NULL, then the ranges are grouped by alignment i.e. the returned IRangesList object has 1 list element per element in cigar. Otherwise they are grouped by factor level i.e. the returned IRangesList object has 1 list element per level in f and is named with those levels.

For example, if f is a factor containing the chromosome for each read, then the returned IRangesList object will have 1 list element per chromosome and each list element will contain all the ranges on that chromosome.

drop.empty.ranges

Should empty ranges be dropped?

reduce.ranges

Should adjacent ranges coming from the same cigar be merged or not? Using TRUE can significantly reduce the size of the returned object.

with.ops

TRUE or FALSE indicating whether the returned ranges should be named with their corresponding CIGAR operation.

before.hard.clipping

TRUE or FALSE. If TRUE, then cigarRangesAlongQuerySpace and cigarWidthAlongQuerySpace report ranges/widths with respect to the "query" space to which the H regions have been added. before.hard.clipping and after.soft.clipping cannot both be TRUE.

after.soft.clipping

TRUE or FALSE. If TRUE, then cigarRangesAlongQuerySpace and cigarWidthAlongQuerySpace report ranges/widths with respect to the "query" space from which the S regions have been removed. before.hard.clipping and after.soft.clipping cannot both be TRUE.

dense

TRUE or FALSE. If TRUE, then cigarRangesAlongPairwiseSpace and cigarWidthAlongPairwiseSpace report ranges/widths with respect to the "pairwise" space from which the I, D, and N regions have been removed. N.regions.removed and dense cannot both be TRUE.

drop.D.ranges

Should the ranges corresponding to a deletion from the reference (encoded with a D in the CIGAR) be dropped? By default we keep them to be consistent with the pileup tool from SAMtools. Note that, when drop.D.ranges is TRUE, then Ds and Ns in the CIGAR are equivalent.

start,end,width

Vectors of integers. NAs and negative values are accepted and "solved" according to the rules of the SEW (Start/End/Width) interface (see ?solveUserSEW for the details).

qloc

An integer vector containing "query-based locations" i.e. 1-based locations relative to the query sequence stored in the SAM/BAM file.

qlocs

A list of the same length as cigar where each element is an integer vector containing "query-based locations" i.e. 1-based locations relative to the corresponding query sequence stored in the SAM/BAM file.

Value

CIGAR_OPS is a predefined character vector containing the supported extended CIGAR operations: M, I, D, N, S, H, P, =, X. See p. 4 of the SAM Spec v1.4 at http://samtools.sourceforge.net/ for the list of extended CIGAR operations and their meanings.For explodeCigarOps and explodeCigarOpLengths: Both functions return a list of the same length as cigar where each list element is a character vector (for explodeCigarOps) or an integer vector (for explodeCigarOpLengths). The 2 lists have the same shape, that is, same length() and same elementNROWS(). The i-th character vector in the list returned by explodeCigarOps contains one single-letter string per CIGAR operation in cigar[i]. The i-th integer vector in the list returned by explodeCigarOpLengths contains the corresponding CIGAR operation lengths. Zero-length operations or operations not listed in ops are ignored.For cigarToRleList: A CompressedRleList object.For cigarOpTable: An integer matrix with number of rows equal to the length of cigar and nine columns, one for each extended CIGAR operation.For cigarRangesAlongReferenceSpace, cigarRangesAlongQuerySpace, cigarRangesAlongPairwiseSpace, and extractAlignmentRangesOnReference: An IRangesList object (more precisely a CompressedIRangesList object) with 1 list element per element in cigar. However, if f is a factor, then the returned IRangesList object can be a SimpleIRangesList object (instead of CompressedIRangesList), and in that case, has 1 list element per level in f and is named with those levels.For cigarWidthAlongReferenceSpace and cigarWidthAlongPairwiseSpace: An integer vector of the same length as cigar where each element is the width of the alignment with respect to the "reference" and "pairwise" space, respectively. More precisely, for cigarWidthAlongReferenceSpace, the returned widths are the lengths of the alignments on the reference, N gaps included (except if N.regions.removed is TRUE). NAs or "*" in cigar will produce NAs in the returned vector.For cigarWidthAlongQuerySpace: An integer vector of the same length as cigar where each element is the length of the corresponding query sequence as inferred from the CIGAR string. Note that, by default (i.e. if before.hard.clipping and after.soft.clipping are FALSE), this is the length of the query sequence stored in the SAM/BAM file. If before.hard.clipping or after.soft.clipping is TRUE, the returned widths are the lengths of the query sequences before hard clipping or after soft clipping. NAs or "*" in cigar will produce NAs in the returned vector.For cigarNarrow and cigarQNarrow: A character vector of the same length as cigar containing the narrowed cigars. In addition the vector has an "rshift" attribute which is an integer vector of the same length as cigar. It contains the values that would need to be added to the POS field of a SAM/BAM file as a consequence of this cigar narrowing.For queryLoc2refLoc: An integer vector of the same length as qloc containing the "reference-based locations" (i.e. the 1-based locations relative to the reference sequence) corresponding to the "query-based locations" passed in qloc.For queryLocs2refLocs: A list of the same length as qlocs where each element is an integer vector containing the "reference-based locations" corresponding to the "query-based locations" passed in the corresponding element in qlocs.

References

http://samtools.sourceforge.net/

Examples

Run this code

## ---------------------------------------------------------------------
## A. CIGAR_OPS, explodeCigarOps(), explodeCigarOpLengths(),
##    cigarToRleList(), and cigarOpTable()
## ---------------------------------------------------------------------

## Supported CIGAR operations:
CIGAR_OPS

## Transform CIGARs into other useful representations:
cigar1 <- "3H15M55N4M2I6M2D5M6S"
cigar2 <- c("40M2I9M", cigar1, "2S10M2000N15M", "3H33M5H")

explodeCigarOps(cigar2)
explodeCigarOpLengths(cigar2)
explodeCigarOpLengths(cigar2, ops=c("I", "S"))
cigarToRleList(cigar2)

## Summarize CIGARs:
cigarOpTable(cigar2)

## ---------------------------------------------------------------------
## B. From CIGARs to ranges and to sequence lengths
## ---------------------------------------------------------------------

## CIGAR ranges along the "reference" space:
cigarRangesAlongReferenceSpace(cigar1, with.ops=TRUE)[[1]]

cigarRangesAlongReferenceSpace(cigar1,
                               reduce.ranges=TRUE, with.ops=TRUE)[[1]]

ops <- setdiff(CIGAR_OPS, "N")

cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]

cigarRangesAlongReferenceSpace(cigar1, ops=ops,
                               reduce.ranges=TRUE, with.ops=TRUE)[[1]]

ops <- setdiff(CIGAR_OPS, c("D", "N"))

cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]

cigarWidthAlongReferenceSpace(cigar1)

pos2 <- c(1, 1001, 1,  351)

cigarRangesAlongReferenceSpace(cigar2, pos=pos2, with.ops=TRUE)

res1a <- extractAlignmentRangesOnReference(cigar2, pos=pos2)
res1b <- cigarRangesAlongReferenceSpace(cigar2,
                               pos=pos2,
                               ops=setdiff(CIGAR_OPS, "N"),
                               reduce.ranges=TRUE)
stopifnot(identical(res1a, res1b))

res2a <- extractAlignmentRangesOnReference(cigar2, pos=pos2,
                               drop.D.ranges=TRUE)
res2b <- cigarRangesAlongReferenceSpace(cigar2,
                               pos=pos2,
                               ops=setdiff(CIGAR_OPS, c("D", "N")),
                               reduce.ranges=TRUE)
stopifnot(identical(res2a, res2b))

seqnames <- factor(c("chr6", "chr6", "chr2", "chr6"),
                   levels=c("chr2", "chr6"))
extractAlignmentRangesOnReference(cigar2, pos=pos2, f=seqnames)

## CIGAR ranges along the "query" space:
cigarRangesAlongQuerySpace(cigar2, with.ops=TRUE)
cigarWidthAlongQuerySpace(cigar1)
cigarWidthAlongQuerySpace(cigar1, before.hard.clipping=TRUE)

## CIGAR ranges along the "pairwise" space:
cigarRangesAlongPairwiseSpace(cigar2, with.ops=TRUE)
cigarRangesAlongPairwiseSpace(cigar2, dense=TRUE, with.ops=TRUE)

## ---------------------------------------------------------------------
## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE
## ---------------------------------------------------------------------
## The information stored in a BAM file can be used to compute the
## "coverage" of the mapped reads i.e. the number of reads that hit any
## given position in the reference genome.
## The following function takes the path to a BAM file and returns an
## object representing the coverage of the mapped reads that are stored
## in the file. The returned object is an RleList object named with the
## names of the reference sequences that actually receive some coverage.

flag0 <- scanBamFlag(isUnmappedQuery=FALSE, isDuplicate=FALSE)

extractCoverageFromBAM <- function(bamfile)
{
  stopifnot(is(bamfile, "BamFile"))
  ## This ScanBamParam object allows us to load only the necessary
  ## information from the file.
  param <- ScanBamParam(flag=flag0, what=c("rname", "pos", "cigar"))
  bam <- scanBam(bamfile, param=param)[[1]]
  ## Note that unmapped reads and reads that are PCR/optical duplicates
  ## have already been filtered out by using the ScanBamParam object
  ## above.
  f <- factor(bam$rname, levels=seqlevels(bamfile))
  irl <- extractAlignmentRangesOnReference(bam$cigar, pos=bam$pos, f=f)
  coverage(irl, width=seqlengths(bamfile))
}

library(Rsamtools)
f1 <- system.file("extdata", "ex1.bam", package="Rsamtools")
cvg <- extractCoverageFromBAM(BamFile(f1))

## extractCoverageFromBAM() is equivalent but slightly more efficient
## than loading a GAlignments object and computing its coverage:
cvg2 <- coverage(readGAlignments(f1, param=ScanBamParam(flag=flag0)))
stopifnot(identical(cvg, cvg2))

## ---------------------------------------------------------------------
## D. cigarNarrow() and cigarQNarrow()
## ---------------------------------------------------------------------

## cigarNarrow():
cigarNarrow(cigar1)  # only drops the soft/hard clipping
cigarNarrow(cigar1, start=10)
cigarNarrow(cigar1, start=15)
cigarNarrow(cigar1, start=15, width=57)
cigarNarrow(cigar1, start=16)
#cigarNarrow(cigar1, start=16, width=55)  # ERROR! (empty cigar)
cigarNarrow(cigar1, start=71)
cigarNarrow(cigar1, start=72)
cigarNarrow(cigar1, start=75)

## cigarQNarrow():
cigarQNarrow(cigar1, start=4, end=-3)
cigarQNarrow(cigar1, start=10)
cigarQNarrow(cigar1, start=19)
cigarQNarrow(cigar1, start=24)

## ---------------------------------------------------------------------
## E. PERFORMANCE
## ---------------------------------------------------------------------

if (interactive()) {
  ## We simulate 20 millions aligned reads, all 40-mers. 95% of them
  ## align with no indels. 5% align with a big deletion in the
  ## reference. In the context of an RNAseq experiment, those 5% would
  ## be suspected to be "junction reads".
  set.seed(123)
  nreads <- 20000000L
  njunctionreads <- nreads * 5L / 100L
  cigar3 <- character(nreads)
  cigar3[] <- "40M"
  junctioncigars <- paste(
      paste(10:30, "M", sep=""),
      paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""),
      paste(30:10, "M", sep=""), sep="")
  cigar3[sample(nreads, njunctionreads)] <- junctioncigars
  some_fake_rnames <- paste("chr", c(1:6, "X"), sep="")
  rname <- factor(sample(some_fake_rnames, nreads, replace=TRUE),
                  levels=some_fake_rnames)
  pos <- sample(80000000L, nreads, replace=TRUE)

  ## The following takes < 3 sec. to complete:
  system.time(irl1 <- extractAlignmentRangesOnReference(cigar3, pos=pos))

  ## The following takes < 4 sec. to complete:
  system.time(irl2 <- extractAlignmentRangesOnReference(cigar3, pos=pos,
                                                        f=rname))

  ## The sizes of the resulting objects are about 240M and 160M,
  ## respectively:
  object.size(irl1)
  object.size(irl2)
}

Run the code above in your browser using DataLab