
detectBreakpoints(chimericReads, bpDist=100, minClusterSize=4, removeSoftClips=TRUE, bsGenome)
filterChimericReads
. The list must have the format as
defined by the scanBam
method.bsGenome
instance providing the reference
sequences. If missing, the library
BSgenome.Hsapiens.UCSC.hg19
is used by default.detectBreakpoints
returns an object of class
breakpoints
, which is a list of breakpoint clusters, which gives
access to all alignments and consensus breakpoints:IRanges
DataFrame
is
mainly a rearranged version of the alignment input in
chimericReads
. In addition, it shows the corresponding
breakpoints and local alignment coordinates.PairwiseAlignmentsSingleSubject
of the
Biostrings package that contains the alignment to the
(best) consensus reference sequence.commonAlign
and commonBps
,
alignedReads
is an instance of class AlignedRead
containing all aligned reads including their associated chromosomes, strands, and
positions. Since the reference is a chimeric sequence each read has
two chromosome and two strand entries.filterChimericReads
and before calling
mergeBreakpoints
. It first forms clusters of chimeric
reads (reads with exactly two local alignments) that span the same
breakpoint and than computes a consensus breakpoint sequence for each
cluster.
To carry out a hierarchical clustering, a measure for the distance
between two chimeric reads must be defined. If reads span different
chromosomes, their distance is set to infinity. The strand
information of the local alignments may also indicate that two chimeric
reads do not span the same breakpoint even if they span the same
chromosomes. For example, the first reads has two local alignments on
the positive strand whereas the second read has one local alignment on
the positive strand and the other on the negative strand. In this case,
the distance is set to inifinty, too. Finally, the distance measure
distinguishes between the two breakpoints (sometimes
called the pathogenic and the reciproce breakpoint) that originate from
the same structual variant. The distance between a read from the
pathogenic and a read from the reciproce breakpoint is infinity so that
two different clusters will emerge. These two related breakpoints can be
merge later using the mergeBreakpoints
method. We observed
that the breakpoints of these two cases often differ by a few ten or even
a few hundred basepairs.
If the chromosome and strand information between two reads $x$ and
$y$ are coherent, the Euclidian distance is used:
bpDist
to
obtain the final clusters. The bpDist
argument does usually not
influence the result, because we observed that reads spanning the same
breakpoint have very little variation (only a few base pairs) in their
local alignments due to sequencing errors or due to ambiguity caused by
same/similar sequence of both chromosome near the breakpoint.
Although the given set of reads may belong to the same chimeric DNA,
their individual breakpoints may differ in a few base pairs.
Furthermore, a single read may have more than one possible breakpoint
if a (small) part of the read was aligned to both parts.
The following step determines a consensus breakpoint for each cluster.
It uses the supplied bsGenome
to construct a
chimeric reference sequence for all possible breakpoints over all reads within
each cluster. After the reads were realigned to the chimeric reference
sequences, the one that yields the highest alignment score is taken to
represent best the chimeric DNA and its breakpoints.
As a preprocessing step, detectBreakpoints
offers to remove soft
clips occuring after the alignment:
Some reads may contain soft-clipped bases (e.g. linker sequences) at the beginning of the first part
of the read or at the end of the second part. By default,
detectBreakpoints
removes these unaligned subsequences
and adjusts the cigar string, the sequence, the sequence width
(qwidth) and the local start/end coordinates.filterChimericReads
mergeBreakpoints
plotChimericReads
# Construct a small example with three chimeric reads
# (=6 local alignments) in bam format as given by
# aligners such as BWA-SW.
# The first two reads originate from the same case but
# from different strands. The third read originate from
# the reciprocal breakpoint.
library("BSgenome.Scerevisiae.UCSC.sacCer2")
bamReads = list()
bamReads[[1]] = list(
qname=c("seq1", "seq1", "seq2", "seq2", "seq3", "seq3"),
flag = as.integer(c(0, 0, 16, 16, 0, 0)),
rname = factor(c("II", "III", "III", "II", "III", "II")),
strand = factor(c("+", "+", "-", "-", "+", "+")),
pos = as.integer(c(99951, 200000, 200000, 99951, 199950, 100001)),
qwidth = as.integer(c(100, 100, 100, 100, 100, 100)),
cigar = c("50M50S","50S50M","50S50M","50M50S","50M50S", "50S50M"),
seq = DNAStringSet(c(
paste(substr(Scerevisiae$chrII, start=99951, stop=100000),
substr(Scerevisiae$chrIII, start=200000, stop=200049),
sep=""),
paste(substr(Scerevisiae$chrII, start=99951, stop=100000),
substr(Scerevisiae$chrIII, start=200000, stop=200049),
sep=""),
paste(substr(Scerevisiae$chrIII, start=200000, stop=200049),
substr(Scerevisiae$chrII, start=99951, stop=100000),
sep=""),
paste(substr(Scerevisiae$chrIII, start=200000, stop=200049),
substr(Scerevisiae$chrII, start=99951, stop=100000),
sep=""),
paste(substr(Scerevisiae$chrIII, start=199950, stop=199999),
substr(Scerevisiae$chrII, start=100001, stop=100050),
sep=""),
paste(substr(Scerevisiae$chrIII, start=199950, stop=199999),
substr(Scerevisiae$chrII, start=100001, stop=100050),
sep="")))
)
bps = detectBreakpoints(bamReads, minClusterSize=1, bsGenome=Scerevisiae)
summary(bps)
table(bps)
mergeBreakpoints(bps)
Run the code above in your browser using DataLab