extractTranscriptSeqs
extracts transcript (or CDS) sequences from
an object representing a single chromosome or a collection of chromosomes.
extractTranscriptSeqs(x, transcripts, ...)
"extractTranscriptSeqs"(x, transcripts, strand="+")
"extractTranscriptSeqs"(x, transcripts, ...)
x
can be a DNAString object
(single chromosome), or a BSgenome object (collection
of chromosomes). Other objects representing a collection of chromosomes are supported
(e.g. FaFile objects in the Rsamtools package)
as long as seqinfo
and
getSeq
work on them.
More precisely:
x
is a DNAString object, then
transcripts
must be an RangesList object.
x
is a BSgenome object or any object
representing a collection of chromosomes, then transcripts
must be a GRangesList object or any object
for which exonsBy
is implemented (e.g. a TxDb
object). If the latter, then it's first turned into a
GRangesList object with
exonsBy(transcripts, by="tx", ...)
.
Note that, for each transcript, the exons must be ordered by ascending rank, that is, by ascending position in the transcript (when going in the 5' to 3' direction). This generally means (but not always) that they are also ordered from 5' to 3' on the reference genome. More precisely:
If transcripts
was obtained with exonsBy
(see above),
then the exons are guaranteed to be ordered by ascending rank. See
?exonsBy
for more information.
For the default method, additional arguments are allowed only when
transcripts
is not a GRangesList object,
in which case they are passed to the internal call to exonsBy
(see above).
x
is a DNAString object. Can be an atomic vector, a factor, or an Rle object,
in which case it indicates the strand of each transcript (i.e. all the
exons in a transcript are considered to be on the same strand).
More precisely: it's turned into a factor (or factor-Rle)
that has the "standard strand levels" (this is done by calling the
strand
function on it). Then it's recycled
to the length of RangesList object transcripts
if needed. In the resulting object, the i-th element is interpreted
as the strand of all the exons in the i-th transcript.
strand
can also be a list-like object, in which case it indicates
the strand of each exon, individually. Thus it must have the same
shape as RangesList object transcripts
(i.e. same length plus strand[[i]]
must have the same length
as transcripts[[i]]
for all i
).
strand
can only contain "+"
and/or "-"
values.
"*"
is not allowed.
transcripts
, that is, the i-th element in it is the sequence
of the i-th transcript in transcripts
.
coverageByTranscript
for computing coverage by
transcript (or CDS) of a set of ranges.
transcriptLengths
for extracting the transcript
lengths from a TxDb object.
transcriptLocs2refLocs
function for converting
transcript-based locations into reference-based locations.
available.genomes
function in the
BSgenome package for checking avaibility of BSgenome
data packages (and installing the desired one).
translate
function in the
Biostrings package for translating DNA or RNA sequences
into amino acid sequences.
exonsBy
function for extracting exon ranges
grouped by transcript.
## ---------------------------------------------------------------------
## 1. A TOY EXAMPLE
## ---------------------------------------------------------------------
library(Biostrings)
## A chromosome of length 30:
x <- DNAString("ATTTAGGACACTCCCTGAGGACAAGACCCC")
## 2 transcripts on 'x':
tx1 <- IRanges(1, 8) # 1 exon
tx2 <- c(tx1, IRanges(12, 30)) # 2 exons
transcripts <- IRangesList(tx1=tx1, tx2=tx2)
extractTranscriptSeqs(x, transcripts)
## By default, all the exons are considered to be on the plus strand.
## We can use the 'strand' argument to tell extractTranscriptSeqs()
## to extract them from the minus strand.
## Extract all the exons from the minus strand:
extractTranscriptSeqs(x, transcripts, strand="-")
## Note that, for a transcript located on the minus strand, the exons
## should typically be ordered by descending position on the reference
## genome in order to reflect their rank in the transcript:
extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-")
## Extract the exon of the 1st transcript from the minus strand:
extractTranscriptSeqs(x, transcripts, strand=c("-", "+"))
## Extract the 2nd exon of the 2nd transcript from the minus strand:
extractTranscriptSeqs(x, transcripts, strand=list("-", c("+", "-")))
## ---------------------------------------------------------------------
## 2. A REAL EXAMPLE
## ---------------------------------------------------------------------
## Load a genome:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
## Load a TxDb object:
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)
## Check that 'txdb' is based on the hg19 assembly:
txdb
## Extract the exon ranges grouped by transcript from 'txdb':
transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
## Extract the transcript sequences from the genome:
tx_seqs <- extractTranscriptSeqs(genome, transcripts)
tx_seqs
## A sanity check:
stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts)))))
## Note that 'tx_seqs' can also be obtained with:
extractTranscriptSeqs(genome, txdb, use.names=TRUE)
## ---------------------------------------------------------------------
## 3. USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES
## ---------------------------------------------------------------------
cds <- cdsBy(txdb, by="tx", use.names=TRUE)
cds_seqs <- extractTranscriptSeqs(genome, cds)
cds_seqs
## A sanity check:
stopifnot(identical(width(cds_seqs), unname(sum(width(cds)))))
## Note that, alternatively, the CDS sequences can be obtained from the
## transcript sequences by removing the 5' and 3' UTRs:
tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)
stopifnot(identical(tx_lens$tx_name, names(tx_seqs))) # sanity
## Keep the rows in 'tx_lens' that correspond to a sequence in 'cds_seqs'
## and put them in the same order as in 'cds_seqs':
m <- match(names(cds_seqs), names(tx_seqs))
tx_lens <- tx_lens[m, ]
utr5_width <- tx_lens$utr5_len
utr3_width <- tx_lens$utr3_len
cds_seqs2 <- narrow(tx_seqs[m],
start=utr5_width+1L, end=-(utr3_width+1L))
stopifnot(identical(as.character(cds_seqs2), as.character(cds_seqs)))
## ---------------------------------------------------------------------
## 4. TRANSLATE THE CDS SEQUENCES
## ---------------------------------------------------------------------
prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve")
## Note that, by default, translate() uses The Standard Genetic Code to
## translate codons into amino acids. However, depending on the organism,
## a different genetic code might be needed to translate CDS sequences
## located on the mitochodrial chromosome. For example, for vertebrates,
## the following code could be used to correct 'prot_seqs':
SGC1 <- getGeneticCode("SGC1")
chrM_idx <- which(all(seqnames(cds) == "chrM"))
prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1,
if.fuzzy.codon="solve")
Run the code above in your browser using DataLab