
?available.genomes
for how to get the list of
"BSgenome data packages" curently available.
x
is a BSgenome object.
Note that, because the BSgenome class contains the
GenomeDescription class, then all the accessor methods
for GenomeDescription objects can also be used on
x
. sourceUrl(x)
x
can
be found (and downloaded).
seqnames(x)
, seqnames(x) <- value
x
.
Each single sequence is stored in a DNAString
or MaskedDNAString object and typically comes
from a source file (FASTA) with a single record.
The names returned by seqnames(x)
usually reflect the names
of those source files but a common prefix or suffix was eventually
removed in order to keep them as short as possible.
seqlengths(x)
x
. See ?`length,XVector-method`
and
?`length,MaskedXString-method`
for
the definition of the length of a DNAString
or MaskedDNAString object.
Note that the length of a masked sequence
(MaskedXString object) is not
affected by the current set of active masks but the nchar
method for MaskedXString objects is. names(seqlengths(x))
is guaranteed to be identical to
seqnames(x)
.
mseqnames(x)
x
.
Each multiple sequence is stored in a
DNAStringSet object and typically comes from
a source file (FASTA) with multiple records.
The names returned by mseqnames(x)
usually reflect the names
of those source files but a common prefix or suffix was eventually
removed in order to keep them as short as possible.
names(x)
x
.
This is the same as c(seqnames(x), mseqnames(x))
.
length(x)
x
, i.e., the total number of sequences
in it (single and multiple sequences). This is the same as
length(names(x))
.
x[[name]]
x
named name
(name
must be a single string).
No sequence is actually loaded into memory until this is explicitely
requested with a call to x[[name]]
or x$name
.
When loaded, a sequence is kept in a cache. It will be automatically
removed from the cache at garbage collection if it's not in use anymore
i.e. if there are no reference to it (other than the reference stored
in the cache). With options(verbose=TRUE)
, a message is printed
each time a sequence is removed from the cache.
x$name
x[[name]]
but name
is not evaluated and
therefore must be a literal character string or a name (possibly
backtick quoted).
masknames(x)
masknames(x)
gives the names of the masks in this collection.
Therefore the value returned by masknames(x)
is a character vector
made of the first N elements of c("AGAPS", "AMB", "RM", "TRF")
,
where N depends only on the BSgenome data package being looked at
(0 <= n="" <="4)." the="" man="" page="" for="" most="" bsgenome="" data="" packages="" should="" provide="" exact="" list="" and="" permanent="" urls="" of="" source="" files="" that="" were="" used="" to="" extract="" built-in="" masks.="" example,="" if="" you've="" installed="" bsgenome.hsapiens.ucsc.hg38="" package,="" load="" it="" see="" note="" section="" in="" ?`BSgenome.Hsapiens.UCSC.hg38`.
available.genomes
,
GenomeDescription-class,
BSgenome-utils,
DNAString-class,
DNAStringSet-class,
MaskedDNAString-class,
getSeq,BSgenome-method
,
injectSNPs
,
subseq,XVector-method,
rm
,
gc
## Loading a BSgenome data package doesn't load its sequences
## into memory:
library(BSgenome.Celegans.UCSC.ce2)
## Number of sequences in this genome:
length(Celegans)
## Display a summary of the sequences:
Celegans
## Index of single sequences:
seqnames(Celegans)
## Lengths (i.e. number of nucleotides) of the single sequences:
seqlengths(Celegans)
## Load chromosome I from disk to memory (hence takes some time)
## and keep a reference to it:
chrI <- Celegans[["chrI"]] # equivalent to Celegans$chrI
chrI
class(chrI) # a DNAString instance
length(chrI) # with 15080483 nucleotides
## Single sequence can be renamed:
seqnames(Celegans) <- sub("^chr", "", seqnames(Celegans))
seqlengths(Celegans)
Celegans$I
seqnames(Celegans) <- paste0("chr", seqnames(Celegans))
## Multiple sequences:
library(BSgenome.Rnorvegicus.UCSC.rn5)
rn5 <- BSgenome.Rnorvegicus.UCSC.rn5
rn5
seqnames(rn5)
rn5_chr1 <- rn5$chr1
mseqnames(rn5)
rn5_random <- Rnorvegicus$random
rn5_random
class(rn5_random) # a DNAStringSet instance
## Character vector containing the description lines of the first
## 4 sequences in the original FASTA file:
names(rn5_random)[1:4]
## ---------------------------------------------------------------------
## PASS-BY-ADDRESS SEMANTIC, CACHING AND MEMORY USAGE
## ---------------------------------------------------------------------
## We want a message to be printed each time a sequence is removed
## from the cache:
options(verbose=TRUE)
gc() # nothing seems to be removed from the cache
rm(rn5_chr1, rn5_random)
gc() # rn5_chr1 and rn5_random are removed from the cache (they are
# not in use anymore)
options(verbose=FALSE)
## Get the current amount of data in memory (in Mb):
mem0 <- gc()["Vcells", "(Mb)"]
system.time(rn5_chr2 <- rn5$chr2) # read from disk
gc()["Vcells", "(Mb)"] - mem0 # 'rn5_chr2' occupies 20Mb in memory
system.time(tmp <- rn5$chr2) # much faster! (sequence
# is in the cache)
gc()["Vcells", "(Mb)"] - mem0 # we're still using 20Mb (sequences
# have a pass-by-address semantic
# i.e. the sequence data are not
# duplicated)
## subseq() doesn't copy the sequence data either, hence it is very
## fast and memory efficient (but the returned object will hold a
## reference to 'rn5_chr2'):
y <- subseq(rn5_chr2, 10, 8000000)
gc()["Vcells", "(Mb)"] - mem0
## We must remove all references to 'rn5_chr2' before it can be
## removed from the cache (so the 20Mb of memory used by this
## sequence are freed):
options(verbose=TRUE)
rm(rn5_chr2, tmp)
gc()
## Remember that 'y' holds a reference to 'rn5_chr2' too:
rm(y)
gc()
options(verbose=FALSE)
gc()["Vcells", "(Mb)"] - mem0
Run the code above in your browser using DataLab