XStringSet-class: XStringSet objects

Description

The BStringSet class is a container for storing a set of BString objects and for making their manipulation easy and efficient.

Similarly, the DNAStringSet (or RNAStringSet, or AAStringSet) class is a container for storing a set of DNAString (or RNAString, or AAString) objects.

All those containers derive directly (and with no additional slots) from the XStringSet virtual class.

Usage

## Constructors:
BStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
DNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
RNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
AAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
## Accessor-like methods:
"width"(x)
"nchar"(x, type="chars", allowNA=FALSE)
## ... and more (see below)

Arguments

x

Either a character vector (with no NAs), or an XString, XStringSet or XStringViews object.

start,end,width

Either NA, a single integer, or an integer vector of the same length as x specifying how x should be "narrowed" (see ?narrow for the details).

use.names

TRUE or FALSE. Should names be preserved?

type,allowNA

Ignored.

Accessor-like methods

In the code snippets below, x is an XStringSet object.

: length(x): The number of sequences in x.
: width(x): A vector of non-negative integers containing the number of letters for each element in x. Note that width(x) is also defined for a character vector with no NAs and is equivalent to nchar(x, type="bytes").
: names(x): NULL or a character vector of the same length as x containing a short user-provided description or comment for each element in x. These are the only data in an XStringSet object that can safely be changed by the user. All the other data are immutable! As a general recommendation, the user should never try to modify an object by accessing its slots directly.
: alphabet(x): Return NULL, DNA_ALPHABET, RNA_ALPHABET or AA_ALPHABET depending on whether x is a BStringSet, DNAStringSet, RNAStringSet or AAStringSet object.
: nchar(x): The same as width(x).
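As an illustration of the accessors above, here is a minimal sketch. It assumes the Biostrings package is installed; the sequences and names (seqA, seqB, seqC, BamHI_site) are made up for the example.

```r
library(Biostrings)

# A small named DNAStringSet (names are taken from the character vector
# because use.names=TRUE is the default)
x <- DNAStringSet(c(seqA = "ACGT", seqB = "GGATCC", seqC = "TT"))

n <- length(x)   # number of sequences
w <- width(x)    # number of letters in each sequence
nc <- nchar(x)   # same values as width(x)
names(x)

# names(x) is the part of the object intended to be changed by the user
names(x)[2] <- "BamHI_site"

# alphabet() depends on the base type of the set
alphabet(x)                     # the DNA alphabet
alphabet(BStringSet("hello"))   # NULL for a BStringSet
```

Note that only names(x) is modified here; the sequence data itself is left untouched.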

Subsequence extraction and related transformations

In the code snippets below, x is a character vector (with no NAs), or an XStringSet (or XStringViews) object.

: subseq(x, start=NA, end=NA, width=NA): Applies subseq on each element in x. See ?subseq for the details. Note that this is similar to what substr does on a character vector. However there are some noticeable differences: (1) the arguments are start and stop for substr; (2) the SEW (start/end/width) interface of subseq is richer (e.g. support for negative start or end values); and (3) subseq checks that the specified start/end/width values are valid i.e., unlike substr, it throws an error if they define "out of limits" subsequences or subsequences with a negative width.
: narrow(x, start=NA, end=NA, width=NA, use.names=TRUE): Same as subseq. The only differences are: (1) narrow has a use.names argument; and (2) the kinds of objects they work on (IRanges, XStringSet or XStringViews objects for narrow, XVector or XStringSet objects for subseq). But they both work and do the same thing on an XStringSet object.
: threebands(x, start=NA, end=NA, width=NA): Like the method for IRanges objects, the threebands methods for character vectors and XStringSet objects extend the capability of narrow by returning the 3 sets of subsequences (the left, middle and right subsequences) associated with the narrowing operation. See ?threebands in the IRanges package for the details.
: subseq(x, start=NA, end=NA, width=NA) <- value: A vectorized version of the subseq<- method for XVector objects. See ?`subseq<-` for the details.
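The differences between subseq/narrow and substr described above can be sketched as follows. This is only an illustration on made-up input, assuming the Biostrings package is installed.

```r
library(Biostrings)

x <- BStringSet(c(first = "abcdefghij", second = "0123456789"))

# Drop the first 2 letters and the last letter of every sequence.
# A negative 'end' counts from the end of each sequence (-1 is the last letter).
sub1 <- subseq(x, start = 3, end = -2)
sub2 <- narrow(x, start = 3, end = -2)   # same result on an XStringSet
as.character(sub1)

# substr() silently truncates an "out of limits" request ...
substr("abcdefghij", 8, 12)

# ... whereas subseq() signals an error (captured here with try())
res <- try(subseq(x, start = 8, width = 5), silent = TRUE)
inherits(res, "try-error")
```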

Subsetting and appending

In the code snippets below, x and values are XStringSet objects, and i should be an index specifying the elements to extract.

: x[i]: Return a new XStringSet object made of the selected elements.
: x[[i]]: Extract the i-th XString object from x.
: append(x, values, after=length(x)): Add sequences in values to x.
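A short sketch of subsetting and appending, assuming the Biostrings package is installed (the sequences are made up for the example):

```r
library(Biostrings)

x <- DNAStringSet(c(a = "ACGT", b = "GGG", c = "TTAA"))
values <- DNAStringSet(c(new = "NNNN"))

# Single brackets return an XStringSet of the same class as x
sel <- x[2:3]

# Double brackets extract one element as an XString (here a DNAString)
elt <- x[[2]]

# Insert 'values' after the first element of x
combined <- append(x, values, after = 1)
as.character(combined, use.names = FALSE)
```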

Set operations

In the code snippets below, x and y are XStringSet objects.

: union(x, y): Union of x and y.
: intersect(x, y): Intersection of x and y.
: setdiff(x, y): Asymmetric set difference of x and y.
: setequal(x, y): Set equality of x to y.
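The set operations can be tried on a small example such as the one below (a sketch assuming the Biostrings package is installed). The order of the elements in the results is not something this example relies on.

```r
library(Biostrings)

x <- DNAStringSet(c("ACGT", "GGG", "ACGT"))   # contains a duplicate
y <- DNAStringSet(c("GGG", "TTT"))

u <- union(x, y)       # distinct sequences found in x or y
i <- intersect(x, y)   # sequences found in both
d <- setdiff(x, y)     # sequences found in x but not in y
same <- setequal(x, rev(x))   # order does not matter for set equality
```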

Other methods

In the code snippets below, x is an XStringSet object.

: unlist(x): Turns x into an XString object by combining the sequences in x together. Fast equivalent to do.call(c, as.list(x)).
: as.character(x, use.names=TRUE): Converts x to a character vector of the same length as x. The use.names argument controls whether or not names(x) should be propagated to the names of the returned vector.
: as.matrix(x, use.names=TRUE): Returns a character matrix containing the "exploded" representation of the strings. Can only be used on an XStringSet object with equal-width strings. The use.names argument controls whether or not names(x) should be propagated to the row names of the returned matrix.
: toString(x): Equivalent to toString(as.character(x)).
: show(x): By default the show method displays 5 head and 5 tail lines. The number of lines can be altered by setting the global options showHeadLines and showTailLines. If the object length is less than the sum of the options, the full object is displayed. These options affect GRanges, GappedAlignments, Ranges and XString objects.
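The coercion methods above can be illustrated with a minimal sketch, assuming the Biostrings package is installed (the names s1 and s2 are made up):

```r
library(Biostrings)

x <- DNAStringSet(c(s1 = "ACG", s2 = "TTA"))

# Concatenate all the sequences into a single DNAString
u <- unlist(x)

# Named character vector of the same length as x
chr <- as.character(x)

# "Exploded" representation: one row per sequence, one column per letter.
# Only works because both sequences have the same width.
m <- as.matrix(x)

# One comma-separated string
s <- toString(x)
```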

Details

The BStringSet, DNAStringSet, RNAStringSet and AAStringSet functions are constructors that can be used to turn input x into an XStringSet object of the desired base type.

They also allow the user to "narrow" the sequences contained in x via proper use of the start, end and/or width arguments. In this context, "narrowing" means dropping a prefix and/or a suffix of each sequence in x. The "narrowing" capabilities of these constructors can be illustrated by the following property: if x is a character vector (with no NAs), or an XStringSet (or XStringViews) object, then the 3 following transformations are equivalent:

BStringSet(x, start=mystart, end=myend, width=mywidth)

subseq(BStringSet(x), start=mystart, end=myend, width=mywidth)

BStringSet(subseq(x, start=mystart, end=myend, width=mywidth))

Note that, besides being more convenient, the first form is also more efficient on character vectors.

Examples

## ---------------------------------------------------------------------
## A. USING THE XStringSet CONSTRUCTORS ON A CHARACTER VECTOR OR FACTOR
## ---------------------------------------------------------------------
## Note that there is no XStringSet() constructor, but an XStringSet
## family of constructors: BStringSet(), DNAStringSet(), RNAStringSet(),
## etc...
x0 <- c("#CTC-NACCAGTAT", "#TTGA", "TACCTAGAG")
width(x0)
x1 <- BStringSet(x0)
x1

## 3 equivalent ways to obtain the same BStringSet object:
BStringSet(x0, start=4, end=-3)
subseq(x1, start=4, end=-3)
BStringSet(subseq(x0, start=4, end=-3))

dna0 <- DNAStringSet(x0, start=4, end=-3)
dna0
names(dna0)
names(dna0)[2] <- "seqB"
dna0

## When the input vector contains a lot of duplicates, turning it into
## a factor first before passing it to the constructor will produce an
## XStringSet object that is more compact in memory:
library(hgu95av2probe)
x2 <- sample(hgu95av2probe$sequence, 999000, replace=TRUE)
dna2a <- DNAStringSet(x2)
dna2b <- DNAStringSet(factor(x2))  # slower but result is more compact
object.size(dna2a)
object.size(dna2b)

## ---------------------------------------------------------------------
## B. USING THE XStringSet CONSTRUCTORS ON A SINGLE SEQUENCE (XString
##    OBJECT OR CHARACTER STRING)
## ---------------------------------------------------------------------
x3 <- "abcdefghij"
BStringSet(x3, start=2, end=6:2)  # behaves like 'substring(x3, 2, 6:2)'
BStringSet(x3, start=-(1:6))
x4 <- BString(x3)
BStringSet(x4, end=-(1:6), width=3)

## Randomly extract 1 million 40-mers from C. elegans chrI:
extractRandomReads <- function(subject, nread, readlength)
{
    if (!is.integer(readlength))
        readlength <- as.integer(readlength)
    start <- sample(length(subject) - readlength + 1L, nread,
                    replace=TRUE)
    DNAStringSet(subject, start=start, width=readlength)
}
library(BSgenome.Celegans.UCSC.ce2)
rndreads <- extractRandomReads(Celegans$chrI, 1000000, 40)
## Notes:
## - This takes only 2 or 3 seconds versus several hours for a solution
##   using substring() on a standard character string.
## - The short sequences in 'rndreads' can be seen as the result of a
##   simulated high-throughput sequencing experiment. A non-realistic
##   one though because:
##     (a) It assumes that the underlying technology is perfect (the
##         generated reads have no technology induced errors).
##     (b) It assumes that the sequenced genome is exactly the same as the
##         reference genome.
##     (c) The simulated reads can contain IUPAC ambiguity letters only
##         because the reference genome contains them. In a real
##         high-throughput sequencing experiment, the sequenced genome
##         of course doesn't contain those letters, but the sequencer
##         can introduce them in the generated reads to indicate ambiguous
##         base-calling.
##     (d) The simulated reads come from the plus strand only of a single
##         chromosome.
## - See the getSeq() function in the BSgenome package for how to
##   circumvent (d) i.e. how to generate reads that come from the whole
##   genome (plus and minus strands of all chromosomes).

## ---------------------------------------------------------------------
## C. USING THE XStringSet CONSTRUCTORS ON AN XStringSet OBJECT
## ---------------------------------------------------------------------
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
probes

RNAStringSet(probes, start=2, end=-5)  # does NOT copy the sequence data!

## ---------------------------------------------------------------------
## D. USING THE XStringSet CONSTRUCTORS ON AN ORDINARY list OF XString
##    OBJECTS
## ---------------------------------------------------------------------
probes10 <- head(probes, n=10)
set.seed(33)
shuffled_nucleotides <- lapply(probes10, sample)
shuffled_nucleotides

DNAStringSet(shuffled_nucleotides)  # does NOT copy the sequence data!

## Note that the same result can be obtained in a more compact way with
## just:
set.seed(33)
endoapply(probes10, sample)

## ---------------------------------------------------------------------
## E. USING subseq() ON AN XStringSet OBJECT
## ---------------------------------------------------------------------
subseq(probes, start=2, end=-5)

subseq(probes, start=13, end=13) <- "N"
probes

## Add/remove a prefix:
subseq(probes, start=1, end=0) <- "--"
probes
subseq(probes, end=2) <- ""
probes

## Do more complicated things:
subseq(probes, start=4:7, end=7) <- c("YYYY", "YYY", "YY", "Y")
subseq(probes, start=4, end=6) <- subseq(probes, start=-2:-5)
probes

## ---------------------------------------------------------------------
## F. UNLISTING AN XStringSet OBJECT
## ---------------------------------------------------------------------
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
unlist(probes)

## ---------------------------------------------------------------------
## G. COMPACTING AN XStringSet OBJECT
## ---------------------------------------------------------------------
## As a particular type of XVectorList objects, XStringSet objects can
## optionally be compacted. Compacting is done typically before
## serialization. See ?compact for more information.
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)

y <- subseq(probes[1:12], start=5)
probes@pool
y@pool
object.size(probes)
object.size(y)

y0 <- compact(y)
y0@pool
object.size(y0)
