XStringSet-comparison: Comparing and ordering the elements in one or more XStringSet objects

Description

Methods for comparing and ordering the elements in one or more XStringSet objects.

Arguments

pcompare() and related methods

In the code snippets below, x and y are XStringSet objects.

: pcompare(x, y): Performs element-wise (aka "parallel") comparison of x and y, that is, returns an integer vector where the i-th element is less than, equal to, or greater than zero if the i-th element in x is considered to be respectively less than, equal to, or greater than the i-th element in y. If x and y don't have the same length, then the shorter is recycled to the length of the longer (the standard recycling rules apply).
: x == y, x != y, x <= y, x >= y, x < y, x > y: Equivalent to pcompare(x, y) == 0, pcompare(x, y) != 0, pcompare(x, y) <= 0, pcompare(x, y) >= 0, pcompare(x, y) < 0, and pcompare(x, y) > 0, respectively.
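
As a small illustration of the element-wise comparison described above, the following sketch compares two short DNAStringSet objects. The sequences are made up for this example, and only the sign of each value returned by pcompare() is inspected, since the description above only guarantees the sign.

```r
library(Biostrings)

# Made-up sequences, chosen so that each pair compares differently
x <- DNAStringSet(c("AAA", "ACG", "TT"))
y <- DNAStringSet(c("AAC", "ACG", "G"))

# Only the sign of each value is meaningful: < 0, == 0, or > 0
cmp_sign <- sign(pcompare(x, y))
cmp_sign   # -1 0 1

# The comparison operators are shorthands for tests on pcompare(x, y)
eq <- x == y   # FALSE TRUE FALSE
lt <- x < y    # TRUE FALSE FALSE

# A length-1 object is recycled along the longer one
rec_sign <- sign(pcompare(x, DNAStringSet("ACG")))
rec_sign   # -1 0 1
```

The third pair shows that the order is lexicographic rather than length-based: "TT" is greater than "G" because T sorts after G.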

order() and related methods

In the code snippets below, x is an XStringSet object.

: is.unsorted(x, strictly=FALSE): Return a logical value specifying whether x is unsorted. The strictly argument takes a logical value indicating whether the check should be for _strictly_ increasing values.
: order(x, decreasing=FALSE): Return a permutation which rearranges x into ascending or descending order.
: rank(x, ties.method=c("first", "min")): Rank x in ascending order.
: sort(x, decreasing=FALSE): Sort x into ascending or descending order.
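
The ordering methods listed above can be tried on a tiny object. This is a minimal sketch with made-up sequences; all elements are distinct, so ties do not come into play.

```r
library(Biostrings)

# Made-up, pairwise distinct sequences
dna <- DNAStringSet(c("TC", "AAA", "G", "CAAC"))

# Permutation that rearranges dna into ascending order
o <- order(dna)
o   # 2 4 3 1

# Rank of each element in ascending order
r <- rank(dna, ties.method="first")
r   # 4 1 3 2

# sort() returns the rearranged object itself
sorted <- sort(dna)
as.character(sorted)   # "AAA" "CAAC" "G" "TC"

is.unsorted(dna)      # TRUE
is.unsorted(sorted)   # FALSE
```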

duplicated() and unique()

In the code snippets below, x is an XStringSet object.

: duplicated(x): Return a logical vector whose elements denote duplicates in x.
: unique(x): Return the subset of x made of its unique elements.
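
A short sketch of both methods, reusing the small DNAStringSet object from the Examples section of this page:

```r
library(Biostrings)

dna <- DNAStringSet(c("AAA", "TC", "", "TC", "AAA", "CAAC", "G"))

# TRUE marks an element that already occurred earlier in dna
dup <- duplicated(dna)
dup   # FALSE FALSE FALSE TRUE TRUE FALSE FALSE

# unique() keeps the first occurrence of each sequence
uniq <- unique(dna)
as.character(uniq)   # "AAA" "TC" "" "CAAC" "G"
```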

match() and %in%

In the code snippets below, x and table are XStringSet objects.

: match(x, table, nomatch=NA_integer_): Returns an integer vector containing, for each element in x, the position of its first identical match in table, or nomatch if there is none.
: x %in% table: Returns a logical vector indicating which elements in x match identically with an element in table.
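
The following sketch shows both operations on small made-up objects. The names query and tbl are chosen for this illustration only.

```r
library(Biostrings)

tbl   <- DNAStringSet(c("AAA", "TC", "", "TC", "AAA", "CAAC", "G"))
query <- DNAStringSet(c("", "G", "AA", "TC"))

# Position of the first identical match in tbl, NA when there is none
m <- match(query, tbl)
m   # 3 7 NA 2

# TRUE for each element of query that occurs somewhere in tbl
found <- query %in% tbl
found   # TRUE TRUE FALSE TRUE
```

Matching is exact: "AA" is not found even though it is a prefix of "AAA", and "TC" maps to position 2, its first occurrence.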

Details

Element-wise (aka "parallel") comparison of 2 XStringSet objects is based on the lexicographic order between 2 BString, DNAString, RNAString, or AAString objects.

For DNAStringSet and RNAStringSet objects, the letters in the respective alphabets (i.e. DNA_ALPHABET and RNA_ALPHABET) are ordered based on a predefined code assigned to each letter. The code assigned to each letter can be retrieved with:

  dna_codes <- as.integer(DNAString(paste(DNA_ALPHABET, collapse="")))
  names(dna_codes) <- DNA_ALPHABET

  rna_codes <- as.integer(RNAString(paste(RNA_ALPHABET, collapse="")))
  names(rna_codes) <- RNA_ALPHABET

Note that this order does NOT depend on the locale in use. Also note that comparing DNA sequences with RNA sequences is supported and in that case T and U are considered to be the same letter.
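
To illustrate the DNA versus RNA comparison mentioned above, here is a minimal sketch with made-up sequences. It relies only on the documented behavior that T and U are treated as the same letter.

```r
library(Biostrings)

dna <- DNAStringSet(c("ACGT", "TTA"))
rna <- RNAStringSet(c("ACGU", "UUG"))

# "ACGT" and "ACGU" compare as equal; "TTA" and "UUG" differ at the
# third letter only
same <- pcompare(dna, rna) == 0
same   # TRUE FALSE
```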

For BStringSet and AAStringSet objects, the alphabetical order is defined by the C collation. Note that, at the moment, AAStringSet objects are treated like BStringSet objects i.e. the alphabetical order is NOT defined by the order of the letters in AA_ALPHABET. This might change at some point.
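
As a sketch of what the C collation means in practice for BStringSet objects: letters are ordered by their byte values, so all uppercase ASCII letters sort before all lowercase ones, whatever the locale. The input below is made up for this illustration.

```r
library(Biostrings)

b <- BStringSet(c("b", "a", "B", "A"))

# Byte values: "A" = 65, "B" = 66, "a" = 97, "b" = 98
sorted_b <- as.character(sort(b))
sorted_b   # "A" "B" "a" "b"
```

This can differ from sort() on an ordinary character vector, where the result depends on the locale in use.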

Examples


## ---------------------------------------------------------------------
## A. SIMPLE EXAMPLES
## ---------------------------------------------------------------------

library(Biostrings)

dna <- DNAStringSet(c("AAA", "TC", "", "TC", "AAA", "CAAC", "G"))
match(c("", "G", "AA", "TC"), dna)

library(drosophila2probe)
fly_probes <- DNAStringSet(drosophila2probe)
sum(duplicated(fly_probes))  # 481 duplicated probes

is.unsorted(fly_probes)  # TRUE
fly_probes <- sort(fly_probes)
is.unsorted(fly_probes)  # FALSE
is.unsorted(fly_probes, strictly=TRUE)  # TRUE, because of duplicates
is.unsorted(unique(fly_probes), strictly=TRUE)  # FALSE

## Nb of probes that are the reverse complement of another probe:
nb1 <- sum(reverseComplement(fly_probes) %in% fly_probes)
stopifnot(identical(nb1, 455L))  # 455 probes

## Probes shared between drosophila2probe and hgu95av2probe:
library(hgu95av2probe)
human_probes <- DNAStringSet(hgu95av2probe)
m <- match(fly_probes, human_probes)
stopifnot(identical(sum(!is.na(m)), 493L))  # 493 shared probes

## ---------------------------------------------------------------------
## B. AN ADVANCED EXAMPLE
## ---------------------------------------------------------------------
## We want to compare the first 5 bases with the 5 last bases of each
## probe in drosophila2probe. More precisely, we want to compute the
## percentage of probes for which the first 5 bases are the reverse
## complement of the 5 last bases.

library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)

first5 <- narrow(probes, end=5)
last5 <- narrow(probes, start=-5)
nb2 <- sum(first5 == reverseComplement(last5))
stopifnot(identical(nb2, 17L))

## Percentage:
100 * nb2 / length(probes)  # 0.0064 %

## If the probes were random DNA sequences, a probe would have 1 chance
## out of 4^5 to have this property so the percentage would be:
100 / 4^5  # 0.098 %

## With randomly generated probes:
set.seed(33)
random_dna <- sample(DNAString(paste(DNA_BASES, collapse="")),
                     sum(width(probes)), replace=TRUE)
random_probes <- successiveViews(random_dna, width(probes))
random_probes
random_probes <- as(random_probes, "XStringSet")
random_probes

random_first5 <- narrow(random_probes, end=5)
random_last5 <- narrow(random_probes, start=-5)

nb3 <- sum(random_first5 == reverseComplement(random_last5))
100 * nb3 / length(random_probes)  # 0.099 %
