OrientNucleotides: Orient nucleotide sequences

Description

Orients nucleotide sequences to match the directionality and complementarity of specified reference sequences.

Usage

OrientNucleotides(myXStringSet, reference = which.max(width(myXStringSet)), type = "sequences", orientation = "all", threshold = 0.05, verbose = TRUE, processors = 1)

Arguments

myXStringSet

A DNAStringSet or RNAStringSet of unaligned sequences.

reference

The indices of the reference sequence(s), all of which must already be in the same (desired) orientation. By default, the first sequence with maximum width is used.

type

Character string indicating the type of results desired. This should be (an abbreviation of) one of "sequences", "orientations", or "both".

orientation

Character string(s) indicating the allowed reorientation(s) of non-reference sequences. This should be (an abbreviation of) one or more of "all", "reverse", "complement", and "both" (for reverse complement).

threshold

Numeric giving the decrease in k-mer distance required to adopt the alternative orientation.

verbose

Logical indicating whether to display progress.

processors

The number of processors to use, or NULL to automatically detect and use all available processors.

Value

OrientNucleotides can return two types of results: the relative orientations of sequences and/or the reoriented sequences. If type is "sequences" (the default) then the reoriented sequences are returned. If type is "orientations" then a character vector is returned that specifies, for each non-reference sequence, whether it was reversed ("r"), complemented ("c"), reverse complemented ("rc"), or left in its original orientation (""). Reference sequences are marked by NA in this vector. If type is "both" then the output is a list with the first component containing the "orientations" and the second component containing the "sequences".
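As a rough illustration of the return value described above, the following sketch requests both result types and tabulates the orientation codes. It assumes the DECIPHER package and its bundled Bacteria_175seqs.sqlite database are installed; the variable names (res, orientations, sequences) are chosen here for illustration only.

```r
library(DECIPHER)

# load the example sequences shipped with DECIPHER
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, remove="all", verbose=FALSE)

# flip the first ten sequences so there is something to reorient
DNA <- dna
DNA[1:10] <- reverseComplement(dna[1:10])

# type="both" returns a list: orientations first, sequences second
res <- OrientNucleotides(DNA, reference=170:175, type="both", verbose=FALSE)
orientations <- res[[1]] # character vector of "", "r", "c", "rc", or NA
sequences <- res[[2]] # the reoriented sequences

# count how many sequences fell into each category (NA = reference)
table(orientations, useNA="ifany")
```

Indexing the list by position rather than by name follows the wording above, which only specifies the order of the two components.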

Details

Biological sequences can sometimes have inconsistent orientation that interferes with their analysis. OrientNucleotides will reorient sequences by changing their directionality and/or complementarity to match specified reference sequences in the same set. The process works by finding the k-mer distance between the reference sequence(s) and each allowed orientation of the sequences. Alternative orientations that lessen the distance by at least threshold are adopted. Note that this procedure requires a moderately similar reference sequence be available for each sequence that needs to be reoriented. Sequences for which a corresponding reference is unavailable will most likely be left alone because alternative orientations will not pass the threshold. For this reason, it is recommended to specify several markedly different sequences as references.
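The decision rule described above can be sketched in base R for a single sequence and a single reference. This is only a toy illustration under stated assumptions, not DECIPHER's implementation: the helper names (rev_str, comp_str, kmers, kmer_dist, orient_one), the choice of k = 4, and the distance definition (one minus the fraction of shared distinct k-mers) are all hypothetical.

```r
# Hypothetical helpers for illustration; none of these are DECIPHER functions.
rev_str <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
comp_str <- function(s) chartr("ACGT", "TGCA", s)

# distinct k-mers of a string
kmers <- function(s, k) {
  stopifnot(nchar(s) >= k)
  n <- nchar(s) - k + 1
  unique(substring(s, seq_len(n), seq_len(n) + k - 1))
}

# Assumed distance: 1 - (shared distinct k-mers / size of the smaller k-mer set)
kmer_dist <- function(a, b, k = 4) {
  ka <- kmers(a, k)
  kb <- kmers(b, k)
  1 - length(intersect(ka, kb)) / min(length(ka), length(kb))
}

# Return "", "r", "c", or "rc": the reorientation to apply to x
orient_one <- function(x, ref, k = 4, threshold = 0.05) {
  d_current <- kmer_dist(x, ref, k)
  alternatives <- c(r = rev_str(x),
                    c = comp_str(x),
                    rc = rev_str(comp_str(x)))
  d_alt <- vapply(alternatives, kmer_dist, numeric(1), b = ref, k = k)
  best <- which.min(d_alt)
  # adopt the best alternative only if it lowers the distance by >= threshold
  if (d_current - d_alt[[best]] >= threshold) names(d_alt)[best] else ""
}

# toy usage: a reference and its reverse complement
ref <- "AAAACCACACCC"
x <- rev_str(comp_str(ref))
orient_one(x, ref)
```

With no sufficiently similar reference, none of the alternatives lowers the distance by the threshold, so the function returns "" and the sequence is left alone, which mirrors the behavior described above.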

Examples

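library(DECIPHER)

# load the example sequences shipped with DECIPHER
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, remove="all")
DNA <- dna # 175 sequences

# randomly reorient subsamples of the first 169 sequences
# (samples may overlap, in which case the later assignment wins)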
s <- sample(169, 30)
DNA[s] <- reverseComplement(dna[s])
s <- sample(169, 30)
DNA[s] <- reverse(dna[s])
s <- sample(169, 30)
DNA[s] <- complement(dna[s])

# use the last six (unmodified) sequences as references
DNA <- OrientNucleotides(DNA, reference=170:175)
all(DNA==dna) # TRUE if all were correctly reoriented