AlignSeqs: Align a Set of Unaligned Sequences

Description

Performs profile-to-profile alignment of multiple unaligned sequences following a guide tree.

Usage

AlignSeqs(myXStringSet,
          guideTree = NULL,
          iterations = 1,
          refinements = 1,
          gapOpening = c(-16, -12),
          gapExtension = c(-2, -1),
          structures = NULL,
          FUN = AdjustAlignment,
          levels = c(0.95, 0.7, 10, 5),
          processors = 1,
          verbose = TRUE,
          ...)

Arguments

myXStringSet

An AAStringSet, DNAStringSet, or RNAStringSet object of unaligned sequences.

guideTree

Either NULL or a data.frame giving the ordered tree structure in which to align profiles. If NULL then a guide tree will be automatically constructed based on the order of shared k-mers.

iterations

Number of iteration steps to perform. During each iteration step the guide tree is regenerated based on the alignment and the sequences are realigned.

refinements

Number of refinement steps to perform. During each refinement step, groups of sequences are realigned to the rest of the sequences, and the better of the two alignments (before and after realignment) is kept.

gapOpening

Single numeric giving the cost for opening a gap in the alignment, or two numbers giving the minimum and maximum costs. In the latter case the cost will be varied depending upon whether the groups of sequences being aligned are nearly identical or maximally distant.

gapExtension

Single numeric giving the cost for extending an open gap in the alignment, or two numbers giving the minimum and maximum costs. In the latter case the cost will be varied depending upon whether the groups of sequences being aligned are nearly identical or maximally distant.
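
As a sketch of the two accepted forms for the gap costs, the calls below pass a single value and then a minimum/maximum pair. The subset of 20 sequences and the single values -14 and -1.5 are arbitrary choices for illustration, not recommended settings.

```r
library(DECIPHER)

# Example database shipped with DECIPHER (same file as in the Examples section)
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, remove="all", verbose=FALSE)
dna <- dna[1:20]  # small subset, chosen only to keep the example fast

# One fixed cost for every profile-to-profile alignment (values are illustrative)
fixed <- AlignSeqs(dna, gapOpening=-14, gapExtension=-1.5, verbose=FALSE)

# Minimum and maximum costs; these are the defaults shown in Usage
ranged <- AlignSeqs(dna, gapOpening=c(-16, -12), gapExtension=c(-2, -1), verbose=FALSE)

# Compare the resulting alignment lengths
unique(width(fixed))
unique(width(ranged))
```

Both calls return an XStringSet in which every sequence has the same width; whether the two widths differ depends on the input sequences.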

structures

Either a list of secondary structure probabilities matching the structureMatrix, such as that output by PredictHEC, or NULL to generate the structures automatically. Only applicable if myXStringSet is an AAStringSet.

FUN

A function to be applied after each profile-to-profile alignment. (See details section below.)

levels

Numeric with four elements specifying the levels above which to apply FUN. (See details section below.)

processors

The number of processors to use, or NULL to automatically detect and use all available processors.

verbose

Logical indicating whether to display progress.

...

Further arguments to be passed directly to AlignProfiles, including perfectMatch, misMatch, gapPower, terminalGap, restrict, anchor, normPower, substitutionMatrix, and structureMatrix.
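
A minimal sketch of forwarding an argument through `...` follows. It assumes that terminalGap accepts a single numeric, and the value -5 is an arbitrary illustration rather than a recommended setting; consult the AlignProfiles help page for the meaning and default of each argument in your installed version.

```r
library(DECIPHER)

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, remove="all", verbose=FALSE)
dna <- dna[1:20]  # small subset, chosen only to keep the example fast

# terminalGap is not a formal argument of AlignSeqs, so it is passed on
# to AlignProfiles through "..." (the value -5 is an assumption for illustration)
aligned <- AlignSeqs(dna, terminalGap=-5, verbose=FALSE)
unique(width(aligned))
```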

Value

An XStringSet of aligned sequences.

Details

The profile-to-profile method aligns a sequence set by merging profiles along a guide tree until all the input sequences are aligned. This process has three main steps:

(1) If guideTree=NULL, an initial single-linkage guide tree is constructed based on a distance matrix of shared k-mers. If an initial guideTree is provided, it should be in the output format given by IdClusters, with ascending levels of cutoff.

(2) If iterations is greater than zero, then a UPGMA guide tree is built based on the initial alignment and the sequences are re-aligned along this tree. This process is repeated iterations times or until convergence.

(3) If refinements is greater than zero, then groups of sequences are iteratively realigned to the full alignment. This process generates two alignments, the better of which is chosen based on its sum-of-pairs score. This refinement process is repeated refinements times, or until no improvement can be made.
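
The effect of steps (2) and (3) can be explored by switching them off. The sketch below assumes that setting iterations and refinements to zero skips those steps, as the description above implies; the subset of 20 sequences is an arbitrary choice to keep the run short.

```r
library(DECIPHER)

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, remove="all", verbose=FALSE)
dna <- dna[1:20]  # small subset, chosen only to keep the example fast

# Step (1) only: align along the initial k-mer based guide tree
initialOnly <- AlignSeqs(dna, iterations=0, refinements=0, verbose=FALSE)

# Steps (1) to (3) with the default single iteration and single refinement
full <- AlignSeqs(dna, iterations=1, refinements=1, verbose=FALSE)

# Alignment lengths with and without the later steps
unique(width(initialOnly))
unique(width(full))
```

Skipping the later steps trades alignment quality for speed, which may be acceptable for a quick first look at a large sequence set.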

The FUN function is applied during each of the three steps based on levels. The purpose of levels is to speed up the alignment process by not running FUN on the alignment when it is unnecessary. The default levels specify that FUN should be run on the sequences when the initial tree is above 0.95 average dissimilarity, when the iterative tree is above 0.7 average dissimilarity, and after every tenth improvement made during refinement. The final element of levels prevents FUN from being applied at any point to fewer than 5 sequences. The FUN function is always applied just before returning the alignment, independently of the first three values of levels. The default FUN is AdjustAlignment, but FUN accepts any function that takes an XStringSet as its first argument, and weights, processors, and substitutionMatrix as optional arguments. For example, FUN can be made to do nothing by setting FUN=function(x, ...) return(x), where x is an XStringSet.
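
The no-op replacement described above can be tried as follows. The name `identityFUN` and the subset of 20 sequences are hypothetical choices made for this sketch.

```r
library(DECIPHER)

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, remove="all", verbose=FALSE)
dna <- dna[1:20]  # small subset, chosen only to keep the example fast

# Hypothetical no-op: returns the alignment unchanged and absorbs the
# optional weights, processors, and substitutionMatrix arguments via "..."
identityFUN <- function(x, ...) return(x)

# Alignment with the default post-processing (AdjustAlignment)
adjusted <- AlignSeqs(dna, verbose=FALSE)

# Alignment with post-processing disabled
unadjusted <- AlignSeqs(dna, FUN=identityFUN, verbose=FALSE)

# Compare the resulting alignment lengths
unique(width(adjusted))
unique(width(unadjusted))
```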

References

ES Wright (2015) "DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment". BMC Bioinformatics, doi:10.1186/s12859-015-0749-z.

Examples

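library(DECIPHER)
# Load unaligned sequences from the example database shipped with DECIPHER
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
dna <- SearchDB(db, remove="all")
# Align the sequences, then view the alignment in a web browser
alignedDNA <- AlignSeqs(dna)
BrowseSeqs(alignedDNA, highlight=1)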
