seqaln: Sequence Alignment with MUSCLE

Description

Create multiple alignments of amino acid or nucleotide sequences according to the method of Edgar.

Usage

seqaln(aln, id=NULL, profile=NULL, exefile="muscle", outfile="aln.fa", 
       protein=TRUE, seqgroup=FALSE, refine=FALSE, extra.args="",
       verbose=FALSE)

Arguments

aln

a sequence character matrix, as obtained from seqbind, or an alignment list object as obtained from read.fasta.

a vector of sequence names to serve as sequence identifers.

profile

a profile alignment of class fasta (e.g. obtained from read.fasta). The alignment aln will be added to the profile.

exefile

file path to the MUSCLE program on your system (i.e. how is MUSCLE invoked). Alternatively, CLUSTALO can be used.

outfile

name of FASTA output file to which alignment should be written.

protein

logical, if TRUE the input sequences are assumed to be protein not DNA or RNA.

seqgroup

logical, if TRUE similar sequences are grouped together in the output.

refine

logical, if TRUE the input sequences are assumed to already be aligned, and only tree dependent refinement is performed.

extra.args

a single character string containing extra command line arguments for the alignment program.

verbose

logical, if TRUE MUSCLE warning and error messages are printed.

Value

A list with two components:
alian alignment character matrix with a row per sequence and a column per equivalent aminoacid/nucleotide.
idssequence names as identifers.
callthe matched call.

Details

Sequence alignment attempts to arrange the sequences of protein, DNA or RNA, to highlight regions of shared similarity that may reflect functional, structural, and/or evolutionary relationships between the sequences.

Aligned sequences are represented as rows within a matrix. Gaps (-) are inserted between the aminoacids or nucleotides so that equivalent characters are positioned in the same column.

This function calls the MUSCLE program, to perform a multiple sequence alignment, which MUST BE INSTALLED on your system and in the search path for executables. If you have a large number of input sequences (a few thousand), or they are very long, the default settings may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the MUSCLE algorithm by setting the extra.args argument to -maxiters 2.

You can set MUSCLE to improve an existing alignment by setting refine to TRUE.

To inspect the sequence clustering used by MUSCLE to produce alignments, include -tree2 tree.out in the extra.args argument. You can then load the tree.out file with the read.tree function from the ape package.

CLUSTALO can be used as an alternative to MUSCLE by specifiying exefile='clustalo'. This might be useful e.g. when adding several sequences to a profile alignment.

References

Grant, B.J. et al. (2006) Bioinformatics 22, 2695--2696. MUSCLE is the work of Edgar: Edgar (2004) Nuc. Acid. Res. 32, 1792--1797. Full details of the MUSCLE algorithm, along with download and installation instructions can be obtained from: http://www.drive5.com/muscle.

Examples

Run this code

##-- Basic sequence alignemnt
seqs <- get.seq(c("4q21_A", "1ftn_A"))
aln <- seqaln(seqs)

##-- add a sequence to the (profile) alignment
seq <- get.seq("1tnd_A")
aln <- seqaln(seq, profile=aln)


##-- Read a folder/directory of PDB files
#pdb.path <- "my_dir_of_pdbs"
#files  <- list.files(path=pdb.path ,
#                     pattern=".pdb",
#                     full.names=TRUE)

##-- Use online files
files <- get.pdb(c("4q21","1ftn"), URLonly=TRUE)

##-- Extract and store sequences
raw <- NULL
for(i in 1:length(files)) {
  pdb <- read.pdb(files[i])
  raw <- seqbind(raw, pdbseq(pdb) )
}

##-- Align these sequences
aln <- seqaln(raw, id=files, outfile="seqaln.fa")

##-- Read Aligned PDBs storing coordinate data
pdbs <- read.fasta.pdb(aln) 

## Sequence identity
seqidentity(aln)

## Note that all the above can be done with the pdbaln() function:
#pdbs <- pdbaln(files)


##-- For identical sequences with masking use a custom matrix
aln <- list(ali=seqbind(c("X","C","X","X","A","G","K"),
                        c("C","-","A","X","G","X","X","K")),
            id=c("a","b"))

temp <- seqaln(aln=aln, outfile="temp.fas", protein=TRUE,
               extra.args= paste("-matrix",
                system.file("matrices/custom.mat", package="bio3d"),
                "-gapopen -3.0 ",
                "-gapextend -0.5",
                "-center 0.0") )

Run the code above in your browser using DataLab