seqinr (version 4.2-16)

read.alignment: Read aligned sequence files in mase, clustal, phylip, fasta or msf format

Description

Read a file in mase, clustal, phylip, fasta or msf format. These formats are used to store nucleotide or protein multiple alignments.

Usage

read.alignment(file, format, forceToLower = TRUE, ...)

Value

An object of class alignment which is a list with the following components:

nb

the number of aligned sequences

nam

a vector of strings containing the names of the aligned sequences

seq

a vector of strings containing the aligned sequences

com

a vector of strings containing the commentaries for each sequence or NA if there are no comments
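
As an illustration of these components, the following sketch reads the Anouk.fasta alignment shipped with seqinr (the same file used in the Examples section) and inspects the returned object. The variable name ali is only illustrative.

```r
library(seqinr)

# Read a FASTA alignment bundled with the package.
ali <- read.alignment(
  file = system.file("sequences/Anouk.fasta", package = "seqinr"),
  format = "fasta"
)

class(ali)            # the object carries the class "alignment"
names(ali)            # component names: nb, nam, seq, com
ali$nb                # number of aligned sequences
head(ali$nam)         # first few sequence names
nchar(ali$seq[[1]])   # width of the first aligned sequence, gaps included
ali$com               # NA when no comments are available
```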

Arguments

file

the name of the file from which the aligned sequences are to be read. If it does not contain an absolute or relative path, the file name is taken relative to the current working directory, as returned by getwd().

format

a character string specifying the format of the file: "mase", "clustal", "phylip", "fasta" or "msf"

forceToLower

a logical defaulting to TRUE stating whether the returned characters in the sequence should be in lower case (introduced in seqinR release 1.1-3).

...

For the fasta format, extra arguments are passed to the read.fasta function.

Author

D. Charif, J.R. Lobry

Details

"mase"

The mase format is used to store nucleotide or protein multiple alignments. The file must begin with a header of at least one line (the content of the header may be empty). Header lines must begin with ;;. The body of the file has the following structure. Each entry begins with one or more commentary lines, which start with the character ;. Again, a commentary line may be empty. After the commentary, the name of the sequence is written on a separate line. Finally, the sequence itself is written on the following lines.
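
A minimal sketch, assuming the test.mase file bundled with seqinr (the one used in the Examples section): the per-entry commentary lines described above end up in the com component of the result. The variable names mase.file and mase.ali are illustrative.

```r
library(seqinr)

mase.file <- system.file("sequences/test.mase", package = "seqinr")

# Peek at the raw file to see the header, commentary and name lines.
head(readLines(mase.file), 10)

mase.ali <- read.alignment(file = mase.file, format = "mase")

mase.ali$nam                  # sequence names, one per entry
substr(mase.ali$com, 1, 60)   # first characters of each entry's commentary
```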

"clustal"

The CLUSTAL format (*.aln) is the output format of the ClustalW multiple alignment tool. It can be described as follows. The word CLUSTAL is on the first line of the file. The alignment is displayed in blocks of a fixed length, each line in a block corresponding to one sequence. Each line of each block starts with the sequence name (maximum of 10 characters), followed by at least one space character. The sequence is then displayed in upper or lower case, with '-' denoting gaps. The residue number may be displayed at the end of the first line of each block.

"msf"

MSF is the multiple sequence alignment format of the GCG sequence analysis package. A file is made of the following parts, in this order.

File type (optional): a line, all uppercase, reading !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. Do not edit or delete the file type if it is present.

Description (optional): informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor.

Dividing line (required): contains the number of bases or residues in the sequence, when the file was created, and, importantly, two dots (..) which act as a divider between the descriptive information and the following sequence information.

Name/Weight: the name and weight information for the sequences.

Separating line (required): must include two slashes (//) to divide the name/weight information from the sequence alignment.

Multiple sequence alignment: the aligned sequences themselves.

"phylip"

PHYLIP is a tree construction program. Its format is as follows: the number of sequences and their length (in characters) are on the first line of the file. The alignment is displayed in an interleaved or sequential format. Sequence names are limited to 10 characters and may contain blanks.
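
The first-line convention can be checked against the parsed result. This sketch assumes that the bundled test.phylip file has a plain header holding just the two counts; the variable names are illustrative.

```r
library(seqinr)

phylip.file <- system.file("sequences/test.phylip", package = "seqinr")

# First line of the file: number of sequences, then alignment length.
first.line <- readLines(phylip.file, n = 1)
header <- as.integer(strsplit(trimws(first.line), "[[:space:]]+")[[1]])

phylip.ali <- read.alignment(file = phylip.file, format = "phylip")

header            # counts declared in the file
phylip.ali$nb     # number of sequences actually parsed
phylip.ali$nam    # sequence names as read from the file
```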

"fasta"

A sequence in fasta format begins with a single-line description, distinguished by a leading greater-than (>) symbol, followed by the sequence data on the following lines.
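
A minimal sketch of this layout: the code below writes a two-sequence aligned fasta file to a temporary location and reads it back. The sequence names and contents are made up for illustration.

```r
library(seqinr)

# Write a tiny aligned fasta file: one ">" description line per sequence,
# followed by the (gapped) sequence data.
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "ACGT-ACGT",
             ">seq2", "ACGTTACGT"), tmp)

ali <- read.alignment(file = tmp, format = "fasta")
unlink(tmp)  # remove the temporary file

ali$nb            # number of sequences read
ali$nam           # names taken from the description lines
unlist(ali$seq)   # aligned sequences, in lower case by default
```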

References

citation("seqinr")

See Also

To read aligned sequences in NEXUS format, see the function read.nexus that was available in the CompPairWise package (it is unclear whether that package was still maintained as of 09/09/09). The NEXUS format was mainly used by the non-GPL commercial PAUP software.

Related functions: as.matrix.alignment, read.fasta, write.fasta, reverse.align, dist.alignment.
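
As a brief, hedged illustration of how an alignment object can be handed to two of these related functions, the sketch below builds a toy alignment (made-up sequences) and converts it to a character matrix and to a distance object. It assumes as.matrix() dispatches to as.matrix.alignment once seqinr is attached.

```r
library(seqinr)

# Toy alignment of three sequences of equal length.
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">s1", "acgtacgt",
             ">s2", "acgtacgt",
             ">s3", "acgaacga"), tmp)
ali <- read.alignment(file = tmp, format = "fasta")
unlink(tmp)

# Character matrix: one row per sequence, one column per aligned site.
m <- as.matrix(ali)
dim(m)

# Pairwise distances based on sequence identity.
d <- dist.alignment(ali, matrix = "identity")
d
```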

Examples

mase.res    <- read.alignment(file = system.file("sequences/test.mase", package = "seqinr"),
                              format = "mase")
clustal.res <- read.alignment(file = system.file("sequences/test.aln", package = "seqinr"),
                              format = "clustal")
phylip.res  <- read.alignment(file = system.file("sequences/test.phylip", package = "seqinr"),
                              format = "phylip")
msf.res     <- read.alignment(file = system.file("sequences/test.msf", package = "seqinr"),
                              format = "msf")
fasta.res   <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"),
                              format = "fasta")

#
# Quality control routine sanity checks:
#

data(mase); stopifnot(identical(mase, mase.res))
data(clustal); stopifnot(identical(clustal, clustal.res))
data(phylip); stopifnot(identical(phylip, phylip.res))
data(msf); stopifnot(identical(msf, msf.res))
data(fasta); stopifnot(identical(fasta, fasta.res))

#
# Example of using extra arguments from the read.fasta function, here to keep
# whole headers as sequence names.
#

whole.header.test <-
  read.alignment(file = system.file("sequences/LTPs128_SSU_aligned_First_Two.fasta",
                                    package = "seqinr"),
                 format = "fasta", whole.header = TRUE)
whole.header.test$nam

# Should be:
#
# [1] "D50541\t1\t1411\t1411bp\trna\tAbiotrophia defectiva\tAerococcaceae"      
# [2] "KP233895\t1\t1520\t1520bp\trna\tAbyssivirga alkaniphila\tLachnospiraceae"
#
