blastAllAll: Making BLAST search all against all genomes

Description

Runs a reciprocal all-against-all BLAST search to look for similarity of proteins within and across genomes. The main job is done by the BLAST+ software.

Usage

blastAllAll(prot.files, out.folder, e.value = 1, job = 1, threads = 1,
  verbose = T)

Arguments

prot.files

A text vector with the names of the FASTA files where the protein sequences of each genome are found.

out.folder

The name of the folder where the result files should end up.

e.value

The chosen E-value threshold in BLAST. Default is e.value=1. A smaller value will speed up the search at the cost of less sensitivity.

job

An integer to separate multiple jobs. You may want to run several jobs in parallel, and each job should have a different number here to avoid confusion on databases. Default is job=1.

threads

The number of CPUs to use.

verbose

Logical, if TRUE some text output is produced to monitor the progress.

Value

The function produces N*N result files if prot.files lists N sequence files. These result files are located in out.folder. Existing files are never overwritten by blastAllAll. If you want to re-compute something, delete the corresponding result files first.

Details

A basic step in pangenomics and many other comparative studies is to cluster proteins into groups or families. One commonly used approach is based on reciprocal BLASTing. This function uses the BLAST+ software available for free from NCBI (Camacho et al., 2009).

A vector listing FASTA files of protein sequences is given as input in prot.files. These files must have the GID-tag in the first token of every header, and in their filenames as well. All input files should first be prepared by panPrep to ensure this. Note that only protein sequences are considered here. If your coding genes are stored as DNA, please translate them to protein prior to using this function, see translate.
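As a quick sanity check before a long search, the GID-tags can be pulled out of the file names. The following is only a sketch: the file names are hypothetical, and the pattern assumes a tag has the form GID followed by digits, as in the example files of this package. It does not check the FASTA headers.

```r
# Hypothetical file names, in the style of files prepared by panPrep
prot.files <- c("Example_proteins_GID1.fasta",
                "Example_proteins_GID2.fasta",
                "Example_proteins_GID3.fasta")

# Locate the GID-tag in each file name (assumed form: GID followed by digits)
m <- regexpr("GID[0-9]+", basename(prot.files))
has.tag <- m > 0
gid.tags <- regmatches(basename(prot.files), m)

# Every file should carry a tag, and no two files should share one
stopifnot(all(has.tag), !any(duplicated(gid.tags)))
print(gid.tags)
```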

A BLAST database is made from each genome in turn. Then all genomes are queried against this database, and for every pair of genomes a result file is produced. If two genomes have GID-tags GID111 and GID222, then both result files GID111_vs_GID222.txt and GID222_vs_GID111.txt will be found in out.folder after the completion of this search. This reciprocal (two-way) search is required because of the heuristics of BLAST.
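The naming scheme described above can be illustrated by listing the result files one would expect for a given set of GID-tags. This is a sketch of the naming convention only, not the code used inside blastAllAll, and the column names first and second are placeholders chosen for this illustration.

```r
# The two GID-tags used in the text above
gid.tags <- c("GID111", "GID222")

# All ordered pairs of genomes, including each genome against itself
pairs <- expand.grid(first = gid.tags, second = gid.tags,
                     stringsAsFactors = FALSE)

# One result file per ordered pair, N*N in total
expected.files <- paste0(pairs$first, "_vs_", pairs$second, ".txt")
print(expected.files)
```

With N = 2 genomes this gives N*N = 4 file names, two of which are the reciprocal pair GID111_vs_GID222.txt and GID222_vs_GID111.txt.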

The out.folder is scanned for already existing result files, and blastAllAll never overwrites an existing result file. If a file with the name GID111_vs_GID222.txt already exists in the out.folder, this particular search is skipped. This makes it possible to run multiple jobs in parallel, writing to the same out.folder. It also makes it possible to add new genomes, and only BLAST the new combinations without repeating previous comparisons.
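To see which searches would still be outstanding for a given out.folder, the existing result files can be compared with the full list of expected names. The sketch below imitates the skipping logic with plain file operations. It is an illustration under the naming scheme above, not the actual implementation, and it uses a fresh temporary folder with one empty placeholder file standing in for a finished search.

```r
# A fresh temporary folder standing in for out.folder
out.folder <- tempfile("blast_sketch")
dir.create(out.folder)

# Result files expected for three hypothetical genomes
gid.tags <- c("GID111", "GID222", "GID333")
pairs <- expand.grid(first = gid.tags, second = gid.tags,
                     stringsAsFactors = FALSE)
wanted <- paste0(pairs$first, "_vs_", pairs$second, ".txt")

# Pretend one search has already been completed
done <- file.create(file.path(out.folder, "GID111_vs_GID222.txt"))

# Scan the folder and keep only the searches with no result file yet
existing <- list.files(out.folder,
                       pattern = "^GID[0-9]+_vs_GID[0-9]+\\.txt$")
todo <- setdiff(wanted, existing)
print(todo)

# Clean up the temporary folder
unlink(out.folder, recursive = TRUE)
```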

This search can be slow if the genomes contain many proteins, and it scales quadratically in the number of input files. It is best suited for the study of a smaller number of genomes (fewer than, say, 100). By starting multiple R sessions, you can speed up the search by running blastAllAll from each R session, using the same out.folder but different integers for the job option. If you are using a computing cluster you can also increase the number of CPUs by increasing threads.
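The quadratic growth is easy to make concrete with a little arithmetic. The genome counts below are arbitrary illustration values. The last lines show how many new searches are needed when one genome is added to an existing set of N, namely (N+1)^2 - N^2 = 2N+1.

```r
# Number of result files (N*N) for some arbitrary numbers of genomes
n.genomes <- c(3, 10, 50, 100)
n.searches <- n.genomes^2
print(data.frame(genomes = n.genomes, result.files = n.searches))

# Adding one genome to an existing set of N genomes requires
# (N+1)^2 - N^2 = 2*N + 1 new searches
n.old <- 10
n.new.searches <- (n.old + 1)^2 - n.old^2
print(n.new.searches)
```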

The result files are text files, and can be read into R using readBlastTable, but more commonly they are used as input to bDist to compute distances between sequences for subsequent clustering.

References

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421.

Examples

# NOT RUN {
# This example requires the external BLAST+ software
# Using protein files in this package
xpth <- file.path(path.package("micropan"),"extdata")
prot.files <- file.path(xpth,c("Example_proteins_GID1.fasta.xz",
                               "Example_proteins_GID2.fasta.xz",
                               "Example_proteins_GID3.fasta.xz"))

# We need to uncompress them first...
tf <- tempfile(fileext=c("GID1.fasta.xz","GID2.fasta.xz","GID3.fasta.xz"))
s <- file.copy(prot.files,tf)
tf <- unlist(lapply(tf,xzuncompress))

# Blasting all versus all...(requires BLAST+)
tmp.dir <- tempdir()
blastAllAll(tf,out.folder=tmp.dir)

# Reading results, and computing blast.distances
blast.files <- dir(tmp.dir,pattern="GID[0-9]+_vs_GID[0-9]+\\.txt$")
blast.distances <- bDist(file.path(tmp.dir,blast.files))

# ...and cleaning tmp.dir...
s <- file.remove(tf)
s <- file.remove(file.path(tmp.dir,blast.files))
# }