blastpAllAll: Making BLAST search all against all genomes

Description

Runs a reciprocal all-against-all BLAST search to look for similarity of proteins within and across genomes.

Usage

blastpAllAll(
  prot.files,
  out.folder,
  e.value = 1,
  job = 1,
  threads = 1,
  start.at = 1,
  verbose = TRUE
)

Arguments

prot.files

A vector with FASTA filenames.

out.folder

The folder where the result files should end up.

e.value

The chosen E-value threshold in BLAST.

job

An integer to separate multiple jobs.

threads

The number of CPU's to use.

start.at

An integer to specify where in the file-list to start BLASTing.

verbose

Logical, if TRUE some text output is produced to monitor the progress.

Value

The function produces a result file for each pair of files listed in prot.files. These result files are located in out.folder. Existing files are never overwritten by blastpAllAll, if you want to re-compute something, delete the corresponding result files first.

Details

A basic step in pangenomics and many other comparative studies is to cluster proteins into groups or families. One commonly used approach is based on BLASTing. This function uses the blast+ software available for free from NCBI (Camacho et al, 2009). More precisely, the blastp algorithm with the BLOSUM45 scoring matrix and all composition based statistics turned off.

A vector listing FASTA files of protein sequences is given as input in prot.files. These files must have the genome_id in the first token of every header, and in their filenames as well, i.e. all input files should first be prepared by panPrep to ensure this. Note that only protein sequences are considered here. If your coding genes are stored as DNA, please translate them to protein prior to using this function, see translate.

In the first version of this package we used reciprocal BLASTing, i.e. we computed both genome A against B and B against A. This may sometimes produce slightly different results, but in reality this is too costly compared to its gain, and we now only make one of the above searches. This basically halves the number of searches. This step is still very time consuming for larger number of genomes. Note that the protein files are sorted by the genome_id (part of filename) inside this function. This is to ensure a consistent ordering irrespective of how they are enterred.

For every pair of genomes a result file is produced. If two genomes have genome_id's GID111, and GID222 then the result file GID222_vs_GID111.txt will be found in out.folder after the completion of this search. The last of the two genome_id is always the first in alphabetical order of the two.

The out.folder is scanned for already existing result files, and blastpAllAll never overwrites an existing result file. If a file with the name GID111_vs_GID222.txt already exists in the out.folder, this particular search is skipped. This makes it possible to run multiple jobs in parallell, writing to the same out.folder. It also makes it possible to add new genomes, and only BLAST the new combinations without repeating previous comparisons.

This search can be slow if the genomes contain many proteins and it scales quadratically in the number of input files. It is best suited for the study of a smaller number of genomes. By starting multiple R sessions, you can speed up the search by running blastpAllAll from each R session, using the same out.folder but different integers for the job option. At the same time you may also want to start the BLASTing at different places in the file-list, by giving larger values to the argument start.at. This is 1 by default, i.e. the BLASTing starts at the first protein file. If you are using a multicore computer you can also increase the number of CPUs by increasing threads.

The result files are tab-separated text files, and can be read into R, but more commonly they are used as input to bDist to compute distances between sequences for subsequent clustering.

References

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421.

Examples

Run this code

# NOT RUN {
# This example requires the external BLAST+ software
# Using protein files in this package
pf <- file.path(path.package("micropan"), "extdata",
                str_c("xmpl_GID", 1:3, ".faa.xz"))

# We need to uncompress them first...
prot.files <- tempfile(fileext = c("_GID1.faa.xz","_GID2.faa.xz","_GID3.faa.xz"))
ok <- file.copy(from = pf, to = prot.files)
prot.files <- unlist(lapply(prot.files, xzuncompress))

# Blasting all versus all
out.dir <- "."
blastpAllAll(prot.files, out.folder = out.dir)

# Reading results, and computing blast.distances
blast.files <- list.files(out.dir, pattern = "GID[0-9]+_vs_GID[0-9]+.txt")
blast.distances <- bDist(file.path(out.dir, blast.files))

# ...and cleaning...
ok <- file.remove(prot.files)
ok <- file.remove(file.path(out.dir, blast.files))
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab