A basic step in pangenomics and many other comparative studies is to cluster proteins into
groups or families. One commonly used approach is based on reciprocal BLASTing. This function uses the
BLAST+ software available for free from NCBI (Camacho et al, 2009).
A vector listing FASTA files of protein sequences is given as input in prot.files. These files
must have the GID-tag in the first token of every header, and in their filenames as well, i.e. all input
files should first be prepared by panPrep
to ensure this. Note that only protein sequences
are considered here. If your coding genes are stored as DNA, please translate them to protein prior to
using this function, see translate
.
A BLAST database is made from each genome in turn. Then all genomes are queried against this database,
and for every pair of genomes a result file is produced. If two genomes have GID-tags GID111,
and GID222 then both result file GID111_vs_GID222.txt and GID222_vs_GID111.txt will
be found in out.folder after the completion of this search. This reciprocal (two-way) search is
required because of the heuristics of BLAST.
The out.folder is scanned for already existing result files, and blastAllAll
never
overwrites an existing result file. If a file with the name GID111_vs_GID222.txt already exists in
the out.folder, this particular search is skipped. This makes it possible to run multiple jobs in
parallell, writing to the same out.folder. It also makes it possible to add new genomes, and only
BLAST the new combinations without repeating previous comparisons.
This search can be slow if the genomes contain many proteins and it scales quadratically in the number of
input files. It is best suited for the study of a smaller number of genomes (less than say 100). By
starting multiple R sessions, you can speed up the search by running blastAllAll
from each R
session, using the same out.folder but different integers for the job
option. If you are
using a computing cluster you can also increase the number of CPUs by increasing threads
.
The result files are text files, and can be read into R using readBlastTable
, but more
commonly they are used as input to bDist
to compute distances between sequences for subsequent
clustering.