A basic step in pangenomics and many other comparative studies is to cluster proteins into
groups or families. One commonly used approach is based on BLASTing. This function uses the
blast+ software available for free from NCBI (Camacho et al, 2009). More precisely, the blastp
algorithm with the BLOSUM45 scoring matrix and all composition based statistics turned off.
A vector listing FASTA files of protein sequences is given as input in prot.files. These files
must have the genome_id in the first token of every header, and in their filenames as well, i.e. all input
files should first be prepared by panPrep
to ensure this. Note that only protein sequences
are considered here. If your coding genes are stored as DNA, please translate them to protein prior to
using this function, see translate
.
In the first version of this package we used reciprocal BLASTing, i.e. we computed both genome A against
B and B against A. This may sometimes produce slightly different results, but in reality this is too
costly compared to its gain, and we now only make one of the above searches. This basically halves the
number of searches. This step is still very time consuming for larger number of genomes. Note that the
protein files are sorted by the genome_id (part of filename) inside this function. This is to ensure a
consistent ordering irrespective of how they are enterred.
For every pair of genomes a result file is produced. If two genomes have genome_id's GID111,
and GID222 then the result file GID222_vs_GID111.txt will
be found in out.folder after the completion of this search. The last of the two genome_id is always
the first in alphabetical order of the two.
The out.folder is scanned for already existing result files, and blastpAllAll
never
overwrites an existing result file. If a file with the name GID111_vs_GID222.txt already exists in
the out.folder, this particular search is skipped. This makes it possible to run multiple jobs in
parallell, writing to the same out.folder. It also makes it possible to add new genomes, and only
BLAST the new combinations without repeating previous comparisons.
This search can be slow if the genomes contain many proteins and it scales quadratically in the number of
input files. It is best suited for the study of a smaller number of genomes. By
starting multiple R sessions, you can speed up the search by running blastpAllAll
from each R
session, using the same out.folder but different integers for the job
option. At the same
time you may also want to start the BLASTing at different places in the file-list, by giving larger values
to the argument start.at
. This is 1 by default, i.e. the BLASTing starts at the first protein file.
If you are using a multicore computer you can also increase the number of CPUs by increasing threads
.
The result files are tab-separated text files, and can be read into R, but more
commonly they are used as input to bDist
to compute distances between sequences for subsequent
clustering.