This function will read a FASTA file and produce another, slightly modified, FASTA file
which is prepared for genome-wise comparisons using blastAllAll, hmmerScan or any other method.
The main purpose of panPrep is to make certain every sequence is labeled with a tag called a
GID.tag (Genome IDentifier tag) identifying the genome. This tag consists of the text “GID”
followed by an integer. The integer can be any integer as long as it is unique to every genome
in the study. It can typically be the BioProject number or any other integer that is uniquely
related to a specific genome. If a genome has the text “GID12345” as identifier, then the
sequences in the file produced by panPrep will have headerlines starting with “GID12345_seq1”,
“GID12345_seq2”, “GID12345_seq3”, etc. This makes it possible to quickly identify which genome
every sequence belongs to.
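
The labeling scheme described above can be sketched in a few lines of base R. This is only an illustration, not the code used by panPrep: the helper tag_headers and the example headers are hypothetical, and the sketch assumes the original header text is kept after the tag, separated by a space.

```r
# Hypothetical helper (not part of the package): build tagged headerlines.
tag_headers <- function(headers, gid.tag) {
  # Number the sequences in file order and prepend "<GID.tag>_seq<i>".
  paste0(gid.tag, "_seq", seq_along(headers), " ", headers)
}

# Made-up headerlines for two sequences from the same genome.
headers <- c("contig1_1 # 3 # 98", "contig1_2 # 120 # 560")
tagged <- tag_headers(headers, "GID12345")

# The genome of any sequence can be recovered from the start of its header.
genome <- sub("^(GID[0-9]+)_seq[0-9]+.*$", "\\1", tagged)
```

Because the tag always sits at the start of the headerline, a single regular expression is enough to map every sequence back to its genome.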
The GID.tag is also added to the file name specified in out.file. For this reason the
out.file must have a file extension containing letters only. By convention, we expect FASTA
files to have one of the extensions .fsa, .faa, .fa or .fasta.
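
A letters-only extension makes it easy to locate the point in the file name where the tag goes. The sketch below is an assumption about how this might be done, not the package's own code: the helper tag_file_name is hypothetical, and placing the tag directly before the extension, joined by an underscore, is an illustrative choice.

```r
# Hypothetical helper (not part of the package): add the GID.tag to a file name.
tag_file_name <- function(out.file, gid.tag) {
  # Require an extension made of letters only, e.g. .fsa, .faa, .fa or .fasta.
  if (!grepl("\\.[A-Za-z]+$", out.file)) {
    stop("out.file must end in a file extension containing letters only")
  }
  # Assumed placement: insert "_<GID.tag>" just before the extension.
  sub("\\.([A-Za-z]+)$", paste0("_", gid.tag, ".\\1"), out.file)
}

tagged.file <- tag_file_name("proteins.faa", "GID12345")
# tagged.file is "proteins_GID12345.faa"
```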
panPrep will also remove very short sequences (< 10 amino acids), remove stop codon
symbols (*), replace alien characters with X and convert all sequences to upper-case.
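
These clean-up steps can be sketched in base R as follows. The helper clean_proteins is hypothetical, and the sketch makes several assumptions: that "alien" means any character outside the 20 standard amino acid letters, that the steps run in the order shown, and that length is measured after stop symbols are removed. The package itself may differ on each of these points.

```r
# Hypothetical helper (not part of the package): clean protein sequences.
clean_proteins <- function(sequences, min.length = 10) {
  s <- toupper(sequences)                       # convert to upper-case
  s <- gsub("*", "", s, fixed = TRUE)           # remove stop codon symbols
  s <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "X", s)  # assumed alphabet; others become X
  s[nchar(s) >= min.length]                     # drop very short sequences
}

cleaned <- clean_proteins(c("mktayiakqr*", "MKT*", "MKTAYIAKQRJ1"))
# cleaned is c("MKTAYIAKQR", "MKTAYIAKQRXX")
```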
If the input discard contains a regular expression, any sequence whose headerline matches it
is also removed. Example: If we use prodigal to find proteins in a genome, partially predicted
genes will have the text partial=10 or partial=01 in their headerline. Using
discard="partial=01|partial=10" will remove these from the data set.
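
The effect of this regular expression can be checked with base R's grepl before running the real function. The headerlines below are made up for illustration and only imitate the partial= field described above. The sketch shows the matching logic, not the package's internal code.

```r
# Made-up headerlines imitating the partial= field of prodigal output.
headers <- c("GID12345_seq1 ID=1_1;partial=00;start_type=ATG",
             "GID12345_seq2 ID=1_2;partial=10;start_type=Edge",
             "GID12345_seq3 ID=1_3;partial=01;start_type=ATG")

discard <- "partial=01|partial=10"

# A sequence is kept only if its headerline does NOT match the expression.
keep <- !grepl(discard, headers)
kept.headers <- headers[keep]
```

Only the first headerline survives here, since partial=00 matches neither alternative in the expression.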