A prokaryotic Open Reading Frame (ORF) is defined as a sub-sequence
starting with a start-codon (ATG, GTG or TTG), followed by an integer number
of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG, unless
trans.tab = 4, see below). This function will locate all such ORFs in
a genome.
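As a minimal illustration of this definition, the base R sketch below tests whether a single upper-case DNA string is a complete ORF. The helper is_complete_orf is hypothetical and not part of the package, and it assumes that no in-frame stop-codon may occur before the final codon.

```r
# Hypothetical helper, not part of the package: checks the ORF definition
# on one upper-case DNA string.
is_complete_orf <- function(dna,
                            starts = c("ATG", "GTG", "TTG"),
                            stops = c("TAA", "TGA", "TAG")) {
  n <- nchar(dna)
  # Need whole codons, and at least one start- plus one stop-codon
  if (n < 6 || n %% 3 != 0) return(FALSE)
  pos <- seq(1, n - 2, by = 3)
  codons <- substring(dna, pos, pos + 2)
  k <- length(codons)
  first_ok <- codons[1] %in% starts
  last_ok <- codons[k] %in% stops
  # Assumption: no in-frame stop-codon before the last codon
  inner_ok <- !any(codons[-k] %in% stops)
  first_ok && last_ok && inner_ok
}

is_complete_orf("ATGAAATAA")   # TRUE: ATG, AAA, TAA
is_complete_orf("ATGAAATA")    # FALSE: length is not a multiple of 3
```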
The argument genome is a fasta object, i.e. a table with columns
Header and Sequence, and will typically have several sequences
(chromosomes/plasmids/scaffolds/contigs). It is vital that the first
token (characters before first space) of every Header is
unique, since this will be used to identify these genome sequences in the
output.
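The uniqueness requirement can be checked before searching for ORFs. The sketch below uses a toy data.frame as a stand-in for a fasta object; the header texts and sequences are made up for illustration.

```r
# Toy stand-in for a fasta object: a table with Header and Sequence columns
genome <- data.frame(
  Header = c("contig_1 length=1200", "contig_2 length=800", "contig_1 copy"),
  Sequence = c("ATGAAATAA", "GTGCCCTAG", "TTGGGGTGA"),
  stringsAsFactors = FALSE
)

# First token = all characters before the first space
tokens <- sub(" .*$", "", genome$Header)

# Tokens that occur more than once would make the output ambiguous
dups <- unique(tokens[duplicated(tokens)])
if (length(dups) > 0) {
  message("Non-unique first tokens: ", paste(dups, collapse = ", "))
}
```

Here the first and third Header share the token contig_1, so one of them would need to be renamed.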
By default the genome sequences are assumed to be linear, i.e. contigs or
other incomplete fragments of a genome. In such cases there will usually be
some truncated ORFs at each end, i.e. ORFs where either the start- or the
stop-codon is lacking. In the orf.table returned by this function this
is marked in the Attributes column. The texts "Truncated=10" and
"Truncated=01" indicate truncation at the beginning and at the end of the
genomic sequence, respectively. If the supplied genome is a completed genome,
with circular chromosome/plasmids, set the flag circular = TRUE and no
truncated ORFs will be listed. In cases where an ORF runs across the origin
of a circular genome sequence, the stop coordinate will be larger than the
length of the genome sequence. This is in line with the specifications of
the GFF3 format, where a Start cannot be larger than the
corresponding End.
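To make the coordinate convention concrete, the sketch below retrieves a forward-strand subsequence from a circular sequence when the end coordinate exceeds the sequence length. The helper circular_substr is hypothetical and only illustrates how such coordinates map back onto the sequence; it is not how the package extracts ORFs.

```r
# Hypothetical helper: subsequence of a circular sequence, where end may be
# larger than the sequence length (forward strand only).
circular_substr <- function(dna, start, end) {
  n <- nchar(dna)
  # Map positions beyond the end back onto 1..n
  idx <- ((start:end) - 1) %% n + 1
  paste(strsplit(dna, "")[[1]][idx], collapse = "")
}

dna <- "AATAACCCCCATGA"                # toy circular sequence of length 14
orf <- circular_substr(dna, 11, 19)    # crosses the origin: 11..14, then 1..5
orf                                    # "ATGAAATAA"
```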
An alternative translation table may be specified, and as of now the only
alternative implemented is table 4. This means codon TGA is no longer a stop,
but codes for Tryptophan. This coding is used by some bacteria
(e.g. under the orders Entomoplasmatales and Mycoplasmatales).
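The effect of table 4 can be sketched as a change in the stop-codon set. The helpers stop_codons and first_stop below are hypothetical, and the default value 11 (the standard bacterial table) is an assumption rather than something stated above.

```r
# Hypothetical helper: stop-codon set implied by the translation table.
# Assumption: any value other than 4 means the standard stop-codons.
stop_codons <- function(trans.tab = 11) {
  if (trans.tab == 4) c("TAA", "TAG") else c("TAA", "TGA", "TAG")
}

# Index of the first in-frame stop-codon (counting codons from 1), or NA
first_stop <- function(dna, trans.tab = 11) {
  pos <- seq(1, nchar(dna) - 2, by = 3)
  codons <- substring(dna, pos, pos + 2)
  hit <- which(codons %in% stop_codons(trans.tab))
  if (length(hit) == 0) NA_integer_ else hit[1]
}

dna <- "ATGTGAAAATAA"              # codons: ATG TGA AAA TAA
first_stop(dna, trans.tab = 11)    # 2: TGA terminates the ORF
first_stop(dna, trans.tab = 4)     # 4: TGA is read through, TAA terminates
```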
Note that for any given stop-codon there are usually multiple start-codons
in the same reading frame. This function will return all such nested ORFs,
i.e. the same stop position may appear multiple times. If you want ORFs with
the most upstream start-codon only (LORFs), then filter the output from this
function with lorfs.
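The idea behind this filtering can be sketched in base R on a toy table. This is not the implementation of lorfs; the column names Seqid and Strand are assumptions, and the sketch assumes GFF-style coordinates where the stop-codon sits at End on the + strand and at Start on the - strand.

```r
# Toy ORF table; Seqid and Strand are assumed column names
orfs <- data.frame(
  Seqid  = c("c1", "c1", "c1", "c1", "c1"),
  Start  = c(10, 25, 40, 100, 100),
  End    = c(69, 69, 69, 150, 180),
  Strand = c("+", "+", "+", "-", "-"),
  stringsAsFactors = FALSE
)

# Nested ORFs share the same stop position on the same sequence and strand
stop_pos <- ifelse(orfs$Strand == "+", orfs$End, orfs$Start)
key <- paste(orfs$Seqid, orfs$Strand, stop_pos)

# The most upstream start-codon gives the longest ORF in each group
len <- orfs$End - orfs$Start + 1
keep <- len == ave(len, key, FUN = max)
longest <- orfs[keep, ]
longest   # one row per stop position: 10..69 (+) and 100..180 (-)
```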