A prokaryotic Open Reading Frame (ORF) is defined as a subsequence starting with a start-codon
(ATG, GTG or TTG), followed by an integer number of triplets (codons), and ending with a stop-codon (TAA,
TGA or TAG). This function will locate all ORFs in a genome.
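The definition above can be illustrated with a minimal base-R sketch. It is not the implementation used by this function: the helper name `find_forward_orfs` is hypothetical, it scans the forward strand of a single sequence only, and it ignores truncated ORFs.

```r
# Minimal sketch (hypothetical helper): all ORFs on the forward strand
# of one DNA string, following the start/stop definition above.
find_forward_orfs <- function(dna) {
  starts <- c("ATG", "GTG", "TTG")
  stops  <- c("TAA", "TGA", "TAG")
  n <- nchar(dna)
  out <- data.frame(Start = numeric(0), End = numeric(0))
  for (frame in 1:3) {
    if (frame > n - 2) next                    # no complete codon in this frame
    pos <- seq(frame, n - 2, by = 3)           # first base of each codon
    codons <- substring(dna, pos, pos + 2)
    open <- numeric(0)                         # start-codons still waiting for a stop
    for (i in seq_along(codons)) {
      if (codons[i] %in% starts) {
        open <- c(open, pos[i])
      } else if (codons[i] %in% stops) {
        if (length(open) > 0) {
          # every open start pairs with this stop; End is the last base of the stop-codon
          out <- rbind(out, data.frame(Start = open, End = pos[i] + 2))
        }
        open <- numeric(0)
      }
    }
  }
  out
}

orfs <- find_forward_orfs("CCATGAAAGTGCCCTAAGG")
print(orfs)
```

In this toy sequence the start-codons ATG and GTG lie in the same reading frame and share the stop-codon TAA, so the sketch reports two ORFs with the same End.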
The argument genome will typically have several sequences (chromosomes/plasmids/scaffolds/contigs).
It is vital that the first token (characters before the first space) of every genome$Header is
unique, since this will be used to identify these genome sequences in the output.
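A quick way to verify this requirement before searching for ORFs is sketched below. The small `genome` table and its header texts are made up for illustration; only the column names Header and Sequence are assumed to match the real input.

```r
# Hypothetical genome table with a repeated first token ("contig_1")
genome <- data.frame(
  Header   = c("contig_1 length=5000", "contig_2 length=3200", "contig_1 length=800"),
  Sequence = c("ATGAAATAA", "GTGCCCTGA", "TTGGGGTAG"),
  stringsAsFactors = FALSE
)

# First token = everything before the first space
first_token <- sub(" .*$", "", genome$Header)
dup <- unique(first_token[duplicated(first_token)])
if (length(dup) > 0) {
  message("Non-unique first tokens: ", paste(dup, collapse = ", "))
}

# One possible repair: make the tokens unique and rebuild the headers
fixed <- make.unique(first_token)
rest  <- sub("^[^ ]*", "", genome$Header)      # the remainder, including the space
genome$Header <- paste0(fixed, rest)
```

Renaming with `make.unique` is only one option; any scheme works as long as the first tokens end up distinct.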
Note that for any given stop-codon there are usually multiple start-codons in the same reading
frame. This function will return all, i.e. the same stop position may appear multiple times. If
you want ORFs with the most upstream start-codon only (LORFs), then filter the output from this function
with lorfs.
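To make the idea of a LORF concrete, the following base-R sketch keeps only the longest ORF for each stop-codon in a small, made-up table. It is an illustration of the filtering idea, not the code behind lorfs. It assumes GFF-style coordinates (Start smaller than End on both strands), so the stop-codon sits at End on the "+" strand and at Start on the "-" strand.

```r
# Hypothetical ORF table; rows 1-2 share a stop on "+", rows 3-4 share a stop on "-"
orf.tbl <- data.frame(
  Seqid  = c("contig_1", "contig_1", "contig_1", "contig_1"),
  Strand = c("+", "+", "-", "-"),
  Start  = c(3, 9, 100, 100),
  End    = c(17, 17, 160, 190),
  stringsAsFactors = FALSE
)

# Position of the stop-codon end of each ORF, depending on strand
stop_pos <- ifelse(orf.tbl$Strand == "+", orf.tbl$End, orf.tbl$Start)
len <- orf.tbl$End - orf.tbl$Start + 1

# ORFs with the same sequence, strand and stop position are nested;
# the longest one has the most upstream start-codon
key  <- paste(orf.tbl$Seqid, orf.tbl$Strand, stop_pos)
keep <- len == ave(len, key, FUN = max)
lorf.tbl <- orf.tbl[keep, ]
print(lorf.tbl)
```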
By default the genome sequences are assumed to be linear, i.e. contigs or other incomplete fragments
of a genome. In such cases there will usually be some truncated ORFs at each end, i.e. ORFs where either
the start- or the stop-codon is lacking. In the gff.table returned by this function this is marked in the
Attributes column. The texts "Truncated=10" and "Truncated=01" indicate truncation at
the Start or the End, respectively.
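The two flags can be read from the Attributes text with ordinary pattern matching, as in this sketch. The example strings are made up; in particular, the value "Truncated=00" for a complete ORF is an assumption and is not stated in the text above.

```r
# Hypothetical Attributes values ("Truncated=00" for a complete ORF is an assumption)
attributes_txt <- c("Truncated=00", "Truncated=10", "Truncated=01")

# First digit flags truncation at the Start, second digit at the End
trunc_start <- grepl("Truncated=1", attributes_txt, fixed = TRUE)
trunc_end   <- grepl("Truncated=[01]1", attributes_txt)

# Rows to keep if only complete ORFs are wanted
complete <- !(trunc_start | trunc_end)
```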
If the supplied genome is a complete genome with circular chromosomes/plasmids, set the flag circular=TRUE
and no truncated ORFs will be listed.
In cases where an ORF runs across the origin of a circular genome sequence, the End coordinate will be
larger than the length of the genome sequence. This is in line with the specifications of the GFF3 format, where
a Start cannot be larger than the corresponding End.
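How such a coordinate can be handled downstream is sketched here for a toy circular sequence of 20 bases. All values are made up, and the sketch covers the forward strand only: an End beyond the sequence length is mapped back past the origin, and the ORF sequence is extracted by reading from a doubled copy of the sequence.

```r
# Toy circular sequence of length 20 and an ORF that crosses the origin
dna <- paste(rep("ACGT", 5), collapse = "")
seq_length <- nchar(dna)
Start <- 15
End   <- 26                                   # larger than seq_length, so the ORF wraps

wraps <- End > seq_length
end_on_circle <- ifelse(wraps, End - seq_length, End)   # position after passing the origin

# Extract across the origin by reading from the sequence pasted to itself
orf_seq <- substr(paste0(dna, dna), Start, End)

# Same result when pieced together from the two sides of the origin
pieced <- paste0(substr(dna, Start, seq_length), substr(dna, 1, end_on_circle))
```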