findOrfs: Finding ORFs in genomes

Description

Finds all ORFs in prokaryotic genome sequences.

Usage

findOrfs(genome, circular = F, trans.tab = 11)

Value

This function returns an orf.table, which is simply a tibble with columns adhering to the GFF3 format specifications (a gff.table), see readGFF. If you want to retrieve the actual ORF sequences, use gff2fasta.

Arguments

genome: A fasta object (tibble) with the genome sequence(s).
circular: Logical indicating if the genome sequences are completed, circular sequences.
trans.tab: Translation table.

Author

Lars Snipen and Kristian Hovde Liland.

Details

A prokaryotic Open Reading Frame (ORF) is defined as a sub-sequence starting with a start-codon (ATG, GTG or TTG), followed by an integer number of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG, unless trans.tab = 4, see below). This function will locate all such ORFs in a genome.

The argument genome is a fasta object, i.e. a table with columns Header and Sequence, and will typically have several sequences (chromosomes/plasmids/scaffolds/contigs). It is vital that the first token (characters before first space) of every Header is unique, since this will be used to identify these genome sequences in the output.

By default the genome sequences are assumed to be linear, i.e. contigs or other incomplete fragments of a genome. In such cases there will usually be some truncated ORFs at each end, i.e. ORFs where either the start- or the stop-codon is lacking. In the orf.table returned by this function this is marked in the Attributes column. The texts "Truncated=10" or "Truncated=01" indicates truncated at the beginning or end of the genomic sequence, respectively. If the supplied genome is a completed genome, with circular chromosome/plasmids, set the flag circular = TRUE and no truncated ORFs will be listed. In cases where an ORF runs across the origin of a circular genome sequences, the stop coordinate will be larger than the length of the genome sequence. This is in line with the specifications of the GFF3 format, where a Start cannot be larger than the corresponding End.

An alternative translation table may be specified, and as of now the only alternative implemented is table 4. This means codon TGA is no longer a stop, but codes for Tryptophan. This coding is used by some bacteria (e.g. under the orders Entomoplasmatales and Mycoplasmatales).

Note that for any given stop-codon there are usually multiple start-codons in the same reading frame. This function will return all such nested ORFs, i.e. the same stop position may appear multiple times. If you want ORFs with the most upstream start-codon only (LORFs), then filter the output from this function with lorfs.

Examples

Run this code

# Using a genome file in this package
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")

# Reading genome and finding orfs
genome <- readFasta(genome.file)
orf.tbl <- findOrfs(genome)

# Pipeline for finding LORFs of minimum length 100 amino acids
# and collecting their sequences from the genome
findOrfs(genome) %>% 
 lorfs() %>% 
 filter(orfLength(., aa = TRUE) > 100) %>% 
 gff2fasta(genome) -> lorf.tbl

Run the code above in your browser using DataLab