micropan (version 1.2)

findOrfs: Finding ORFs in genomes

Description

Finds all ORFs in prokaryotic genome sequences.

Usage

findOrfs(genome, circular = FALSE)

Arguments

genome

A Fasta object with the genome sequence(s).

circular

Logical indicating whether the genome sequences are completed, circular sequences.

Value

This function returns a gff.table, which is simply a data.frame with columns following the GFF3 format specification; see readGFF. To retrieve the ORF sequences, use gff2fasta.

Details

A prokaryotic Open Reading Frame (ORF) is defined as a subsequence starting with a start-codon (ATG, GTG or TTG), followed by an integer number of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG). This function will locate all ORFs in a genome.
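The definition above can be illustrated with a small base R sketch. It scans only the forward strand of a single sequence, and the function name find_orfs_forward is hypothetical; it is not how findOrfs is implemented and it ignores the reverse strand and truncated ORFs.

```r
# Toy forward-strand ORF finder in base R (illustration only, not micropan code).
find_orfs_forward <- function(dna) {
  dna    <- toupper(dna)
  starts <- c("ATG", "GTG", "TTG")
  stops  <- c("TAA", "TGA", "TAG")
  n      <- nchar(dna)
  out    <- data.frame(Start = integer(0), End = integer(0))
  for (frame in 1:3) {
    if (n - frame + 1 < 3) next            # no complete codon in this frame
    pos    <- seq(frame, n - 2, by = 3)    # first base of every codon in the frame
    codons <- substring(dna, pos, pos + 2)
    open.starts <- integer(0)              # start codons seen since the last stop
    for (i in seq_along(codons)) {
      if (codons[i] %in% starts) {
        open.starts <- c(open.starts, pos[i])
      } else if (codons[i] %in% stops) {
        if (length(open.starts) > 0) {
          # every open start paired with this stop is one ORF
          out <- rbind(out, data.frame(Start = open.starts, End = pos[i] + 2))
        }
        open.starts <- integer(0)
      }
    }
  }
  out
}

# ATG at position 3 and GTG at position 9 share the TAA ending at position 17
orfs <- find_orfs_forward("CCATGAAAGTGCCCTAAGG")
```

Here orfs has two rows with the same End, which is the behaviour described for multiple start-codons upstream of one stop-codon.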

The argument genome will typically have several sequences (chromosomes/plasmids/scaffolds/contigs). It is vital that the first token (the characters before the first space) of every genome$Header is unique, since this token is used to identify the genome sequences in the output.
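A quick way to check the uniqueness requirement before calling findOrfs might look like the sketch below. The helper name has_unique_first_tokens and the example headers are made up for illustration; in practice the input would be genome$Header.

```r
# Hypothetical helper: TRUE if the first token of every header is unique.
has_unique_first_tokens <- function(headers) {
  first.token <- sub(" .*$", "", headers)   # drop everything from the first space
  anyDuplicated(first.token) == 0
}

# Made-up headers standing in for genome$Header
headers <- c("contig_1 length=5000", "contig_2 length=3200", "plasmid_A circular")
ok <- has_unique_first_tokens(headers)
```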

Note that for any given stop-codon there are usually multiple start-codons in the same reading frame. This function returns all of them, i.e. the same stop position may appear multiple times. If you want only the ORFs with the most upstream start-codon (LORFs), filter the output from this function with lorfs.
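The idea behind LORF filtering can be sketched on a toy table. This is an assumption-laden illustration, not the lorfs implementation: the table is made up, contains forward-strand ORFs only, and uses just three columns. On the forward strand, ORFs sharing a stop-codon share Seqid and End, and the most upstream start is the smallest Start.

```r
# Toy table of forward-strand ORFs (made-up coordinates)
orf.toy <- data.frame(
  Seqid = c("contig1", "contig1", "contig1", "contig2"),
  Start = c(3, 9, 40, 3),
  End   = c(17, 17, 60, 17),
  stringsAsFactors = FALSE
)

# Group ORFs by sequence and stop position, keep the smallest Start per group
key      <- paste(orf.toy$Seqid, orf.toy$End)
is.lorf  <- orf.toy$Start == ave(orf.toy$Start, key, FUN = min)
lorf.toy <- orf.toy[is.lorf, ]
```

The second row (Start 9, End 17 on contig1) is dropped because a more upstream start exists for the same stop. For real output, including the reverse strand, use lorfs.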

By default the genome sequences are assumed to be linear, i.e. contigs or other incomplete fragments of a genome. In such cases there will usually be some truncated ORFs at each end, i.e. ORFs where either the start- or the stop-codon is lacking. In the gff.table returned by this function this is marked in the Attributes column. The texts "Truncated=10" and "Truncated=01" indicate truncation at the Start or End, respectively.
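One way to separate truncated from complete ORFs is to search the Attributes column for these texts. The table below is made up, and the value "Truncated=00" for a complete ORF is an assumption, not something stated above; the filter only relies on the two documented texts.

```r
# Toy gff.table-like data.frame (made-up rows; "Truncated=00" is an assumed value)
gff.toy <- data.frame(
  Start      = c(1, 120, 950),
  End        = c(90, 400, 1000),
  Attributes = c("Truncated=10", "Truncated=00", "Truncated=01"),
  stringsAsFactors = FALSE
)

# Flag ORFs lacking a start-codon (10) or a stop-codon (01)
trunc.start <- grepl("Truncated=10", gff.toy$Attributes, fixed = TRUE)
trunc.end   <- grepl("Truncated=01", gff.toy$Attributes, fixed = TRUE)

# Keep only ORFs that carry neither flag
complete.toy <- gff.toy[!(trunc.start | trunc.end), ]
```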

If the supplied genome is a completed genome, with circular chromosomes/plasmids, set the flag circular=TRUE and no truncated ORFs will be listed. In cases where an ORF runs across the origin of a circular genome sequence, the Stop coordinate will be larger than the length of the genome sequence. This is in line with the specifications of the GFF3 format, where a Start cannot be larger than the corresponding End.
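Coordinates that extend past the end of a circular sequence can be mapped back onto it with modular arithmetic. The sketch below uses a made-up 18-base sequence and a hypothetical helper wrap_positions; to extract real ORF sequences, use gff2fasta.

```r
# Hypothetical helper: map positions start..end onto a circular sequence
wrap_positions <- function(start, end, seq.length) {
  ((seq(start, end) - 1) %% seq.length) + 1
}

# Made-up circular sequence of length 18; an ORF starts at 11 and ends at 19,
# i.e. one base past the end, so its last base is position 1
genome.toy <- "AAGGCCCCCCATGAAATA"
pos        <- wrap_positions(11, 19, nchar(genome.toy))
bases      <- strsplit(genome.toy, "")[[1]]
orf.seq    <- paste(bases[pos], collapse = "")
```

Here orf.seq is "ATGAAATAA": the stop-codon TAA is completed by the first base of the sequence.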

See Also

readGFF, gff2fasta, lorfs.

Examples

# NOT RUN {
# Load the package, then locate a genome file shipped with it
library(micropan)
xpth <- file.path(path.package("micropan"),"extdata")
genome.file <- file.path(xpth,"Example_genome.fasta.xz")

# We need to uncompress it first...
tf <- tempfile(fileext=".xz")
s <- file.copy(genome.file,tf)
tf <- xzuncompress(tf)

# Reading into R and finding orfs
genome <- readFasta(tf)
orf.table <- findOrfs(genome)

# Computing ORF-lengths
orf.lengths <- orfLength(orf.table)
barplot(table(orf.lengths[orf.lengths>1]))

# Filtering to retrieve the LORFs only
lorf.table <- lorfs(orf.table)
lorf.lengths <- orfLength(lorf.table)
barplot(table(lorf.lengths[lorf.lengths>1]))

# ...and cleaning...
s <- file.remove(tf)

# }
