microseq (version 2.1.6)

findGenes: Finding coding genes

Description

Finding coding genes in genomic DNA using the Prodigal software.

Usage

findGenes(
  genome,
  prodigal.exe = "prodigal",
  faa.file = "",
  ffn.file = "",
  proc = "single",
  trans.tab = 11,
  mask.N = FALSE,
  bypass.SD = FALSE
)

Value

A GFF-table (see readGFF for details) with one row for each detected coding gene.

Arguments

genome

A table with columns Header and Sequence, containing the genome sequence(s).

prodigal.exe

Command to run the external software prodigal on the system (text).

faa.file

If provided, prodigal will output all proteins to this fasta-file (text).

ffn.file

If provided, prodigal will output all DNA sequences to this fasta-file (text).

proc

Either "single" or "meta", see below.

trans.tab

Either 11 or 4 (see below).

mask.N

Turn on masking of N's (logical).

bypass.SD

Bypass Shine-Dalgarno filter (logical).

Author

Lars Snipen and Kristian Hovde Liland.

Details

The external software Prodigal is used to scan through a prokaryotic genome to detect the protein coding genes. The text in prodigal.exe must contain the exact command to invoke prodigal on the system.
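
As a minimal sketch (not part of microseq itself), base R's Sys.which can be used to confirm that the command you intend to pass as prodigal.exe resolves to an executable before calling findGenes. The variable names below are illustrative only.

```r
# Command intended for the prodigal.exe argument (the default value is used here)
prodigal.exe <- "prodigal"

# Sys.which() returns the full path of the executable, or "" if it is not found
prodigal.path <- unname(Sys.which(prodigal.exe))
prodigal.found <- nzchar(prodigal.path)

if (prodigal.found) {
  message("Prodigal found at: ", prodigal.path)
} else {
  message("'", prodigal.exe, "' was not found on the PATH; ",
          "supply the full command in prodigal.exe")
}
```

If the executable is installed outside the PATH, the full path to it can be given as prodigal.exe instead.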

In addition to the standard output from this function, FASTA files with protein and/or DNA sequences may be produced directly by providing filenames in faa.file and ffn.file.
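
A hedged sketch of requesting both FASTA files, using the small.fna file shipped with the package. The output file names are arbitrary choices for illustration, and the call is only attempted when both microseq and a prodigal executable appear to be available.

```r
# Hypothetical output locations, chosen here only for illustration
faa.file <- file.path(tempdir(), "proteins.faa")
ffn.file <- file.path(tempdir(), "genes.ffn")

# Only run when the package and the external software seem to be present
can.run <- requireNamespace("microseq", quietly = TRUE) &&
  nzchar(Sys.which("prodigal"))

if (can.run) {
  genome.file <- system.file("extdata", "small.fna", package = "microseq")
  genome <- microseq::readFasta(genome.file)
  # The GFF-table is returned as usual; the FASTA files are written as a side effect
  gff.tbl <- microseq::findGenes(genome, trans.tab = 4,
                                 faa.file = faa.file, ffn.file = ffn.file)
  print(file.exists(c(faa.file, ffn.file)))
}
```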

The input proc allows you to specify whether the input data should be treated as a single genome (default) or as a metagenome. In the latter case the sequences in genome are (un-binned) contigs.
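
To illustrate the expected layout, here is a small sketch of a table with one row per contig. The headers and sequences are made up and far too short for real gene prediction. A plain data.frame is used only to keep the sketch dependency-free; readFasta returns the table in the form microseq expects, and whether findGenes accepts a plain data.frame is not verified here.

```r
# One row per sequence: a metagenome assembly has one row per contig
contigs <- data.frame(
  Header   = c("contig_1", "contig_2", "contig_3"),   # made-up names
  Sequence = c("ATGAAACGCATTAGCTAA", "ATGGGTCTGTAA", "ATGCCCGGGAAATTTTAG"),
  stringsAsFactors = FALSE
)

# The two columns named in the genome argument must both be present
required <- c("Header", "Sequence")
has.columns <- all(required %in% colnames(contigs))

# Length of each contig in bases
contig.lengths <- nchar(contigs$Sequence)

# With microseq and Prodigal installed, a table like this would be analysed with
# gff.tbl <- findGenes(contigs, proc = "meta")
```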

The translation table is by default 11 (the bacterial, archaeal and plant plastid code, the usual choice for prokaryotes), but table 4 should be used for Mycoplasma etc.
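
The practical difference is that TGA is a stop codon in table 11 but encodes tryptophan in table 4, so a gene read with the wrong table can appear truncated. The sketch below shows this on a made-up sequence; firstStop is a hypothetical helper written for this illustration, not a microseq function.

```r
# Stop codons under the two translation tables supported by findGenes
stop.codons <- list(
  "11" = c("TAA", "TAG", "TGA"),
  "4"  = c("TAA", "TAG")          # TGA encodes tryptophan in table 4
)

# Hypothetical helper: index of the first in-frame stop codon, or NA if none
firstStop <- function(dna, trans.tab) {
  starts <- seq(1, nchar(dna) - 2, by = 3)
  codons <- substring(dna, starts, starts + 2)
  hit <- which(codons %in% stop.codons[[as.character(trans.tab)]])
  if (length(hit) == 0) NA_integer_ else hit[1]
}

# Made-up reading frame: ATG AAA TGA GGA TTT TAA
orf <- "ATGAAATGAGGATTTTAA"
firstStop(orf, trans.tab = 11)   # stops at codon 3 (TGA)
firstStop(orf, trans.tab = 4)    # reads through TGA, stops at codon 6 (TAA)
```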

Setting mask.N = TRUE prevents genes from containing runs of N. Setting bypass.SD = TRUE turns off the search for a Shine-Dalgarno motif.
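
For orientation, the arguments correspond to options of the Prodigal command line (-a, -d, -p, -g, -m, -n). The sketch below assembles such a command with a hypothetical helper, prodigalArgs. It is an assumption about how the mapping could look, not the code findGenes actually runs, and the file names are placeholders.

```r
# Hypothetical helper: translate findGenes-style arguments into Prodigal options
prodigalArgs <- function(in.file, out.file, faa.file = "", ffn.file = "",
                         proc = "single", trans.tab = 11,
                         mask.N = FALSE, bypass.SD = FALSE) {
  proc <- match.arg(proc, c("single", "meta"))
  opts <- c("-i", in.file,                 # input FASTA file
            "-o", out.file, "-f", "gff",   # output file, in GFF format
            "-p", proc,                    # procedure: single or meta
            "-g", as.character(trans.tab), # translation table
            "-q")                          # run quietly
  if (nzchar(faa.file)) opts <- c(opts, "-a", faa.file)  # protein FASTA
  if (nzchar(ffn.file)) opts <- c(opts, "-d", ffn.file)  # nucleotide FASTA
  if (mask.N)           opts <- c(opts, "-m")            # treat runs of N as masked
  if (bypass.SD)        opts <- c(opts, "-n")            # bypass Shine-Dalgarno trainer
  opts
}

opts <- prodigalArgs("genome.fna", "genes.gff", faa.file = "proteins.faa",
                     trans.tab = 4, mask.N = TRUE)
cmd <- paste("prodigal", paste(opts, collapse = " "))
cmd
```

Here cmd is the string "prodigal -i genome.fna -o genes.gff -f gff -p single -g 4 -q -a proteins.faa -m", which shows how each non-default argument adds one option.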

See Also

readGFF, gff2fasta.

Examples

if (FALSE) {
# This example requires the external prodigal software
# Using a genome file in this package.
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")

# Searching for coding sequences, this is Mycoplasma (trans.tab = 4)
genome <- readFasta(genome.file)
gff.tbl <- findGenes(genome, trans.tab = 4)

# Retrieving the sequences
cds.tbl <- gff2fasta(gff.tbl, genome)

# You may use the pipe operator
library(dplyr)    # provides %>% and filter()
library(ggplot2)
readFasta(genome.file) %>% 
  findGenes(trans.tab = 4) %>% 
  filter(Score >= 50) %>% 
  ggplot() +
  geom_histogram(aes(x = Score), bins = 25)
}