Learn R Programming

micropan (version 2.1)

hmmerScan: Scanning a profile Hidden Markov Model database

Description

Scanning FASTA formatted protein files against a database of pHMMs using the HMMER3 software.

Usage

hmmerScan(in.files, dbase, out.folder, threads = 0, verbose = TRUE)

Arguments

in.files

A character vector of file names.

dbase

The full path-name of the database to scan (text).

out.folder

The name of the folder to put the result files.

threads

Number of CPU's to use.

verbose

Logical indicating if textual output should be given to monitor the progress.

Value

This function produces files in the folder specified by out.folder. Existing files are never overwritten by hmmerScan, if you want to re-compute something, delete the corresponding result files first.

Details

The HMMER3 software is purpose-made for handling profile Hidden Markov Models (pHMM) describing patterns in biological sequences (Eddy, 2008). This function will make calls to the HMMER3 software to scan FASTA files of proteins against a pHMM database.

The files named in in.files must contain FASTA formatted protein sequences. These files should be prepared by panPrep to make certain each sequence, as well as the file name, has a GID-tag identifying their genome. The database named in db must be a HMMER3 formatted database. It is typically the Pfam-A database, but you can also make your own HMMER3 databases, see the HMMER3 documentation for help.

hmmerScan will query every input file against the named database. The database contains profile Hidden Markov Models describing position specific sequence patterns. Each sequence in every input file is scanned to see if some of the patterns can be matched to some degree. Each input file results in an output file with the same GID-tag in the name. The result files give tabular output, and are plain text files. See readHmmer for how to read the results into R.

Scanning large databases like Pfam-A takes time, usually several minutes per genome. The scan is set up to use only 1 cpu per scan by default. By increasing threads you can utilize multiple CPUs, typically on a computing cluster. Our experience is that from a multi-core laptop it is better to start this function in default mode from mutliple R-sessions. This function will not overwrite an existing result file, and multiple parallel sessions can write results to the same folder.

References

Eddy, S.R. (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Computational Biology, 4(5).

See Also

panPrep, readHmmer.

Examples

Run this code
# NOT RUN {
# This example require the external software HMMER
# Using example files in this package
pf <- file.path(path.package("micropan"), "extdata", "xmpl_GID1.faa.xz")
dbf <- file.path(path.package("micropan"), "extdata",
                 str_c("microfam.hmm", c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz")))

# We need to uncompress them first...
prot.file <- tempfile(pattern = "GID1.faa", fileext=".xz")
ok <- file.copy(from = pf, to = prot.file)
prot.file <- xzuncompress(prot.file)
db.files <- str_c(tempfile(), c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz"))
ok <- file.copy(from = dbf, to = db.files)
db.files <- unlist(lapply(db.files, xzuncompress))
db.name <- str_remove(db.files[1], "\\.[a-z0-9]+$")

# Scanning the FASTA file against microfam.hmm...
hmmerScan(in.files = prot.file, dbase = db.name, out.folder = ".")

# Reading results
hmm.file <- file.path(".", str_c("GID1_vs_", basename(db.name), ".txt"))
hmm.tbl <- readHmmer(hmm.file)

# ...and cleaning...
ok <- file.remove(prot.file)
ok <- file.remove(str_remove(db.files, ".xz"))
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab