FindChimeras: Find Chimeras in a Sequence Database

Description

Finds chimeras present in a database of sequences. Makes use of a reference database of (presumed to be) good quality sequences.

Usage

FindChimeras(dbFile,
             tblName = "Seqs",
             identifier = "",
             dbFileReference,
             tblNameReference = "Seqs",
             batchSize = 100,
             minNumFragments = 20000,
             tb.width = 5,
             multiplier = 20,
             minLength = 70,
             minCoverage = 0.6,
             overlap = 100,
             minSuspectFragments = 6,
             showPercentCoverage = FALSE,
             add2tbl = FALSE,
             maxGroupSize = -1,
             minGroupSize = 100,
             excludeIDs = NULL,
             processors = 1,
             verbose = TRUE)

Arguments

dbFile

A SQLite connection object or a character string specifying the path to the database file to be checked for chimeric sequences.

tblName

Character string specifying the table in which to check for chimeras.

identifier

Optional character string used to narrow the search results to those matching a specific identifier. If "" then all identifiers are selected.

dbFileReference

A SQLite connection object or a character string specifying the path to the reference database file of (presumed to be) good quality sequences. A 16S reference database is available from DECIPHER.cee.wisc.edu.

tblNameReference

Character string specifying the table with reference sequences.

batchSize

Number of sequences to tile with fragments at a time.

minNumFragments

Number of suspect fragments to accumulate before searching through other groups.

tb.width

A single integer [1..14] giving the number of nucleotides at the start of each fragment that are part of the trusted band.

multiplier

A single integer specifying how many times more fragments must be found out-of-group than in-group for a sequence to be considered a chimera.

minLength

Minimum length of a chimeric region in order to be considered as a chimera.

minCoverage

Minimum fraction of coverage necessary in a chimeric region.

overlap

Number of nucleotides at the end of the sequence that the chimeric region must overlap in order to be considered a chimera.

minSuspectFragments

Minimum number of suspect fragments belonging to another group required to consider a sequence a chimera.

showPercentCoverage

Logical indicating whether to list the percent coverage of suspect fragments in each chimeric region in the output.

add2tbl

Logical or a character string specifying the table name in which to add the result.

maxGroupSize

Maximum number of sequences searched in a group. A value of less than 0 means the search is unlimited.

minGroupSize

The minimum number of sequences a group must contain to be included in the search for chimeras. May need to be set to a small value for a reference database made up mostly of small groups.

excludeIDs

Optional character vector of identifier(s) to exclude from database searches, or NULL (the default) to not exclude any.

processors

The number of processors to use, or NULL to automatically detect and use all available processors.

verbose

Logical indicating whether to display progress.

Value

A data.frame containing only the sequences that meet the specifications for being chimeric. The chimera column contains information on the chimeric region and to which group it belongs. The row.names of the data.frame correspond to those of the sequences in the dbFile.

Details

FindChimeras works by finding suspect fragments that are uncommon in the group where the sequence belongs, but very common in another group where the sequence does not belong. Each sequence in the dbFile is tiled into short sequence segments called fragments. If the fragments are infrequent in their respective group in the dbFileReference then they are considered suspect. If enough suspect fragments from a sequence meet the specified constraints then the sequence is flagged as a chimera.
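The idea of tiling and suspect fragments can be illustrated with a toy sketch in base R. This is not the algorithm FindChimeras implements (it omits the trusted band, the multiplier, and the coverage constraints); the helper names tile_fragments and count_hits, the fragment width, and the sequences are all made up for illustration.

```r
# Toy illustration only: tile a query into overlapping fragments and
# compare how often each fragment occurs in-group versus out-of-group.

# Hypothetical helper: all overlapping substrings of a fixed width.
tile_fragments <- function(x, width) {
  starts <- seq_len(max(0L, nchar(x) - width + 1L))
  substring(x, starts, starts + width - 1L)
}

# Hypothetical helper: number of sequences in `group` containing each fragment.
count_hits <- function(fragments, group) {
  vapply(fragments,
         function(f) sum(grepl(f, group, fixed = TRUE)),
         integer(1), USE.NAMES = FALSE)
}

groupA <- c("ACGTACGTTTGACCA", "ACGTACGTTTGACCT")  # group the query is assigned to
groupB <- c("GGCATCGATCGGATC", "GGCATCGATCGGATT")  # a different group
query  <- "ACGTACGTCGATCGGA"  # starts like groupA, ends like groupB

frags     <- tile_fragments(query, width = 6L)
in_group  <- count_hits(frags, groupA)
out_group <- count_hits(frags, groupB)

# A fragment is "suspect" here if it is absent from the query's own group
# but present in another group.
suspect <- in_group == 0L & out_group > 0L
frags[suspect]
```

In this toy case the suspect fragments cluster at the end of the query, which is the kind of contiguous region that the minLength, minCoverage, and overlap arguments constrain in the real function.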

The default parameters are optimized for full-length 16S sequences (> 1,000 nucleotides). Shorter 16S sequences require two parameters that differ from the defaults: minLength = 40 and minSuspectFragments = 2.
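A call with the two short-sequence settings might look like the following sketch. The wrapper name find_chimeras_short is hypothetical and not part of DECIPHER; it assumes the DECIPHER package is installed and leaves every other argument at its default.

```r
# Hypothetical convenience wrapper (not part of DECIPHER): applies the two
# non-default settings recommended in the text for shorter 16S sequences.
find_chimeras_short <- function(dbFile, dbFileReference, ...) {
  DECIPHER::FindChimeras(dbFile,
                         dbFileReference = dbFileReference,
                         minLength = 40,
                         minSuspectFragments = 2,
                         ...)
}

# Usage (paths are placeholders):
# chimeras <- find_chimeras_short("short_reads.sqlite", "reference.sqlite")
```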

Groups are determined by the identifier present in each database. For this reason, the groups in the dbFile should exist in the groups of the dbFileReference. The reference database is assumed to contain many sequences of only good quality.

If no reference database is available, the input database can serve as its own reference. Removing the detected chimeras from that database and repeating the process iteratively can result in a clean reference database.
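The iterative procedure can be sketched as a simple loop that stops once no chimeras are flagged. The function clean_reference and its two callbacks are hypothetical: find_chimeras would typically wrap FindChimeras(db, dbFileReference = db), and remove_seqs stands for whatever step deletes or excludes the flagged rows, which this page does not specify.

```r
# Hypothetical sketch of iterative self-referenced cleaning.
# find_chimeras(db) must return a data.frame whose row names identify the
# flagged sequences; remove_seqs(db, ids) must remove or exclude them.
clean_reference <- function(db, find_chimeras, remove_seqs, max_iter = 10L) {
  for (i in seq_len(max_iter)) {
    flagged <- find_chimeras(db)
    if (nrow(flagged) == 0L) {
      return(i - 1L)  # number of removal rounds that were needed
    }
    remove_seqs(db, rownames(flagged))
  }
  max_iter  # stopped at the iteration cap; the database may not be clean yet
}
```

The iteration cap is a design choice for the sketch so that the loop always terminates, even if each round keeps flagging new sequences.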

For non-16S sequences it may be necessary to optimize the parameters for the particular sequences. The simplest way to perform an optimization is to experiment with different input parameters on artificial chimeras such as those created using CreateChimeras. Adjusting the input parameters until the maximum number of artificial chimeras is identified is the easiest way to determine new defaults.
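One way to organize such an experiment is a small grid search, sketched below. The function tune_params, the grid values, and the assumption that a database of artificial chimeras (dbArtificial) has already been built are all illustrative; only the FindChimeras argument names come from the Usage section above.

```r
# Hypothetical sketch: try combinations of two parameters and count how many
# artificial chimeras are flagged under each combination.
# dbArtificial: database holding artificial chimeras (assumed to exist already)
# dbRef:        reference database
# finder:       defaults to DECIPHER::FindChimeras; replaceable for testing
tune_params <- function(dbArtificial, dbRef,
                        grid = expand.grid(minLength = c(40, 70),
                                           minSuspectFragments = c(2, 4, 6)),
                        finder = DECIPHER::FindChimeras) {
  grid$detected <- vapply(seq_len(nrow(grid)), function(i) {
    res <- finder(dbArtificial,
                  dbFileReference = dbRef,
                  minLength = grid$minLength[i],
                  minSuspectFragments = grid$minSuspectFragments[i],
                  verbose = FALSE)
    nrow(res)  # assumes the result is a data.frame, as described under Value
  }, integer(1))
  grid[order(-grid$detected), ]  # best-detecting combinations first
}
```

Counting detections alone says nothing about false positives, so it may be worth running the same grid against sequences known not to be chimeric before adopting new defaults.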

References

ES Wright et al. (2012) "DECIPHER: A Search-Based Approach to Chimera Identification for 16S rRNA Sequences." Applied and Environmental Microbiology, doi:10.1128/AEM.06516-11.

Examples

db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
# For a real analysis, set dbFileReference to the file path of the
# 16S reference database available from DECIPHER.cee.wisc.edu.
# Here the example database is used as its own reference.
chimeras <- FindChimeras(db, dbFileReference=db)
