FindChimeras(dbFile, tblName = "Seqs", identifier = "", dbFileReference, tblNameReference = "Seqs", batchSize = 100, minNumFragments = 20000, tb.width = 5, multiplier = 20, minLength = 70, minCoverage = 0.6, overlap = 100, minSuspectFragments = 6, showPercentCoverage = FALSE, add2tbl = FALSE, maxGroupSize = -1, minGroupSize = 100, excludeIDs = NULL, processors = 1, verbose = TRUE)
excludeIDs: identifier(s) to exclude from database searches, or NULL (the default) to not exclude any.
processors: NULL to automatically detect and use all available processors.
A data.frame containing only the sequences that meet the specifications for being chimeric. The chimera column contains information on the chimeric region and to which group it belongs. The row.names of the data.frame correspond to those of the sequences in the dbFile.
FindChimeras works by finding suspect fragments that are uncommon in the group where the sequence belongs, but very common in another group where the sequence does not belong. Each sequence in the dbFile is tiled into short sequence segments called fragments. If the fragments are infrequent in their respective group in the dbFileReference then they are considered suspect. If enough suspect fragments from a sequence meet the specified constraints then the sequence is flagged as a chimera.
The default parameters are optimized for full-length 16S sequences (> 1,000 nucleotides). Shorter 16S sequences require two parameters that differ from the defaults: minLength = 40 and minSuspectFragments = 2.
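The fragment idea described above can be illustrated with a small base R sketch. This is a toy model only and not the DECIPHER implementation: the helper names `tile_fragments` and `fragment_frequency`, the made-up reference sequences, and the strict thresholds (absent from the sequence's own group, present in every sequence of the other group) are all assumptions chosen for illustration.

```r
# Toy illustration only: NOT the DECIPHER implementation.
# Tile a sequence into overlapping fragments of a fixed width.
tile_fragments <- function(x, width = 5) {
  n <- nchar(x) - width + 1
  if (n < 1) return(character(0))
  substring(x, seq_len(n), seq_len(n) + width - 1)
}

# Fraction of a group's reference sequences that contain each fragment.
fragment_frequency <- function(fragments, group_seqs) {
  vapply(fragments, function(f) {
    mean(grepl(f, group_seqs, fixed = TRUE))
  }, numeric(1), USE.NAMES = FALSE)
}

# Hypothetical reference groups (made-up sequences).
reference <- list(
  groupA = c("AAAAACCCCCAAAAA", "AAAAACCCCCAAAAT"),
  groupB = c("GGGGGTTTTTGGGGG", "GGGGGTTTTTGGGGC")
)

# A query assigned to groupA whose second half resembles groupB.
query <- "AAAAACCCTTTTTGGGGG"
frags <- tile_fragments(query, width = 5)

freq_own   <- fragment_frequency(frags, reference$groupA)
freq_other <- fragment_frequency(frags, reference$groupB)

# Simplified rule: a fragment is suspect when it is absent from the query's
# own group but present in every sequence of the other group.
suspect <- freq_own == 0 & freq_other == 1
frags[suspect]
```

In this toy, the fragments covering the second half of the query are the ones marked suspect, which mirrors how a run of suspect fragments points to a chimeric region.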
Groups are determined by the identifier present in each database. For this reason, the groups in the dbFile should exist in the groups of the dbFileReference. The reference database is assumed to contain many sequences of only good quality.
If a reference database is not available, one can be created by using the input database as its own reference. Removing the detected chimeras from this reference and then repeating the process iteratively can result in a clean reference database.
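The iterative procedure can be sketched as a simple loop that stops once nothing more is flagged. This is a minimal sketch under stated assumptions: `clean_reference`, `detect`, and `toy_detect` are hypothetical names, and `detect` is a stand-in for a chimera detector rather than a DECIPHER function. How flagged sequences are actually removed from a database is not covered here.

```r
# Sketch of iterative self-referencing cleanup. `detect` is a hypothetical
# stand-in for a chimera detector; it is not a DECIPHER function.
clean_reference <- function(ids, detect, max_iter = 10) {
  for (i in seq_len(max_iter)) {
    # Use the current set as both the query and the reference.
    flagged <- detect(query = ids, reference = ids)
    if (length(flagged) == 0) break  # converged: nothing left to remove
    ids <- setdiff(ids, flagged)
  }
  ids
}

# Toy detector: flags the largest id while more than three ids remain.
toy_detect <- function(query, reference) {
  if (length(reference) > 3) max(query) else integer(0)
}

cleaned <- clean_reference(1:6, toy_detect)
cleaned
```

The `max_iter` cap is a design choice to guarantee termination in case a detector keeps flagging sequences on every pass.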
For non-16S sequences it may be necessary to optimize the parameters for the particular sequences. The simplest way to perform an optimization is to experiment with different input parameters on artificial chimeras such as those created using CreateChimeras. Adjusting input parameters until the maximum number of artificial chimeras are identified is the easiest way to determine new defaults.
CreateChimeras, Add2DB
db <- system.file("extdata", "Bacteria_175seqs.sqlite", package="DECIPHER")
# It is necessary to set dbFileReference to the file path of the
# 16S reference database available from DECIPHER.cee.wisc.edu
chimeras <- FindChimeras(db, dbFileReference=db)