indel_check: Check if a coi5p sequence likely contains an error.

Description

Check if a coi5p sequence likely contains an error.

Usage

indel_check(x, ...)
# S3 method for coi5p
indel_check(x, ..., indel_threshold = -358.88, aa_PHMM = aa_coi_PHMM)

Value

An object of class "coi5p"

Arguments

x: A coi5p class object for which frame() and translate() have been run.
...: Additional arguments to be passed between methods.
indel_threshold: The log likelihood threshold used to assess whether or not sequences are likely to contain an indel. Default is -358.88. Values lower than this will be classified as likely to contain an indel and values higher will be classified as not likely to contain an indel.
aa_PHMM: The profile hidden Markov model against which the translated amino acid sequence should be compared. Default is the full COI-5P amino acid PHMM (aa_coi_PHMM).

Details

The indel check function analyzes the framed and translated DNA sequence in two ways so that users can make an informed decision about whether or not the sequence contains a frameshift error. The test is designed to detect insertion or deletion errors that result from technical errors in DNA sequencing, but in some instances it can also identify biological contaminants (e.g. if the contaminant sequence uses a different genetic code than the target, or if the contaminants are sequences such as pseudogenes that are highly divergent from animal COI-5P sequences).

The two tests performed are: (1) a query for stop codons in the amino acid sequence and (2) an evaluation of the log likelihood value resulting from the comparison of the framed coi5p amino acid sequence against the COI-5P amino acid PHMM. The default likelihood threshold for identifying a sequence as likely erroneous is -358.88. Sequences with likelihood values lower than this will receive an indel_likely value of TRUE. The threshold of -358.88 was experimentally determined to be the optimal likelihood threshold for separating full-length sequences with and without errors when the censored translation option is used. Sequences will have higher likelihood values when a specific genetic code is used. Sequences will have lower likelihood values when they are not complete barcode sequences (i.e. < 500 bp in length). For these reasons the likelihood threshold is not a fixed value but a parameter that can be altered based on the type of translation and the length of the sequences. Below are experimentally determined suggested values for different sequence length and translation table combinations.

Short barcode sequences, known genetic code: indel_threshold = -354.44

Short barcode sequences, unknown genetic code: indel_threshold = -440.24

Full length barcode sequences, known genetic code: indel_threshold = -246.20

Full length barcode sequences, unknown genetic code: indel_threshold = -358.88

Source: Nugent et al. 2019 (doi: https://doi.org/10.1101/2019.12.12.865014).
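
The suggested values above can be wrapped in a small lookup so that the threshold is chosen consistently across a batch of sequences. The following is a minimal sketch, not part of the package: the function name suggest_indel_threshold and its arguments are hypothetical, and the 500 bp cutoff between "short" and "full length" is an assumption taken from the "< 500 bp" remark in the paragraph above.

```r
# Hypothetical helper (not part of the package): return the suggested
# indel_threshold for a given sequence length and translation type.
# Assumption: sequences of at least `full_length_min` bp count as full length.
suggest_indel_threshold <- function(seq_length, code_known, full_length_min = 500) {
  stopifnot(is.numeric(seq_length), length(seq_length) == 1,
            is.logical(code_known), length(code_known) == 1)

  is_full <- seq_length >= full_length_min

  if (is_full && code_known) {
    -246.20   # full length, known genetic code
  } else if (is_full) {
    -358.88   # full length, unknown genetic code (package default)
  } else if (code_known) {
    -354.44   # short, known genetic code
  } else {
    -440.24   # short, unknown genetic code
  }
}

# Example calls: a 658 bp barcode translated with the censored table,
# and a 420 bp fragment translated with a known genetic code.
suggest_indel_threshold(658, code_known = FALSE)
suggest_indel_threshold(420, code_known = TRUE)
```

The returned value can then be supplied through the indel_threshold argument of indel_check().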

Examples

library(coil)
#previously run functions:
dat = coi5p(example_nt_string)
dat = frame(dat)
dat = translate(dat)
#current function
dat = indel_check(dat)
#with custom indel threshold
dat = indel_check(dat, indel_threshold = -400)
#additional components in output coi5p object:
dat$stop_codons #Boolean - Indicates if there are stop codons in the amino acid sequence.
dat$indel_likely #Boolean - Indicates if the likelihood score is below the specified indel_threshold.
dat$aaScore #view the amino acid log likelihood score
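
Because the two tests are reported separately, a user may want to combine them into a single flag. The sketch below is one possible way to do that, assuming a sequence should be treated as suspect when either test fails; the helper name likely_erroneous is hypothetical and not part of the package. It only reads the stop_codons and indel_likely components shown above, so it is demonstrated here on plain lists.

```r
# Hypothetical helper (not part of the package): flag a sequence when
# either the stop codon test or the likelihood test indicates a problem.
likely_erroneous <- function(x) {
  isTRUE(x$stop_codons) || isTRUE(x$indel_likely)
}

# Stand-in lists with the same component names as the coi5p output
clean   <- list(stop_codons = FALSE, indel_likely = FALSE)
suspect <- list(stop_codons = FALSE, indel_likely = TRUE)

likely_erroneous(clean)
likely_erroneous(suspect)
```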