CorrectFrameshifts: Corrects Frameshift Errors In Protein Coding Sequences

Description

Corrects the reading frame to mitigate the impact of frameshift errors caused by insertions or deletions in unaligned nucleotide sequences.

Usage

CorrectFrameshifts(myXStringSet, myAAStringSet, type = "indels", acceptDistance = 0.01, rejectDistance = 0.60, maxComparisons = 10, gapOpening = -13, gapExtension = -1, frameShift = -15, geneticCode = GENETIC_CODE, substitutionMatrix = "MIQS", verbose = TRUE, processors = 1)

Arguments

myXStringSet

A DNAStringSet or RNAStringSet of unaligned protein coding sequences in 5' to 3' orientation.

myAAStringSet

An AAStringSet of reference sequences having accurate translations.

type

Character string indicating the type of result desired. This should be (an abbreviation of) one of "indels", "sequences", or "both". (See details section below.)

acceptDistance

Numeric giving the maximum distance from a reference sequence that is acceptable to skip any remaining comparisons.

rejectDistance

Numeric giving the maximum distance from a reference sequence that is allowed when correcting frameshifts. Sequences in myXStringSet that are greater than rejectDistance from the nearest reference sequence will only have their length trimmed from the 3'-end to a multiple of three nucleotides without any frameshift correction.

maxComparisons

The number of reference comparisons to make before stopping the search for a closer reference sequence.

gapOpening

Numeric giving the cost for opening a gap between the query and reference sequences.

gapExtension

Numeric giving the cost for extending an open gap between the query and reference sequences.

frameShift

Numeric giving the cost for shifting between frames of the query sequence.

geneticCode

Named character vector in the same format as GENETIC_CODE (the default), which represents the standard genetic code.

substitutionMatrix

Either a substitution matrix representing the substitution scores for matching two amino acids or the name of the amino acid substitution matrix. The latter may be one of the following: ``BLOSUM45'', ``BLOSUM50'', ``BLOSUM62'', ``BLOSUM80'', ``BLOSUM100'', ``PAM30'', ``PAM40'', ``PAM70'', ``PAM120'', ``PAM250'', or ``MIQS'' (the default).

verbose

Logical indicating whether to display progress.

processors

The number of processors to use, or NULL to automatically detect and use all available processors.

Value

insertions: Approximate positions of inserted nucleotides, which could be removed to correct the reading frame, or excess nucleotides at the 3'-end that make the length longer than a multiple of three.
deletions: Approximate positions of deleted nucleotides, which could be added back to correct the reading frame.
distance: The amino acid distance from the nearest reference sequence, between 0 and 1.
index: The integer index of the reference sequence that was used for frame correction, or 0 if no reference sequence was within rejectDistance.

Details

Accurate translation of protein coding sequences can be greatly disrupted by one or two nucleotide phase shifts that occasionally occur during DNA sequencing. These frameshift errors can potentially be corrected through comparison with other unshifted protein sequences. This function uses a set of reference amino acid sequences (AAStringSet) to find and correct frameshift errors in a set of nucleotide sequences (myXStringSet). First, three frame translation of the nucleotide sequences is performed, and the nearest reference sequence is selected. Then the optimal reading frame at each position is determined based on a variation of the Guan & Uberbacher method. Putative insertions and/or deletions (indels) are returned in the result, typically with close proximity to the true indel locations.

If type is "sequences" or "both", then frameshifts are corrected by adding N's and/or removing nucleotides. Note that this changes the nucleotide sequence, and the new sequence often has minor errors because the exact location of the indel(s) cannot be determined. However, the original frameshifts that disrupted the entire downstream sequence are reduced to local perturbations. All of the returned nucleotide sequences will have a reading frame starting from the first position. This allows direct translation, and in practice works well if there is a similar reference myAAStringSet with the correct reading frame. Hence it is more important that myAAStringSet contain a wide variety of sequences than it is that it contain a lot of sequences.

Multiple inputs control the time required for frameshift correction. The number of sequences in the reference set (myAAStringSet) will affect the speed of the first search for similar sequences. Assessing frameshifts in the second step requires order N*M time, where N and M are the lengths of the query (myXStringSet) and reference sequences. Two parameters control the number of assessments that are made for each sequence: (1) maxComparisons determines the maximum number of reference sequences to compare to each query sequence, and (2) acceptDist defines the maximum distance between a query and reference that is acceptable before continuing to the next query sequence. A lower value for maxComparisons or a higher value for acceptDist will accelerate frameshift correction, potentially at the expense of some accuracy.

References

Guan, X., & Uberbacher, E. C. (1996). Alignments of DNA and protein sequences containing frameshift errors. Computer Applications in the Biosciences : CABIOS, 12(1), 31-40.

Examples

Run this code

fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)

# introduce artificial indels
n_ins <- 2 # insertions per sequence
shifted <- replaceAt(dna,
	lapply(width(dna),
		sample,
		n_ins),
	sample(DNA_BASES,
		n_ins,
		replace=TRUE))
n_dels <- 1 # deletions per sequence
shifted <- replaceAt(shifted,
	RangesList(lapply(width(shifted),
		function(x) {
			IRanges(sample(x,
					n_dels),
				width=1)
		})))

# to make frameshift correction more challenging,
# only supply 20 reference amino acid sequences
s <- sample(length(dna), 20)
x <- CorrectFrameshifts(shifted,
	translate(dna[s]),
	type="both")

# there was a wide range of distances
# to the nearest reference sequence
quantile(unlist(lapply(x[[1]], `[`, "distance")))

# none of the sequences were > rejectDistance
# from the nearest reference sequence
length(which(unlist(lapply(x[[1]], `[`, "index"))==0))

# the number of indels was generally correct
table(unlist(lapply(x[[1]], function(x) {
	length(x$insertions)})))/length(shifted)
table(unlist(lapply(x[[1]], function(x) {
	length(x$deletions)})))/length(shifted)

# align and display the translations
AA <- AlignTranslation(x$sequences,
	readingFrame=1,
	asAAStringSet=TRUE)
BrowseSeqs(AA)

Run the code above in your browser using DataLab