Learn R Programming

DECIPHER (version 2.0.2)

CorrectFrameshifts: Corrects Frameshift Errors In Protein Coding Sequences

Description

Corrects the reading frame to mitigate the impact of frameshift errors caused by insertions or deletions in unaligned nucleotide sequences.

Usage

CorrectFrameshifts(myXStringSet, myAAStringSet, type = "indels", acceptDistance = 0.01, rejectDistance = 0.60, maxComparisons = 10, gapOpening = -13, gapExtension = -1, frameShift = -15, geneticCode = GENETIC_CODE, substitutionMatrix = "MIQS", verbose = TRUE, processors = 1)

Arguments

myXStringSet
A DNAStringSet or RNAStringSet of unaligned protein coding sequences in 5' to 3' orientation.
myAAStringSet
An AAStringSet of reference sequences having accurate translations.
type
Character string indicating the type of result desired. This should be (an abbreviation of) one of "indels", "sequences", or "both". (See details section below.)
acceptDistance
Numeric giving the maximum distance from a reference sequence that is acceptable to skip any remaining comparisons.
rejectDistance
Numeric giving the maximum distance from a reference sequence that is allowed when correcting frameshifts. Sequences in myXStringSet that are greater than rejectDistance from the nearest reference sequence will only have their length trimmed from the 3'-end to a multiple of three nucleotides without any frameshift correction.
maxComparisons
The number of reference comparisons to make before stopping the search for a closer reference sequence.
gapOpening
Numeric giving the cost for opening a gap between the query and reference sequences.
gapExtension
Numeric giving the cost for extending an open gap between the query and reference sequences.
frameShift
Numeric giving the cost for shifting between frames of the query sequence.
geneticCode
Named character vector in the same format as GENETIC_CODE (the default), which represents the standard genetic code.
substitutionMatrix
Either a substitution matrix representing the substitution scores for matching two amino acids or the name of the amino acid substitution matrix. The latter may be one of the following: ``BLOSUM45'', ``BLOSUM50'', ``BLOSUM62'', ``BLOSUM80'', ``BLOSUM100'', ``PAM30'', ``PAM40'', ``PAM70'', ``PAM120'', ``PAM250'', or ``MIQS'' (the default).
verbose
Logical indicating whether to display progress.
processors
The number of processors to use, or NULL to automatically detect and use all available processors.

Value

If type is "indels" then the returned object is a list with the same length as myXStringSet. Each element is a list with four components:
insertions
Approximate positions of inserted nucleotides, which could be removed to correct the reading frame, or excess nucleotides at the 3'-end that make the length longer than a multiple of three.
deletions
Approximate positions of deleted nucleotides, which could be added back to correct the reading frame.
distance
The amino acid distance from the nearest reference sequence, between 0 and 1.
index
The integer index of the reference sequence that was used for frame correction, or 0 if no reference sequence was within rejectDistance.
Note that positions in insertions and deletions are sometimes repeated to indicate that the same position needs to be shifted successively more than once to correct the reading frame.If type is "sequences" then the returned object is an XStringSet of the same type as the input (myXStringSet). Nucleotides are added or deleted as necessary to correct for frameshifts. The returned sequences all have a reading frame starting from position 1, so that they can be translated directly.If type is "both" then the returned object is a list with two components: one for the "indels" and the other for the "sequences".

Details

Accurate translation of protein coding sequences can be greatly disrupted by one or two nucleotide phase shifts that occasionally occur during DNA sequencing. These frameshift errors can potentially be corrected through comparison with other unshifted protein sequences. This function uses a set of reference amino acid sequences (AAStringSet) to find and correct frameshift errors in a set of nucleotide sequences (myXStringSet). First, three frame translation of the nucleotide sequences is performed, and the nearest reference sequence is selected. Then the optimal reading frame at each position is determined based on a variation of the Guan & Uberbacher method. Putative insertions and/or deletions (indels) are returned in the result, typically with close proximity to the true indel locations.

If type is "sequences" or "both", then frameshifts are corrected by adding N's and/or removing nucleotides. Note that this changes the nucleotide sequence, and the new sequence often has minor errors because the exact location of the indel(s) cannot be determined. However, the original frameshifts that disrupted the entire downstream sequence are reduced to local perturbations. All of the returned nucleotide sequences will have a reading frame starting from the first position. This allows direct translation, and in practice works well if there is a similar reference myAAStringSet with the correct reading frame. Hence it is more important that myAAStringSet contain a wide variety of sequences than it is that it contain a lot of sequences.

Multiple inputs control the time required for frameshift correction. The number of sequences in the reference set (myAAStringSet) will affect the speed of the first search for similar sequences. Assessing frameshifts in the second step requires order N*M time, where N and M are the lengths of the query (myXStringSet) and reference sequences. Two parameters control the number of assessments that are made for each sequence: (1) maxComparisons determines the maximum number of reference sequences to compare to each query sequence, and (2) acceptDist defines the maximum distance between a query and reference that is acceptable before continuing to the next query sequence. A lower value for maxComparisons or a higher value for acceptDist will accelerate frameshift correction, potentially at the expense of some accuracy.

References

Guan, X., & Uberbacher, E. C. (1996). Alignments of DNA and protein sequences containing frameshift errors. Computer Applications in the Biosciences : CABIOS, 12(1), 31-40.

See Also

AlignTranslation, OrientNucleotides

Examples

Run this code
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)

# introduce artificial indels
n_ins <- 2 # insertions per sequence
shifted <- replaceAt(dna,
	lapply(width(dna),
		sample,
		n_ins),
	sample(DNA_BASES,
		n_ins,
		replace=TRUE))
n_dels <- 1 # deletions per sequence
shifted <- replaceAt(shifted,
	RangesList(lapply(width(shifted),
		function(x) {
			IRanges(sample(x,
					n_dels),
				width=1)
		})))

# to make frameshift correction more challenging,
# only supply 20 reference amino acid sequences
s <- sample(length(dna), 20)
x <- CorrectFrameshifts(shifted,
	translate(dna[s]),
	type="both")

# there was a wide range of distances
# to the nearest reference sequence
quantile(unlist(lapply(x[[1]], `[`, "distance")))

# none of the sequences were > rejectDistance
# from the nearest reference sequence
length(which(unlist(lapply(x[[1]], `[`, "index"))==0))

# the number of indels was generally correct
table(unlist(lapply(x[[1]], function(x) {
	length(x$insertions)})))/length(shifted)
table(unlist(lapply(x[[1]], function(x) {
	length(x$deletions)})))/length(shifted)

# align and display the translations
AA <- AlignTranslation(x$sequences,
	readingFrame=1,
	asAAStringSet=TRUE)
BrowseSeqs(AA)

Run the code above in your browser using DataLab