triplex.search: Search intramolecular triplex-forming sequences in DNA

Description

The triplex.search function identifies potential intramolecular triplex-forming sequences in DNA.

Usage

triplex.search( dna,  type        = 0:7, min_score   = 15, p_value     = 0.05, min_len     = 6, max_len     = 25, min_loop    = 3, max_loop    = 10, seq_type    = 'eukaryotic', score_table = 'default', group_table = 'default', lambda_par  = 'default', lambda_apar = 'default', mu_par      = 'default', mu_apar     = 'default', rn_par      = 'default', rn_apar     = 'default', dtwist_pen  = 'default', ins_pen     = 'default', iso_pen     = 'default', iso_bonus   = 'default', mis_pen     = 'default')

Arguments

dna

A DNAString object.

type

Vector of triplex types (0..7) to be searched for.

min_score

Minimal score treshold.

p_value

Acceptable P-value.

min_len

Minimal triplex length.

max_len

Maximal triplex length.

min_loop

Minimal triplex loop length. Can not be lower than one.

max_loop

Maximal triplex loop length.

seq_type

Type of input sequence. Possible options: prokaryotic, eukaryotic.

score_table

Scoring table for parallel and antiparallel triplex types. Default is the same as triplex.score.table output. Before changing this option, please read triplex.score.table help carefully.

group_table

Isomorphic group table for parallel and antiparallel triplex types. Default is the same as triplex.group.table output. Before changing this option, please read triplex.group.table help carefully.

lambda_par

Lambda for parallel triplex types 0,1,2,3. Default for prokaryotic sequence is 0.8892, for eukaryotic 0.8433.

lambda_apar

Lambda for antiparallel triplex types 4,5,6,7. Default for prokaryotic sequence is 0.8092, for eukaryotic 0.6910.

mu_par

Mu for parallel triplex types 0,1,2,3. Default for prokaryotic sequence is 7.4805, for eukaryotic 0.8433.

mu_apar

Mu for antiparallel triplex types 4,5,6,7. Default for prokaryotic sequence is 7.6569, for eukaryotic 7.9611.

rn_par

Hit ratio (reported hits to sequence length) for parallel triplex types 0,1,2,3. Default for prokaryotic sequence is 0.0406, for eukaryotic 0.0304.

rn_apar

Hit ratio (reported hits to sequence length) for antiparallel triplex types 4,5,6,7. Default for prokaryotic sequence is 0.0273, for eukaryotic 0.0405.

dtwist_pen

Dtwist penalization, default is 7.

ins_pen

Insertion penalization, default is 9.

iso_pen

Isomorphic group change penalization, default is 5.

iso_bonus

Isomorphic group stay bonus, default is 0.

mis_pen

Mismatch penalization, default is 7.

Value

Instance of TriplexViews object based on XStringViews class.

Details

The triplex.search function identifies potential intramolecular triplex-forming sequences in DNA sequence represented as a DNAString object.

Based on triplex position (forward or reverse strand) and its third strand orientation, up to 8 types of triplexes are distinguished by the function (see the following figure). By default, the function detects all 8 types, however this behavior can be changed by setting the type parameter to any value or a subset of values in the range 0 to 7.

types.pngFigure 1: Triplex types

Detected triplexes are returned as instances of the TriplexViews class, which represents the basic container for storing a set of views on the same input sequence similarly to the XStringViews object (in fact TriplexViews only extends the XStringViews class with a number of displayed columns). Each triplex view is defined by start and end locations, width, score, P-value, number of insertions, type, strand, loop start and loop end. Please note, that the strand orientation depends on triplex type only. The triplex.search function assumes that the input DNA sequence represents the forward strand.

Basic requirements for the shape or length of detected triplexes can be defined using four parameters: min_len, max_len, min_loop and max_loop. While min_len and max_len specify the length of the triplex stem composed of individual triplets, min_loop and max_loop parameters define the range of lengths for the unpaired loop at the top of the triplex. A graphical representation of these parameters is shown in the following figure. Please note, these parameters also impact the overall computation time. For longer triplexes, larger space has to be explored and thus more computation time is consumed.

triplex.pngFigure 2: Triplex scheme

The quality of each triplex is represented by its score value. A higher score value represents a higher-quality triplex. This quality is decreased by several types of imperfections at the level of triplets, such as character mismatch, insertion, deletion, isomorphic group change etc. Penalization constants for these imperfections can be setup using the following parameters: mis_pen, ins_pen, iso_pen, iso_bonus and dtwist_pen. Detailed information about the scoring function and penalization parameters can be found in (Lexa et al., 2011). It is highly recommended to see (Lexa et al., 2011) prior to changing any penalization parameters.

The triplex.search function can output a large list containing tens of thousands of potential triplexes. The size of these results can be reduced using two filtration mechanisms: (1) by specifying the minimal acceptable score value using the min_score parameter or (2) by specifying the maximum acceptable P-value of results using the p_value parameter. The P-value represents the probability of occurrence of detected triplexes in random sequence. By default, only triplexes with P-value equal or less than 0.05 are reported. Calculation of P-value depends on two extreme value distribution parameters lambda and mi. By default, these parameters are set up for searching in human genome sequences. It is highly recommended to see (Lexa et al., 2011) prior to changing either of the lambda and mi parameters.

References

Lexa, M., Martinek, T., Burgetova, I., Kopecek, D., Brazdova, M.: A dynamic programming algorithm for identification of triplex-forming sequences, In: Bioinformatics, Vol. 27, No. 18, 2011, Oxford, GB, p. 2510-2517, ISSN 1367-4803

Examples

Run this code

# GAA triplet repeats involved in Friedreichs's ataxia
seq <- DNAString("GAAGAAGAAGAAGAAGAAGAAGAAGAAGAA")

# Search specific triplex types (see details section)
triplex.search(seq, type=c(2,3), min_score=10, p_value=1)

# Search all triplex types
t <- triplex.search(seq, min_score=10, p_value=1)

# Sort triplexes by score
t[order(score(t), decreasing=TRUE)]

Run the code above in your browser using DataLab