words.pos: Positions of possibly degenerate motifs within sequences

Description

words.pos searches for all occurrences of the motif pattern within the sequence text and returns their positions. The function is based on regexpr, so complex motifs can be expressed as regular expressions. The main difference from gregexpr is that overlapping (non-disjoint) matches are also reported.
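To make the idea of overlapping matches concrete, here is a minimal sketch in base R. The helper name `overlapping_pos` is hypothetical and is not part of seqinr; it only illustrates one way such a search could be written, by calling regexpr repeatedly and restarting one character after each hit. It is not presented as the package's actual implementation.

```r
# Hypothetical helper (not part of seqinr): report every start position of
# `pattern` in a single string `text`, including overlapping occurrences.
overlapping_pos <- function(pattern, text, ...) {
  positions <- integer(0)
  offset <- 0L                      # number of characters already skipped
  repeat {
    # Search only the part of the text that starts after the previous hit
    hit <- regexpr(pattern, substring(text, offset + 1L), ...)[1]
    if (hit == -1L) break           # no further match
    offset <- offset + hit          # convert back to a position in `text`
    positions <- c(positions, offset)
  }
  positions
}

overlapping_pos("toto", "totototo")
```

Because each search runs on a suffix of the text, patterns that depend on anchors such as `^` would behave differently in this sketch than in a single pass over the whole string.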

Usage

words.pos(pattern, text, ignore.case = FALSE,
                      perl = TRUE, fixed = FALSE, useBytes = TRUE, ...)

Arguments

pattern

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector.

text

a character vector where matches are sought.

ignore.case

if FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.

perl

logical. Should perl-compatible regexps be used if available? Has priority over extended.

fixed

logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.

useBytes

logical. If TRUE the matching is done byte-by-byte rather than character-by-character.

...

arguments passed to regexpr.

Value

a vector of the start positions at which the motif pattern was found in the sequence text.
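As a small illustration of how such a position vector might be used, the sketch below extracts the matched words with base R's substring. The positions are written out by hand from the YpR example documented in the Examples section rather than computed, and `motif_width` is an assumed name for the length of the motif.

```r
myseq <- "tatagaga"

# Start positions documented for words.pos("[ct][ag]", myseq)
pos <- c(1, 3)

# Assumed name: width of the motif in characters (a YpR dinucleotide)
motif_width <- 2

# Extract the word found at each start position
hits <- substring(myseq, pos, pos + motif_width - 1)
hits
```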

Details

Default parameter values have been tuned for speed when working with biological sequences.

References

citation("seqinr")

Examples

myseq <- "tatagaga"
words.pos("t", myseq)   # Should be 1 3
words.pos("tag", myseq) # Should be 3
words.pos("ga", myseq)  # Should be 5 7
# How to specify an ambiguous base? Look for YpR motifs with character classes:
words.pos("[ct][ag]", myseq) # Should be 1 3
#
# Show the difference with gregexpr:
#
words.pos("toto", "totototo")           # 1 3 5 (three overlapping matches)
unlist(gregexpr("toto",  "totototo")) # 1 5    (two disjoint matches)
