
hasLetterAt checks whether a sequence or set of sequences has the specified letters at the specified positions.
neditAt, isMatchingAt and which.isMatchingAt are low-level matching functions that only look for matches at the specified positions in the subject.
hasLetterAt(x, letter, at, fixed=TRUE)
## neditAt() and related utils:
neditAt(pattern, subject, at=1, with.indels=FALSE, fixed=TRUE)
neditStartingAt(pattern, subject, starting.at=1, with.indels=FALSE, fixed=TRUE)
neditEndingAt(pattern, subject, ending.at=1, with.indels=FALSE, fixed=TRUE)
## isMatchingAt() and related utils:
isMatchingAt(pattern, subject, at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingStartingAt(pattern, subject, starting.at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingEndingAt(pattern, subject, ending.at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE)
## which.isMatchingAt() and related utils:
which.isMatchingAt(pattern, subject, at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE, follow.index=FALSE, auto.reduce.pattern=FALSE)
which.isMatchingStartingAt(pattern, subject, starting.at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE, follow.index=FALSE, auto.reduce.pattern=FALSE)
which.isMatchingEndingAt(pattern, subject, ending.at=1, max.mismatch=0, min.mismatch=0, with.indels=FALSE, fixed=TRUE, follow.index=FALSE, auto.reduce.pattern=FALSE)
at, starting.at, ending.at: An integer vector specifying the starting (for starting.at and at) or ending (for ending.at) positions of the pattern relative to the subject. With auto.reduce.pattern (below), either a single integer or a constant vector of length nchar(pattern) (below), to which the former is immediately converted. For the hasLetterAt function, letter and at must have the same length.
pattern: The pattern string (but see auto.reduce.pattern, below).
max.mismatch, min.mismatch: Integer vectors of length >= 1, recycled to the length of the at (or starting.at, or ending.at) argument. More details below.
fixed: Only with a DNAString or RNAString-based subject can a fixed value other than the default (TRUE) be used. If TRUE (the default), an IUPAC ambiguity code in the pattern can only match the same code in the subject, and vice versa. If FALSE, an IUPAC ambiguity code in the pattern can match any letter in the subject that is associated with the code, and vice versa. See IUPAC_CODE_MAP for more information about the IUPAC Extended Genetic Alphabet.
fixed can also be a character vector, a subset of c("pattern", "subject"). fixed=c("pattern", "subject") is equivalent to fixed=TRUE (the default). An empty vector is equivalent to fixed=FALSE. With fixed="subject", only ambiguities in the pattern are interpreted as wildcards. With fixed="pattern", only ambiguities in the subject are interpreted as wildcards.
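The forms of the fixed argument described above can be compared side by side. The following is a minimal sketch that assumes the Biostrings package is installed; the subject, pattern and variable names are chosen for illustration only.

```r
library(Biostrings)

subject <- DNAString("GTATA")
at <- 1:4

# fixed=TRUE and fixed=c("pattern", "subject") are documented as equivalent:
# the N in the pattern can only match a literal N in the subject.
m_true <- isMatchingAt("NT", subject, at=at, fixed=TRUE)
m_both <- isMatchingAt("NT", subject, at=at, fixed=c("pattern", "subject"))

# fixed="subject": only ambiguities in the pattern act as wildcards.
m_subject <- isMatchingAt("NT", subject, at=at, fixed="subject")

# fixed="pattern": only ambiguities in the subject act as wildcards.
# This subject contains none, so no match is expected for the literal N.
m_pattern <- isMatchingAt("NT", subject, at=at, fixed="pattern")

# fixed=FALSE: ambiguities on both sides act as wildcards.
m_false <- isMatchingAt("NT", subject, at=at, fixed=FALSE)
```

Because this subject has no ambiguity codes, fixed="subject" and fixed=FALSE should give the same result here; the two only differ when the subject itself contains ambiguity codes.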
follow.index: Whether the single integer returned by which.isMatchingAt (and related utils) should be the first *value* in at for which a match occurred, or its *index* in at (the default).
auto.reduce.pattern: Whether pattern should be effectively shortened by 1 letter, from its beginning for which.isMatchingStartingAt and from its end for which.isMatchingEndingAt, for each successive (at, max.mismatch) "pair".
hasLetterAt: A logical matrix with one row per element in x and one column per letter/position to check. When a specified position is invalid with respect to an element in x, the corresponding matrix element is set to NA.

neditAt: If subject is an XString object, then return an integer vector of the same length as at. If subject is an XStringSet object, then return the integer matrix with length(at) rows and length(subject) columns defined by:
    sapply(unname(subject), function(x) neditAt(pattern, x, ...))
neditStartingAt is identical to neditAt except that the at argument is now called starting.at. neditEndingAt is similar to neditAt except that the at argument is now called ending.at and must contain the ending positions of the pattern relative to the subject.

isMatchingAt: If subject is an XString object, then return the logical vector defined by:
    min.mismatch <= neditAt(...) <= max.mismatch
If subject is an XStringSet object, then return the logical matrix with length(at) rows and length(subject) columns defined by:
    sapply(unname(subject), function(x) isMatchingAt(pattern, x, ...))
isMatchingStartingAt is identical to isMatchingAt except that the at argument is now called starting.at. isMatchingEndingAt is similar to isMatchingAt except that the at argument is now called ending.at and must contain the ending positions of the pattern relative to the subject.

which.isMatchingAt: The default behavior (follow.index=FALSE) is as follows. If subject is an XString object, then return the single integer defined by:
    which(isMatchingAt(...))[1]
If subject is an XStringSet object, then return the integer vector defined by:
    sapply(unname(subject), function(x) which.isMatchingAt(pattern, x, ...))
If follow.index=TRUE, then the returned value is defined by:
    at[which.isMatchingAt(..., follow.index=FALSE)]
which.isMatchingStartingAt is identical to which.isMatchingAt except that the at argument is now called starting.at. which.isMatchingEndingAt is similar to which.isMatchingAt except that the at argument is now called ending.at and must contain the ending positions of the pattern relative to the subject.
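The return values described above can be mimicked in base R on plain character strings. This is only a rough sketch: has_letter_at_sketch and the hard-coded nedit vector are hypothetical stand-ins, not Biostrings code, and IUPAC handling (the fixed argument) is ignored. Note that the chained comparison used in the definition of isMatchingAt is notation; in R it has to be written as two comparisons joined by &.

```r
# Hypothetical base-R stand-ins for illustration; not Biostrings code.

# hasLetterAt(): one row per element of 'x', one column per letter/position,
# NA where the position falls outside the element.
has_letter_at_sketch <- function(x, letter, at) {
  wanted <- strsplit(letter, "")[[1]]
  stopifnot(length(wanted) == length(at))
  cols <- lapply(seq_along(at), function(j) {
    valid <- at[j] >= 1 & at[j] <= nchar(x)
    ifelse(valid, substr(x, at[j], at[j]) == wanted[j], NA)
  })
  matrix(unlist(cols), nrow = length(x))
}

x <- c("AAACGT", "AACGT", "ACGT", "TAGGA")
q0 <- has_letter_at_sketch(x, "AAAAAA", 1:6)   # contains NAs for short elements
q1 <- has_letter_at_sketch(x, "AG", c(2, 4))
which(rowSums(q1) == 2)                        # elements with A at 2 and G at 4

# Hypothetical edit counts, such as neditAt() would return for at = 0:5.
at <- 0:5
nedit <- c(2L, 1L, 2L, 0L, 2L, 1L)
min.mismatch <- 0L
max.mismatch <- 1L

# isMatchingAt(): one logical per position in 'at'.
is_matching <- min.mismatch <= nedit & nedit <= max.mismatch

# which.isMatchingAt(), follow.index=FALSE: index in 'at' of the first match
# (NA when nothing matches).
first_index <- which(is_matching)[1]

# which.isMatchingAt(), follow.index=TRUE: the value in 'at' at that index.
first_value <- at[first_index]
```

With these inputs the first match is at index 2 of at, which corresponds to the position value 1.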
The neditAt function implements these 2 distances. If with.indels is FALSE (the default), then the first distance is used, i.e. neditAt returns the "number of mismatching letters" between the pattern P and the substring S' of S starting at the positions specified in at (note that neditAt is vectorized, so a long vector of integers can be passed through the at argument).
If with.indels is TRUE, then the "edit distance" is used: for each position specified in at, P is compared to all the substrings S' of S starting at this position and the smallest distance is returned. Note that this distance is guaranteed to be reached for a substring of length < 2*length(P) so, in practice, P only needs to be compared to a small number of substrings for every starting position.
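The two distances can be sketched in base R on plain character strings. The helpers count_mismatches_at and min_edit_distance_at are hypothetical names introduced here for illustration; they are not how Biostrings computes these values. The sketch assumes that window positions falling outside the subject count as mismatches, and it relies on utils::adist() for the edit distance.

```r
# Hypothetical helpers for illustration only; not the Biostrings implementation.

# Distance 1: number of mismatching letters between the pattern and the
# same-length window of the subject starting at each position in 'at'.
# Assumption: window positions outside the subject count as mismatches.
count_mismatches_at <- function(pattern, subject, at = 1L) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  vapply(at, function(start) {
    pos <- start + seq_along(p) - 1L
    in_range <- pos >= 1L & pos <= length(s)
    sum(!in_range) + sum(p[in_range] != s[pos[in_range]])
  }, integer(1))
}

# Distance 2: smallest edit distance between the pattern and any substring of
# the subject starting at each position in 'at'. Substrings of length
# < 2 * nchar(pattern) are enough, as noted above.
# Assumption: every value in 'at' lies within the subject.
min_edit_distance_at <- function(pattern, subject, at = 1L) {
  max_len <- 2L * nchar(pattern) - 1L
  vapply(at, function(start) {
    last <- min(nchar(subject), start + max_len - 1L)
    candidates <- substring(subject, start, start:last)
    as.integer(min(utils::adist(pattern, candidates)))
  }, integer(1))
}

subject <- "ABCDEFxxxCDEFxxxABBCDE"
mism <- count_mismatches_at("ABCDEF", subject, at = c(9, 17))
edits <- min_edit_distance_at("ABCDEF", subject, at = c(9, 17))
```

At position 9 the window "xCDEFx" mismatches the pattern in all 6 letters, while allowing indels brings the distance down to 2 (the substring "xCDEF" is one substitution and one deletion away from "ABCDEF").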
nucleotideFrequencyAt, matchPattern, matchPDict, matchLRPatterns, trimLRPatterns, IUPAC_CODE_MAP, XString-class, align-utils
## ---------------------------------------------------------------------
## hasLetterAt()
## ---------------------------------------------------------------------
library(Biostrings)
x <- DNAStringSet(c("AAACGT", "AACGT", "ACGT", "TAGGA"))
hasLetterAt(x, "AAAAAA", 1:6)
## hasLetterAt() can be used to answer questions like: "which elements
## in 'x' have an A at position 2 and a G at position 4?"
q1 <- hasLetterAt(x, "AG", c(2, 4))
which(rowSums(q1) == 2)
## or "how many probes in the drosophila2 chip have T, G, T, A at
## position 2, 4, 13 and 20, respectively?"
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
q2 <- hasLetterAt(probes, "TGTA", c(2, 4, 13, 20))
sum(rowSums(q2) == 4)
## or "what's the probability to have an A at position 25 if there is
## one at position 13?"
q3 <- hasLetterAt(probes, "AACGT", c(13, 25, 25, 25, 25))
sum(q3[ , 1] & q3[ , 2]) / sum(q3[ , 1])
## Probabilities to have other bases at position 25 if there is an A
## at position 13:
sum(q3[ , 1] & q3[ , 3]) / sum(q3[ , 1]) # C
sum(q3[ , 1] & q3[ , 4]) / sum(q3[ , 1]) # G
sum(q3[ , 1] & q3[ , 5]) / sum(q3[ , 1]) # T
## See ?nucleotideFrequencyAt for another way to get those results.
## ---------------------------------------------------------------------
## neditAt() / isMatchingAt() / which.isMatchingAt()
## ---------------------------------------------------------------------
subject <- DNAString("GTATA")
## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
neditAt("AT", subject, at=3)
isMatchingAt("AT", subject, at=3)
## ... but not at position 1
neditAt("AT", subject)
isMatchingAt("AT", subject)
## ... unless we allow 1 mismatching letter (inexact match)
isMatchingAt("AT", subject, max.mismatch=1)
## Here we look at 6 different starting positions and find 3 matches if
## we allow 1 mismatching letter
isMatchingAt("AT", subject, at=0:5, max.mismatch=1)
## No match
neditAt("NT", subject, at=1:4)
isMatchingAt("NT", subject, at=1:4)
## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
neditAt("NT", subject, at=1:4, fixed=FALSE)
isMatchingAt("NT", subject, at=1:4, fixed=FALSE)
## max.mismatch != 0 and fixed=FALSE can be used together
neditAt("NCA", subject, at=0:5, fixed=FALSE)
isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)
some_starts <- c(10:-10, NA, 6)
subject <- DNAString("ACGTGCA")
is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
some_starts[is_matching]
which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
which.isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1,
                   follow.index=TRUE)
## ---------------------------------------------------------------------
## WITH INDELS
## ---------------------------------------------------------------------
subject <- BString("ABCDEFxxxCDEFxxxABBCDE")
neditAt("ABCDEF", subject, at=9)
neditAt("ABCDEF", subject, at=9, with.indels=TRUE)
isMatchingAt("ABCDEF", subject, at=9, max.mismatch=1, with.indels=TRUE)
isMatchingAt("ABCDEF", subject, at=9, max.mismatch=2, with.indels=TRUE)
neditAt("ABCDEF", subject, at=17)
neditAt("ABCDEF", subject, at=17, with.indels=TRUE)
neditEndingAt("ABCDEF", subject, ending.at=22)
neditEndingAt("ABCDEF", subject, ending.at=22, with.indels=TRUE)