stri_match_all: Extract Regex Pattern Matches, Together with Capture Groups

Description

These functions extract substrings of str that match a given regex pattern. Additionally, they extract matches to every capture group, i.e. to all the subpatterns given in round parentheses.

Usage

stri_match_all(str, ..., regex)
stri_match_first(str, ..., regex)
stri_match_last(str, ..., regex)
stri_match(str, ..., regex, mode = c("first", "all", "last"))
stri_match_all_regex(str, pattern, opts_regex = NULL)
stri_match_first_regex(str, pattern, opts_regex = NULL)
stri_match_last_regex(str, pattern, opts_regex = NULL)

Arguments

str

character vector with strings to search in

...

additional arguments passed to the underlying functions

mode

single string; one of: "first" (the default), "all", "last"

pattern, regex

character vector defining regex patterns to search for; for more details refer to stringi-search-regex

opts_regex

a named list with ICU Regex settings as generated with stri_opts_regex; NULL for default settings

Value

For stri_match_all*, a list of character matrices is returned. Each list element holds the results of a separate search scenario (one string searched with one pattern), with one row per match.
For stri_match_first* and stri_match_last*, on the other hand, a single character matrix is returned. Here each search scenario is given as a separate row.
In both cases, the first matrix column gives the whole match. The second column corresponds to the first capture group, the third column to the second capture group, and so on.
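
The layout of the return values can be inspected with a short sketch like the one below. It assumes the stringi package is installed; the input strings and the variable names (x, pattern, first, all_matches) are illustrative only.

```r
library(stringi)

# Illustrative input: one string with two key=value pairs, one with none
x <- c("breakfast=eggs;lunch=pizza", "no food here")
pattern <- "(\\w+)=(\\w+)"

# stri_match_first_regex: a single matrix, one row per input string.
# Column 1 is the whole match, columns 2 and 3 are the capture groups.
first <- stri_match_first_regex(x, pattern)
dim(first)    # 2 rows, 3 columns
first[1, ]    # "breakfast=eggs" "breakfast" "eggs"

# stri_match_all_regex: a list with one matrix per input string,
# each matrix having one row per match found in that string.
all_matches <- stri_match_all_regex(x, pattern)
length(all_matches)      # 2
all_matches[[1]][, 2]    # first capture group of every match: "breakfast" "lunch"
```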

Details

Vectorized over str and pattern.

If no pattern match is detected or if a capture group match is unavailable, then NAs are included in the resulting matrix (or matrices); see Examples.
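
The two sources of NAs can be told apart with a small sketch such as the following. It assumes stringi is loaded; the pattern with an optional second group is a made-up example, not one taken from the package documentation.

```r
library(stringi)

# Case 1: the pattern does not match at all, so every column is NA.
no_match <- stri_match_first_regex("no food here", "(\\w+)=(\\w+)")
all(is.na(no_match))

# Case 2: the pattern matches, but the optional second group takes no part
# in the match, so only that group's column is NA.
partial <- stri_match_first_regex("lunch", "(\\w+)(=\\w+)?")
is.na(partial)

# stri_match_all_regex reports "no match" as a matrix filled with NAs.
none_all <- stri_match_all_regex("no food here", "(\\w+)=(\\w+)")
all(is.na(none_all[[1]]))
```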

Note that the ICU regex engine currently does not support named capture groups.

stri_match, stri_match_all, stri_match_first, and stri_match_last are convenience functions. They simply call the corresponding stri_match_*_regex function, and are provided for consistency with the wrappers of other string searching functions, e.g. stri_extract.
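
As a quick check of this equivalence, the wrappers can be compared with the functions they call, following the signatures listed under Usage. This is a sketch assuming stringi is loaded; the sample data are illustrative.

```r
library(stringi)

x <- c("breakfast=eggs;lunch=pizza", "no food here")
p <- "(\\w+)=(\\w+)"

# Each wrapper takes the pattern through the named regex argument.
same_first <- identical(stri_match_first(x, regex = p),
                        stri_match_first_regex(x, p))
same_last  <- identical(stri_match_last(x, regex = p),
                        stri_match_last_regex(x, p))
same_all   <- identical(stri_match_all(x, regex = p),
                        stri_match_all_regex(x, p))

# stri_match selects the variant through mode; "first" is the default.
same_mode  <- identical(stri_match(x, regex = p, mode = "all"),
                        stri_match_all_regex(x, p))

c(same_first, same_last, same_all, same_mode)
```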

Examples

stri_match_all_regex("breakfast=eggs, lunch=pizza, dessert=icecream",
   "(\\w+)=(\\w+)")
stri_match_all_regex(c("breakfast=eggs", "lunch=pizza", "no food here"),
   "(\\w+)=(\\w+)")
stri_match_all_regex(c("breakfast=eggs;lunch=pizza",
   "breakfast=bacon;lunch=spaghetti", "no food here"),
   "(\\w+)=(\\w+)")
stri_match_first_regex(c("breakfast=eggs;lunch=pizza",
   "breakfast=bacon;lunch=spaghetti", "no food here"),
   "(\\w+)=(\\w+)")
stri_match_last_regex(c("breakfast=eggs;lunch=pizza",
   "breakfast=bacon;lunch=spaghetti", "no food here"),
   "(\\w+)=(\\w+)")

# Match all patterns of the form XYX, including overlapping matches:
stri_match_all_regex("ACAGAGACTTTAGATAGAGAAGA", "(?=(([ACGT])[ACGT]\\2))")[[1]][,2]
# Compare the above to:
stri_extract_all_regex("ACAGAGACTTTAGATAGAGAAGA", "([ACGT])[ACGT]\\1")