A dataset containing a list of U.S.-specific, canned regular expressions for use in various functions within the qdapRegex package.
data(regex_usa)
A list with 54 elements
Use qdapRegex:::examine_regex() to interactively explore the regular expressions in regex_usa. This will provide a browser + console based breakdown of each regex in the dictionary.
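For a non-interactive look at the same dictionary, the list can be inspected directly. The following is a minimal sketch, assuming qdapRegex is installed; the requireNamespace() guard is an addition for illustration so that the block is skipped when the package is unavailable.

```r
# Sketch: inspect the dictionary without the interactive browser.
# Assumes qdapRegex is installed; the guard skips the block otherwise.
if (requireNamespace("qdapRegex", quietly = TRUE)) {
  # Load the dataset into the current workspace
  data("regex_usa", package = "qdapRegex")

  # Names of the stored regular expressions
  print(names(regex_usa))

  # The first stored pattern, as a character string
  print(regex_usa[[1]])
}
```

Each element is a character string, so it can be passed to base R functions such as gsub() or regmatches() as well as to the qdapRegex functions.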
The following canned regular expressions are included:
abbreviations containing a single lower case or capital letter followed by a period and then an optional space (this must be repeated 2 or more times)
Remove characters between a left and right boundary including the boundaries; note that it contains "%s" that is replaced by sprintf and is not a valid regex on its own
Remove characters between a left and right boundary NOT including the boundaries; note that it contains "%s" that is replaced by sprintf and is not a valid regex on its own
words containing 2 or more consecutive upper case letters and no lower case
phrases of 1 word or more containing 1 or more consecutive upper case letters and no lower case; if phrase is one word long then phrase must be 2 or more consecutive capital letters
substring that looks for in-text and parenthetical APA6 style citations (attempts to exclude references)
substring that looks for in-text APA6 style citations (attempts to exclude references)
substring that looks for parenthetical APA6 style citations (attempts to exclude references)
substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters)
substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters) & zip code (exactly 5 or 5+4 consecutive digits)
dates in the form of 2 digit month, 2 digit day, and 2 or 4 digit year. Separator between month, day, and year may be dot (.), slash (/), or dash (-)
dates in the form of 3-9 letters followed by one or more spaces, 2 digits, a comma(,), one or more spaces, and 4 digits
dates in the form of XXXX-XX-XX; hyphen separated string of 4 digit year, 2 digit month, and 2 digit day
dates in any of the forms matched by rm_date, rm_date2, and rm_date3
substring with dollar sign ($) followed by (1) just dollars (no decimal), (2) dollars and cents (whole number and decimal), or (3) just cents (decimal value); dollars may contain commas
substring with (1) alphanumeric characters or dash (-), plus (+), or underscore (_) (this may be repeated), (2) followed by at (@), followed by the same regex sequence as before the at (@), and ending with dot (.) and 2-14 letters
common emoticons (logic is complicated to explain in words) using ">?[:;=8XB]{1}[-~+o^]?[|\")(>DO>{pP3/]+|</?3|XD+|D:<|x[-~+o^]?[|\")(>DO>{pP3/]+" regex pattern; general pattern is optional hat character, followed by eyes character, followed by optional nose character, and ending with a mouth character
substring of the last endmark group in a string; endmarks include (! ? . * OR |)
substring of the last endmark group in a string; endmarks include (! ? OR .)
substring of the last endmark group in a string; endmarks include (! ? . * | ; OR :)
substring that begins with a hash (#) followed by a word
substring of letters (that may contain apostrophes) n letters long (apostrophe not counted in length); note that it contains "%s" that is replaced by sprintf and is not a valid regex on its own
substring of letters (that may contain apostrophes) n letters long (apostrophe counted in length); note that it contains "%s" that is replaced by sprintf and is not a valid regex on its own
substring of 2 digits or letters a-f inside of a left and right angle brace in the form of "<a4>"
substring of any character that isn't a letter, apostrophe, or single space
substring that may begin with dash (-) for negatives, and is (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value; regex pattern provided by Jason Gray
substring beginning with (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value and followed by a percent sign (%)
phone numbers in the form of optional country code, valid 3 digit prefix, and 7 digits (may contain hyphens and parentheses); logic is complex to explain (see https://stackoverflow.com/a/21008254/1000343 for more)
U.S. state abbreviations (and District of Columbia), constrained to just the possible U.S. state abbreviations rather than any two consecutive capital letters; taken from Mike Hamilton's submission found at https://regexlib.com/REDetails.aspx?regexp_id=2177
substring with a repetition of repeated characters within a word; regex pattern retrieved from StackOverflow user vks: https://stackoverflow.com/a/29438461/1000343
substring with a phrase (a sequence of 1 or more words) that is repeated 2 or more times (case is ignored; separating periods and commas are ignored); regex pattern retrieved from StackOverflow user BrodieG: https://stackoverflow.com/a/28786617/1000343
substring with a word (marked with a boundary) that is repeated 2 or more times (case is ignored)
substring that begins with an at (@) followed by a word
Twitter substring that begins with an at (@) followed by a word composed of alpha-numeric characters and underscores, no longer than 15 characters
substring beginning with title (Mrs., Mr., Ms., Dr.) that is case independent or full title (Miss, Mizz, mizz) followed by a single lower case word or multiple capitalized words
substring that (1) must begin with 0-2 digits, (2) must be followed by a single colon (:), (3) optionally may be followed by either a colon (:) or a dot (.), (4) optionally may be followed by 1-infinite digits (if previous condition is true)
substring that is identical to rm_time with the additional search for Ante Meridiem/Post Meridiem abbreviations (e.g., AM, p.m., etc.)
substring that is specific to transcription time stamps in the form of HH:MM:SS.OS where OS is milliseconds. HH: and .OS are optional. The SS.OS period divide may also be a comma or additional colon. The HH:SS divide may also be a period. String may be affixed with a pound sign (#).
Twitter short link/url; substring optionally beginning with http, followed by t.co ending on a space or end of string (whichever comes first)
substring beginning with http, www., or ftp and ending on a space or end of string (whichever comes first); note that this regex is simple and may not cover all valid URLs or may include invalid URLs
substring beginning with http, www., or ftp and more constrained than rm_url; based on @imme_emosol's response from https://mathiasbynens.be/demo/url-regex
substring beginning with http or ftp and more constrained than rm_url & rm_url2, though light-weight, making it ideal for validation purposes; taken from @imme_emosol's response found at https://mathiasbynens.be/demo/url-regex
substring of white space(s); this regular expression combines rm_white_bracket, rm_white_colon, rm_white_comma, rm_white_endmark, rm_white_lead, rm_white_trail, and rm_white_multiple
substring of white space(s) following left brackets ("{", "(", "[") or preceding right brackets ("}", ")", "]")
substring of white space(s) preceding colon(s)/semicolon(s)
substring of white space(s) preceding a comma
substring of white space(s) preceding a single occurrence/combination of period(s), question mark(s), and exclamation point(s)
substring of leading white space(s)
substring of leading/trailing white space(s)
substring of multiple, consecutive white spaces
substring of white space(s) preceding a comma or a single occurrence/combination of colon(s), semicolon(s), period(s), question mark(s), and exclamation point(s)
substring of trailing white space(s)
substring of 5 digits optionally followed by a dash and 4 more digits
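Several entries above note that they contain "%s" and are not valid regular expressions until sprintf fills the slots. The base R sketch below shows that mechanism. The template string is hypothetical, written only for illustration, and is not the value stored in regex_usa; the variable names are likewise assumptions.

```r
# Hypothetical template for illustration only; NOT the value stored in
# regex_usa. The two "%s" slots take the left and right boundaries.
template <- "%s.*?%s"

# Escape the boundary characters so they match literally, then fill the
# slots with sprintf() to obtain a usable regular expression.
pattern <- sprintf(template, "\\(", "\\)")

x <- "Keep this (drop this) and this"

# Remove the span between the boundaries, boundaries included.
result <- gsub(pattern, "", x, perl = TRUE)
print(result)
```

Removing the span leaves the spaces that surrounded it, which is the kind of residue the white space entries in the list are meant to clean up.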