tokenizers: Regexp tokenizers

Description

Tokenizers using regular expressions to match either tokens or separators between tokens.

Usage

Regexp_Tokenizer(pattern, invert = FALSE, ..., meta = list())
blankline_tokenizer(s)
whitespace_tokenizer(s)
wordpunct_tokenizer(s)

Value

Regexp_Tokenizer() returns the created regexp span tokenizer.

blankline_tokenizer(), whitespace_tokenizer() and wordpunct_tokenizer() return the spans of the tokens found in s.

Arguments

pattern: a character string giving the regular expression to use for matching.
invert: a logical indicating whether pattern matches the separators between tokens (TRUE) rather than the tokens themselves (FALSE, the default).
...: further arguments to be passed to gregexpr().
meta: a named or empty list of tokenizer metadata tag-value pairs.
s: a String object, or something coercible to this using as.String() (e.g., a character string with appropriate encoding information).

Details

Regexp_Tokenizer() creates regexp span tokenizers. These pass the given pattern and ... arguments to gregexpr() to match either tokens or the separators between tokens, and then transform the match results into character spans of the tokens found.
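
The following is a minimal base R sketch of how gregexpr() matches can be turned into token spans, both when the pattern matches tokens and when it matches separators. The helper regexp_spans() and its data frame return value are hypothetical and only illustrate the idea; they are not the package's implementation, which returns Span objects.

```r
## Hypothetical helper for illustration only; not part of the NLP package.
regexp_spans <- function(x, pattern, invert = FALSE, ...) {
  m <- gregexpr(pattern, x, ...)[[1L]]
  pos <- as.integer(m)                          # match start positions
  len <- as.integer(attr(m, "match.length"))    # match lengths
  if (pos[1L] == -1L) {
    ## No match: no tokens, or (inverted) the whole string is one token.
    if (invert) return(data.frame(start = 1L, end = nchar(x)))
    return(data.frame(start = integer(), end = integer()))
  }
  if (invert) {
    ## Tokens are the gaps before, between and after the separators.
    start <- c(1L, pos + len)
    end <- c(pos - 1L, nchar(x))
  } else {
    ## Tokens are the matches themselves.
    start <- pos
    end <- pos + len - 1L
  }
  keep <- start <= end                          # drop empty gaps
  data.frame(start = start[keep], end = end[keep])
}

x <- "  First sentence.  Second sentence.  "

## Match the tokens directly ...
tokens <- regexp_spans(x, "[^[:space:]]+")
## ... or match the separators and take the gaps between them.
gaps <- regexp_spans(x, "[[:space:]]+", invert = TRUE)

tokens
substring(x, tokens$start, tokens$end)
```

In this sketch both calls describe the same tokens; which form is more convenient depends on whether the tokens or the separators are easier to express as a regular expression.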

whitespace_tokenizer() tokenizes by treating any sequence of whitespace characters as a separator.

blankline_tokenizer() tokenizes by treating any sequence of blank lines as a separator.

wordpunct_tokenizer() tokenizes by matching sequences of alphabetic characters and sequences of (non-whitespace) non-alphabetic characters.
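
As a rough usage sketch, the block below applies blankline_tokenizer() to a two-paragraph text and builds a custom tokenizer with Regexp_Tokenizer(). The sample strings, the separator pattern, the name comma_tokenizer and the metadata tag are assumptions chosen for illustration, not definitions taken from the package.

```r
library(NLP)

## Illustrative text: two paragraphs separated by a blank line.
s <- as.String("First paragraph, line one.\nLine two.\n\nSecond paragraph.")

## Paragraph-level tokens: blank lines act as separators.
paragraphs <- s[blankline_tokenizer(s)]
paragraphs

## A custom tokenizer (hypothetical name and pattern): a comma followed by
## optional whitespace is treated as a separator, so invert = TRUE.
comma_tokenizer <-
    Regexp_Tokenizer(",[[:space:]]*", invert = TRUE,
                     meta = list(description = "Split on commas (example)"))

items <- as.String("alpha, beta,gamma")
spans <- comma_tokenizer(items)
spans
items[spans]
```

With invert = TRUE only the separator needs to be described, which tends to be simpler when the tokens themselves may contain arbitrary characters.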

Examples

library(NLP)

## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**

spans <- whitespace_tokenizer(s)
spans
s[spans]

spans <- wordpunct_tokenizer(s)
spans
s[spans]