Regexp_Tokenizer()
creates regexp span tokenizers which use the given pattern and ...
arguments to match tokens or separators between tokens via
gregexpr(), and then transform the results of this into character
spans of the tokens found.
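As a minimal sketch, a custom span tokenizer for runs of digits could be built as follows (this assumes the NLP package is attached and that Regexp_Tokenizer() takes the regexp as its first argument; the tokenizer name and example string are illustrative):

```r
library("NLP")

## Build a span tokenizer whose pattern matches runs of digits.
digit_tokenizer <- Regexp_Tokenizer("[[:digit:]]+")

s <- String("Room 101, floor 7")
spans <- digit_tokenizer(s)
## Subscripting the String with the spans extracts the matched tokens.
s[spans]
```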
whitespace_tokenizer()
tokenizes by treating any sequence of
whitespace characters as a separator.
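A minimal usage sketch (assuming the NLP package's String() class, whose span subscripting recovers the tokens):

```r
library("NLP")

s <- String("  First sentence.  Second sentence.  ")
spans <- whitespace_tokenizer(s)
## Each maximal run of non-whitespace characters becomes one token,
## e.g. "First", "sentence.", "Second", "sentence."
s[spans]
```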
blankline_tokenizer()
tokenizes by treating any sequence of
blank lines as a separator.
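For example (a sketch under the same assumptions as above; the text is illustrative):

```r
library("NLP")

s <- String("First paragraph.\n\nSecond paragraph.\n")
spans <- blankline_tokenizer(s)
## Blank lines separate tokens, so each span covers one paragraph.
s[spans]
```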
wordpunct_tokenizer()
tokenizes by matching sequences of
alphabetic characters and sequences of (non-whitespace) non-alphabetic
characters.
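A short sketch of the behavior described above (illustrative input; assumes the NLP package is attached):

```r
library("NLP")

s <- String("Hello, world!")
spans <- wordpunct_tokenizer(s)
## Alphabetic runs and (non-whitespace) non-alphabetic runs are
## matched as separate tokens, e.g. "Hello", ",", "world", "!"
s[spans]
```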