The concatenator character is a special delimiter used to link
separate tokens in multi-token phrases. It is embedded in the meta-data of
tokens objects and used in downstream operations, such as tokens_compound()
or tokens_lookup(). It can be extracted using concat() and set using
tokens(x, concatenator = ...) when x is a tokens object.
The default _ is recommended since it will not be removed during normal
cleaning and tokenization (while nearly all other punctuation characters, at
least those in the Unicode punctuation class [P], will be removed).
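
The behaviour described above can be sketched as follows. This is a minimal illustration, assuming a quanteda version in which tokens() accepts the concatenator argument; the example sentence and the object names toks, toks_plus and toks_comp are made up for this sketch.

```r
library(quanteda)

# Tokenize a short example text (hypothetical input, for illustration only)
toks <- tokens("The United States is a federal republic.")

# Extract the concatenator stored in the object's meta-data
concat(toks)

# Set a different concatenator on an existing tokens object
toks_plus <- tokens(toks, concatenator = "+")
concat(toks_plus)

# Downstream functions such as tokens_compound() are expected to pick up
# the stored concatenator when joining the tokens of a multi-token phrase
toks_comp <- tokens_compound(toks_plus, pattern = phrase("United States"))
print(toks_comp)
```

A non-default concatenator such as "+" belongs to a class of characters that cleaning steps may strip, which is why keeping the default _ is usually the safer choice.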