These generic functions are used to build dictionaries from a text
source, to coerce other formats to dictionary, and to coerce a dictionary
to a character vector. Currently, the only non-trivial type coercible to
dictionary is character, in which case each entry of the input vector is
treated as a single word. Coercion from dictionary to character returns
the words included in the dictionary as a regular character vector.
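As a rough illustration of the coercion semantics described above, the
following base R sketch treats each entry of a character vector as one word
and returns the stored words as a plain character vector. The helpers
as_dict_sketch and dict_to_character are hypothetical stand-ins, not the
package's own functions, and dropping duplicated entries is an assumption.

```r
# Hypothetical stand-ins for the coercion methods (not the package's API).
# Assumption: duplicated entries are stored only once, in order of appearance.
as_dict_sketch <- function(x) {
  stopifnot(is.character(x))
  structure(list(words = unique(x)), class = "dict_sketch")
}

# Coercion back to character simply returns the stored words.
dict_to_character <- function(d) {
  stopifnot(inherits(d, "dict_sketch"))
  d$words
}

d <- as_dict_sketch(c("a", "rose", "is", "a", "rose"))
words <- dict_to_character(d)
```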
Dictionaries can be built from text coming either directly from a
character vector, or from a connection. The second option is useful if one
wants to avoid loading the full text corpus into physical memory, and it
makes it possible to process text from different sources such as files,
compressed files or URLs.
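The memory argument for connections can be sketched in base R as follows:
lines are read in batches and only the running word counts are kept. This is
an illustration of the idea, not the package's implementation; the function
count_words_from_connection and its batch_size argument are hypothetical
names. The same function would accept connections created with file(),
gzfile() or url().

```r
# Sketch: count words from a connection in batches of lines, so that the
# whole corpus is never held in memory at once.
count_words_from_connection <- function(con, batch_size = 1000L) {
  # Open the connection if needed, and close it again on exit in that case.
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))
  }
  counts <- integer(0)  # named integer vector: word -> count
  repeat {
    lines <- readLines(con, n = batch_size, warn = FALSE)
    if (length(lines) == 0L) break  # end of input
    words <- unlist(strsplit(lines, "[[:space:]]+"))
    words <- words[nzchar(words)]   # drop empty strings from leading blanks
    if (length(words) == 0L) next
    batch <- table(words)
    # Register words seen for the first time, then add the batch counts.
    new_words <- setdiff(names(batch), names(counts))
    counts[new_words] <- 0L
    counts[names(batch)] <- counts[names(batch)] + as.integer(batch)
  }
  counts
}

# Usage with an in-memory text connection, one line per batch.
con <- textConnection(c("a rose is", "a rose"))
counts <- count_words_from_connection(con, batch_size = 1L)
close(con)
```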
A single preprocessing transformation can be applied before the text is
scanned for unique words. After preprocessing, anything delimited by one or
more white space characters in the transformed text input is counted as a
word and may be added to the dictionary, subject to additional constraints.
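The two steps just described (preprocess, then split on white space) can be
sketched in base R as below. The names tokenize_sketch, preprocess and
to_lower_no_punct are hypothetical and chosen for illustration only; the
particular transformation (lower-casing and stripping punctuation) is just
one possible choice.

```r
# Sketch: apply one preprocessing function, then treat anything delimited
# by one or more white space characters as a word.
tokenize_sketch <- function(text, preprocess = identity) {
  text <- preprocess(text)
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words[nzchar(words)]  # drop empty strings produced by leading blanks
}

# Example transformation: lower-case the text and remove punctuation.
to_lower_no_punct <- function(x) gsub("[[:punct:]]", "", tolower(x))

tokens <- tokenize_sketch("Hello,   world!  Hello again.", to_lower_no_punct)
candidates <- unique(tokens)  # candidate dictionary words
```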
The constraints for including a word in the dictionary can be of
three types: (i) fixed size of dictionary, set by the size
argument; (ii) fixed text covering fraction, set by the cov
argument; or (iii) minimum word count threshold, set by the thresh
argument.
Only one of these constraints can be applied at a time: specifying more
than one of size, cov or thresh raises an error.
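A minimal base R sketch of the three constraints and of their mutual
exclusivity is given below. It is not the package's implementation: the
function select_words is hypothetical, and several details are assumptions,
namely that the most frequent words are kept first, that cov selects the
smallest set of most frequent words whose counts reach the requested
fraction of the text, and that thresh is an inclusive lower bound.

```r
# Sketch: select dictionary words from a named vector of word counts.
select_words <- function(counts, size = NULL, cov = NULL, thresh = NULL) {
  # At most one constraint may be given.
  n_given <- sum(!is.null(size), !is.null(cov), !is.null(thresh))
  if (n_given > 1L)
    stop("Specify at most one of 'size', 'cov' or 'thresh'.")

  counts <- sort(counts, decreasing = TRUE)  # most frequent words first
  if (!is.null(size)) {
    # (i) fixed dictionary size
    counts <- head(counts, size)
  } else if (!is.null(cov)) {
    # (ii) fixed fraction of the text covered by dictionary words
    stopifnot(cov > 0, cov <= 1)
    covered <- cumsum(counts) / sum(counts)
    n_keep <- which(covered >= cov)[1]
    counts <- head(counts, n_keep)
  } else if (!is.null(thresh)) {
    # (iii) minimum word count (assumed inclusive)
    counts <- counts[counts >= thresh]
  }
  names(counts)
}

counts <- c(sat = 2L, the = 5L, cat = 3L)
by_size   <- select_words(counts, size = 2)
by_cov    <- select_words(counts, cov = 0.7)
by_thresh <- select_words(counts, thresh = 3)
```

With these counts, all three calls keep the same two most frequent words,
while a call such as select_words(counts, size = 1, thresh = 2) stops with
an error.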