The text documents are split according to the given categories, and
\(n\)-gram profiles are computed using the specified method. The
options are either those used for creating profiles (if this argument
is not NULL), or are obtained by combining the options given in ... and
options and merging these with the default profile options specified
by the textcat option profile_options, using exact name matching. The
method and options employed for building the db are stored in the db
as attributes "method" and "options", respectively.
There is a c method for combining profile dbs, provided that these
have identical options. There is also a [ method for subscripting, as
well as as.matrix and as.simple_triplet_matrix methods to “export” the
profiles to a dense matrix or to the sparse simple triplet matrix
representation provided by package slam, respectively.
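The workflow described above can be sketched as follows. This is a minimal illustration, assuming packages textcat and slam are installed; the toy documents and category labels are made up for the example, and the options n and size are passed through ... as described above.

```r
# Sketch only: runs when textcat (and slam, for the matrix export) are available.
if (requireNamespace("textcat", quietly = TRUE) &&
    requireNamespace("slam", quietly = TRUE)) {
  library("textcat")

  # Hypothetical toy documents and their categories.
  x  <- c("the quick brown fox", "the lazy dog sleeps",
          "der schnelle braune Fuchs", "der faule Hund liegt im Garten")
  id <- c("english", "english", "german", "german")

  # Build one profile per category; n and size override the defaults.
  db <- textcat_profile_db(x, id = id, n = 1:3, size = 200L)

  # The method and options used are stored as attributes of the db.
  attr(db, "method")
  attr(db, "options")$n

  # Subscript with [ and recombine with c (both parts share the same options).
  combined <- c(db["english"], db["german"])

  # Export the profiles to a dense matrix (one row per profile).
  m <- as.matrix(db)
  dim(m)
}
```

For larger dbs, as.simple_triplet_matrix(db) avoids materializing the dense matrix.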
Currently, the only available built-in method is "textcnt", which has
the following options:
n: A numeric vector giving the numbers of characters or bytes in the \(n\)-gram profiles. Default: 1:5.
split: The regular expression pattern to be used in word splitting. Default: "[[:space:][:punct:][:digit:]]+".
perl: A logical indicating whether to use Perl-compatible regular expressions in word splitting. Default: FALSE.
tolower: A logical indicating whether to transform texts to lower case (after word splitting). Default: TRUE.
reduce: A logical indicating whether a representation of \(n\)-grams more efficient than the one used by Cavnar and Trenkle should be employed. Default: TRUE.
useBytes: A logical indicating whether to use byte \(n\)-grams rather than character \(n\)-grams. Default: FALSE.
ignore: A character vector of \(n\)-grams to be ignored when computing \(n\)-gram profiles. Default: "_" (corresponding to a word boundary).
size: The maximal number of \(n\)-grams used for a profile. Default: 1000L.
This method uses textcnt in package tau for computing \(n\)-gram
profiles, with n, split, perl and useBytes corresponding to the
respective textcnt arguments, and option reduce setting argument
marker as needed. \(N\)-grams listed in option ignore are removed, and
only the most frequent remaining ones are retained, with the maximal
number given by option size.
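The steps just described (split into words, count \(n\)-grams, drop ignored ones, keep the most frequent) can be approximated in base R. This is a simplified sketch, not the implementation in tau: the function name ngram_profile is hypothetical, words are padded with "_" on both sides to mark boundaries, and the reduce, perl and useBytes options are not modeled.

```r
# Hypothetical helper: a rough base R approximation of an n-gram profile.
ngram_profile <- function(text, n = 1:5,
                          split = "[[:space:][:punct:][:digit:]]+",
                          ignore = "_", size = 1000L) {
  # Split into words, drop empty pieces, then lower-case.
  words <- unlist(strsplit(text, split))
  words <- tolower(words[nzchar(words)])

  # Mark word boundaries with "_" on both sides.
  padded <- paste0("_", words, "_")

  # Collect all character n-grams of the requested lengths.
  grams <- unlist(lapply(padded, function(w) {
    len <- nchar(w)
    unlist(lapply(n[n <= len], function(k) {
      starts <- seq_len(len - k + 1L)
      substring(w, starts, starts + k - 1L)
    }))
  }))

  # Count, remove ignored n-grams, and keep the most frequent ones.
  tab <- table(grams)
  counts <- setNames(as.integer(tab), names(tab))
  counts <- counts[!(names(counts) %in% ignore)]
  head(sort(counts, decreasing = TRUE), size)
}

p <- ngram_profile("banana band", n = 1:2, size = 3L)
names(p)
```

With 1-grams included, the bare boundary marker "_" would otherwise be among the most frequent entries, which is why it is ignored by default.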
Unless the profile db uses bytes rather than characters (i.e., option
useBytes is TRUE), text documents in x containing non-ASCII characters
must declare their encoding (see Encoding), and will be re-encoded to
UTF-8.
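As an illustration of declaring an encoding, the following base R sketch marks a Latin-1 string and converts it to UTF-8. The example string is made up, and the hex escape is used only to keep the source ASCII; the snippet does not call textcat itself.

```r
# A string whose bytes are Latin-1 ("facade" with a c-cedilla, byte 0xE7).
x <- "fa\xe7ade"

# Declare the encoding so that R knows how to interpret the bytes.
Encoding(x) <- "latin1"

# Re-encode to UTF-8.
x_utf8 <- enc2utf8(x)
Encoding(x_utf8)
nchar(x_utf8, type = "chars")
nchar(x_utf8, type = "bytes")
```

When the encoding of the input is known but not declared, iconv(x, from, "UTF-8") is an alternative way to obtain UTF-8 text.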
Note that option n specifies all numbers of characters or bytes to be
used in the profiles, and not just the maximal number: e.g., taking
n = 3 will create profiles only containing tri-grams.
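The difference between n = 3 and n = 1:3 can be seen in a small base R sketch. The helper char_ngrams is hypothetical and only mimics the meaning of option n; it is not how textcnt computes \(n\)-grams.

```r
# Hypothetical helper: all character n-grams of a word for the lengths in n.
char_ngrams <- function(word, n) {
  len <- nchar(word)
  unlist(lapply(n[n <= len], function(k) {
    starts <- seq_len(len - k + 1L)
    substring(word, starts, starts + k - 1L)
  }))
}

# n = 3 yields tri-grams only.
only_tri <- char_ngrams("_text_", n = 3)
only_tri
# "_te" "tex" "ext" "xt_"

# n = 1:3 yields uni-, bi- and tri-grams.
up_to_tri <- char_ngrams("_text_", n = 1:3)
sort(unique(nchar(up_to_tri)))
# 1 2 3
```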