textcat_profile_db: Textcat Profile Dbs

Description

Create \(n\)-gram profile dbs for text categorization.

Usage

textcat_profile_db(x, id = NULL, method = NULL, ...,
                   options = list(), profiles = NULL)

Arguments

x: a character vector of text documents, or an R object of text documents extractable via as.character.
id: a character vector giving the categories of the texts, recycled to the length of x, or NULL (default), indicating that each text document is to be treated separately.
method: a character string specifying a built-in method, or a user-defined function for creating \(n\)-gram profiles, or NULL (default). If NULL, the method and options used for creating the db given in profiles are used if profiles is not NULL; otherwise, the current value of textcat option profile_method is used (see textcat_options).
...: options to be passed to the method for creating profiles.
options: a list of such options.
profiles: a textcat profile db object whose method and options are to be reused, or NULL (default).

Details

The text documents are split according to the given categories, and \(n\)-gram profiles are computed using the specified method. If profiles is not NULL, the options used for creating that db are reused. Otherwise, the options are obtained by combining those given in ... and options, and merging the result with the default profile options specified by the textcat option profile_options, using exact name matching. The method and options employed for building the db are stored in the db as attributes "method" and "options", respectively.
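As a minimal sketch of the option handling described above (the toy texts, category labels, and option values are illustrative assumptions, not part of the package), one option can be passed via ... and another via options, and the stored attributes can then be inspected:

```r
library(textcat)

## Two toy documents (illustrative only), one per category.
texts <- c("The quick brown fox jumps over the lazy dog.",
           "Der schnelle braune Fuchs springt ueber den faulen Hund.")

## Pass option 'n' via '...' and option 'size' via 'options'.
db <- textcat_profile_db(texts, id = c("english", "german"),
                         n = 2:3, options = list(size = 50L))

## The method and options employed are stored as attributes of the db.
attr(db, "method")
str(attr(db, "options"))

## Reuse the same method and options for a new text via 'profiles'.
db2 <- textcat_profile_db("Ein weiterer kurzer Satz.", profiles = db)
str(attr(db2, "options"))
```

Passing profiles = db is the same mechanism the Examples section uses with ECIMCI_profiles.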

There is a c method for combining profile dbs, provided that these have identical options. There are also a [ method for subscripting, and as.matrix and as.simple_triplet_matrix methods to “export” the profiles to a dense matrix or to the sparse simple triplet matrix representation provided by package slam, respectively.
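The following sketch illustrates these methods on two single-document dbs built with the default options (the texts and category labels are made up for illustration):

```r
library(textcat)

## Two dbs built with the same (default) options, so they can be combined.
a <- textcat_profile_db("The quick brown fox jumps over the lazy dog.",
                        id = "english")
b <- textcat_profile_db("Der schnelle braune Fuchs springt sehr weit.",
                        id = "german")

## Combine the dbs with the c method.
both <- c(a, b)
length(both)

## Subscripting with [ gives a profile db again.
second <- both[2]
class(second)

## Export the profiles to a dense matrix.
m <- as.matrix(both)
dim(m)
```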

Currently, the only available built-in method is "textcnt", which has the following options:

n:

A numeric vector giving the numbers of characters or bytes in the \(n\)-gram profiles.

Default: 1 : 5.

split:

The regular expression pattern to be used in word splitting.

Default: "[[:space:][:punct:][:digit:]]+".

perl:

A logical indicating whether to use Perl-compatible regular expressions in word splitting.

Default: FALSE.

tolower:

A logical indicating whether to transform texts to lower case (after word splitting).

Default: TRUE.

reduce:

A logical indicating whether a representation of \(n\)-grams more efficient than the one used by Cavnar and Trenkle should be employed.

Default: TRUE.

useBytes:

A logical indicating whether to use byte \(n\)-grams rather than character \(n\)-grams.

Default: FALSE.

ignore:

A character vector of \(n\)-grams to be ignored when computing \(n\)-gram profiles.

Default: "_" (corresponding to a word boundary).

size:

The maximal number of \(n\)-grams used for a profile.

Default: 1000L.

This method uses textcnt in package tau for computing \(n\)-gram profiles, with n, split, perl and useBytes corresponding to the respective textcnt arguments, and option reduce setting argument marker as needed. \(N\)-grams listed in option ignore are removed, and only the most frequent remaining ones are retained, up to the maximal number given by option size.

Unless the profile db uses bytes rather than characters (i.e., option useBytes is TRUE), text documents in x containing non-ASCII characters must declare their encoding (see Encoding), and will be re-encoded to UTF-8.
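A small sketch of what declaring the encoding can look like in practice. The sample string is an illustrative assumption, and the Latin-1 copy is produced with iconv only to simulate input that is not UTF-8:

```r
library(textcat)

## A string with non-ASCII characters, written with \u escapes so that
## R marks it as UTF-8 regardless of the source file encoding.
x <- "Gr\u00fc\u00dfe aus M\u00fcnchen"
Encoding(x)

## Simulate Latin-1 input; iconv marks the result with its encoding.
y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)

## Both inputs declare their encoding, so they can be profiled with
## character n-grams (useBytes = FALSE, the default).
db_utf8   <- textcat_profile_db(x)
db_latin1 <- textcat_profile_db(y)
```

For strings read from files, the encoding can be declared when reading (for example via the encoding argument of readLines) or afterwards with Encoding<-.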

Note that option n specifies all numbers of characters or bytes to be used in the profiles, and not just the maximal number: e.g., taking n = 3 will create profiles only containing tri-grams.
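To see the difference, one can tabulate the \(n\)-gram lengths in a profile. This is a sketch on a made-up input string; the tabulated lengths count word boundary markers as characters:

```r
library(textcat)

txt <- "text categorization with character n-grams"

## n = 3: only tri-grams are kept in the profile.
tri <- textcat_profile_db(txt, n = 3)
table(nchar(names(tri[[1]])))

## n = 1:3: uni-, bi- and tri-grams are all kept.
upto3 <- textcat_profile_db(txt, n = 1:3)
table(nchar(names(upto3[[1]])))
```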

Examples


library(textcat)
## Obtain the texts of the standard licenses shipped with R.
files <- dir(file.path(R.home("share"), "licenses"), "^[A-Z]",
             full.names = TRUE)
texts <- sapply(files,
                function(f) paste(readLines(f), collapse = "\n"))
names(texts) <- basename(files)
## Build a profile db using the same method and options as for building
## the ECIMCI character profiles.
profiles <- textcat_profile_db(texts, profiles = ECIMCI_profiles)
## Inspect the 10 most frequent n-grams in each profile.
lapply(profiles, head, 10L)
## Combine into one frequency table.
tab <- as.matrix(profiles)
tab[, 1 : 10]
## Determine languages.
textcat(profiles, ECIMCI_profiles)
