textcnt: Term or Pattern Counting of Text Documents

Description

This function provides a common interface to perform typical term or pattern counting tasks on text documents.

Usage

textcnt(x, n = 3L, split = "[[:space:][:punct:][:digit:]]+",
        tolower = TRUE, marker = "_", words = NULL, lower = 0L,
        method = c("ngram", "string", "prefix", "suffix"),
        recursive = FALSE, persistent = FALSE, useBytes = FALSE,
        perl = TRUE, verbose = FALSE, decreasing = FALSE)
# S3 method for textcnt
format(x, ...)

Value

Either a single vector of counts of mode integer with the names indexing the patterns counted, or a list of such vectors with the components corresponding to the individual documents. Note that by default the counts are in prefix tree (byte) order (for

method = "suffix" this is the order of the reversed strings). Otherwise, if decreasing = TRUE the counts are sorted in decreasing order. Note that the (default) order of ties is preserved (see sort).

Arguments

x: a (list of) vector(s) of character representing one (or more) text document(s).
n: the maximum number of characters considered in ngram, prefix, or suffix counting (for word counting see details).
split: the regular expression pattern (PCRE) to be used in word splitting (if NULL, do nothing).
tolower: option to transform the documents to lowercase (after word splitting).
marker: the string used to mark word boundaries.
words: the number of words to use from the beginning of a document (if NULL, all words are used).
lower: the lower bound for a count to be included in the result set(s).
method: the type of counts to compute.
recursive: option to compute counts for individual documents (default all documents).
persistent: option to count documents incrementally.
useBytes: option to process byte-by-byte instead of character-by-character.
perl: option to use PCRE in word splitting.
verbose: option to obtain timing statistics.
decreasing: option to return the counts in decreasing order.
...: further (unused) arguments.

Author

Christian Buchta

Details

The following counting methods are currently implemented:

ngram: Count all word n-grams of order 1,...,n.
string: Count all word sequence n-grams of order n.
prefix: Count all word prefixes of at most length n.
suffix: Count all word suffixes of at most length n.

The n-grams of a word are defined to be the substrings of length n = min(length(word), n) starting at positions 1,...,length(word)-n. Note that the value of marker is pre- and appended to word before counting. However, the empty word is never marked and therefore not counted. Note that marker = "\1" is reserved for counting of an efficient set of ngrams and marker = "\2" for the set proposed by Cavnar and Trenkle (see references).

If method = "string" word-sequences of and only of length n are counted. Therefore, documents with less than n words are omitted.

By default all documents are preprocessed and counted using a single C function call. For large document collections this may come at the price of considerable memory consumption. If persistent = TRUE and recursive = TRUE documents are counted incrementally, i.e., into a persistent prefix tree using as many C function calls as there are documents. Further, if persistent = TRUE and recursive = FALSE the documents are counted using a single call but no result is returned until the next call with persistent = FALSE. Thus, persistent acts as a switch with the counts being accumulated until release. Timing statistics have shown that incremental counting can be order of magnitudes faster than the default.

Be aware that the character strings in the documents are translated to the encoding of the current locale if the encoding is set (see Encoding). Therefore, with the possibility of "unknown" encodings when in an "UTF-8" locale, or invalid "UTF-8" strings declared to be in "UTF-8", the code checks if each string is a valid "UTF-8" string and stops if not. Otherwise, strings are processed bytewise without any checks. However, embedded nul bytes are always removed from a string. Finally, note that during incremental counting a change of locale is not allowed (and a change in method is not recommended).

Note that the C implementation counts words into a prefix tree. Whereas this is highly efficient for n-gram, prefix, or suffix counting it may be less efficient for simple word counting. That is, implementations which use hash tables may be more efficient if the dictionary is large.

format.textcnt pretty prints a named vector of counts (see below) including information about the rank and encoding details of the strings.

References

W.B. Cavnar and J.M. Trenkle (1994). N-Gram Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 161--175.

Examples

Run this code

## the classic
txt <- "The quick brown fox jumps over the lazy dog."

##
textcnt(txt, method = "ngram")
textcnt(txt, method = "prefix", n = 5L)

r <- textcnt(txt, method = "suffix", lower = 1L)
data.frame(counts = unclass(r), size = nchar(names(r)))
format(r)

## word sequences
textcnt(txt, method = "string")

## inefficient
textcnt(txt, split = "", method = "string", n = 1L)

## incremental
textcnt(txt, method = "string", persistent = TRUE, n = 1L)
textcnt(txt, method = "string", n = 1L)

## subset
textcnt(txt, method = "string", words = 5L, n = 1L)

## non-ASCII
txt <- "The quick br\xfcn f\xf6x j\xfbmps \xf5ver the lazy d\xf6\xf8g."
Encoding(txt) <- "latin1"
txt

## implicit translation
r <- textcnt(txt, method = "suffix")
table(Encoding(names(r)))
r
## efficient sets
textcnt("is",     n = 3L, marker = "\1")
textcnt("is",     n = 4L, marker = "\1")
textcnt("corpus", n = 5L, marker = "\1")
## CT sets
textcnt("corpus", n = 5L, marker = "\2")

Run the code above in your browser using DataLab