kgram_freqs: k-gram Frequency Tables

Description

Extract k-gram frequency counts from a text or a connection.

Principal methods supported by objects of class `kgram_freqs`

query(): query k-gram counts from the table. See query
probability(): compute word continuation and sentence probabilities using Maximum Likelihood estimates. See probability.
language_model(): build a k-gram language model using various probability smoothing techniques. See language_model.

Usage

kgram_freqs(object, ...)
# S3 method for numeric
kgram_freqs(
  object,
  .preprocess = identity,
  .tknz_sent = identity,
  dict = NULL,
  ...
)
# S3 method for kgram_freqs
kgram_freqs(object, ...)
# S3 method for character
kgram_freqs(
  object,
  N,
  .preprocess = identity,
  .tknz_sent = identity,
  dict = NULL,
  open_dict = is.null(dict),
  verbose = TRUE,
  ...
)
# S3 method for connection
kgram_freqs(
  object,
  N,
  .preprocess = identity,
  .tknz_sent = identity,
  dict = NULL,
  open_dict = is.null(dict),
  verbose = TRUE,
  max_lines = max_lines,
  batch_size = NULL,
  ...
)
process_sentences(
  text,
  freqs,
  .preprocess = attr(freqs, ".preprocess"),
  .tknz_sent = attr(freqs, ".tknz_sent"),
  open_dict = TRUE,
  in_place = TRUE,
  verbose = TRUE,
  ...
)
# S3 method for character
process_sentences(
  text,
  freqs,
  .preprocess = attr(freqs, ".preprocess"),
  .tknz_sent = attr(freqs, ".tknz_sent"),
  open_dict = TRUE,
  in_place = TRUE,
  verbose = TRUE,
  ...
)
# S3 method for connection
process_sentences(
  text,
  freqs,
  .preprocess = attr(freqs, ".preprocess"),
  .tknz_sent = attr(freqs, ".tknz_sent"),
  open_dict = TRUE,
  in_place = TRUE,
  verbose = TRUE,
  max_lines = Inf,
  batch_size = max_lines,
  ...
)

Arguments

object

any type allowed by the available methods. The type defines the behaviour of kgram_freqs() as a default constructor, a copy constructor or a constructor of a non-trivial object. See <U+2018>Details<U+2019>.

...

further arguments passed to or from other methods.

.preprocess

a function taking a character vector as input and returning a character vector as output. Optional preprocessing transformation applied to text before k-gram tokenization. See <U+2018>Details<U+2019>.

.tknz_sent

a function taking a character vector as input and returning a character vector as output. Optional sentence tokenization step applied to text after preprocessing and before k-gram tokenization. See <U+2018>Details<U+2019>.

dict

anything coercible to class dictionary. Optional pre-specified word dictionary.

a length one integer. Maximum order of k-grams to be considered.

open_dict

TRUE or FALSE. If TRUE, any new word encountered during processing not appearing in the original dictionary is included into the dictionary. Otherwise, new words are replaced by an unknown word token. It is by default TRUE if dict is specified, FALSE otherwise.

verbose

Print current progress to the console.

max_lines

a length one positive integer or Inf. Maximum number of lines to be read from the connection. If Inf, keeps reading until the End-Of-File.

batch_size

a length one positive integer less than or equal to max_lines.Size of text batches when reading text from connection.

text

a character vector or a connection. Source of text from which k-gram frequencies are to be extracted.

freqs

a kgram_freqs object, to which new k-gram counts from text are to be added.

in_place

TRUE or FALSE. Should the initial kgram_freqs object be modified in place?

Value

A kgram_freqs class object: k-gram frequency table storing k-gram counts from text. For process_sentences(), the updated kgram_freqs object is returned invisibly if in_place is TRUE, visibly otherwise.

Details

The function kgram_freqs() is a generic constructor for objects of class kgram_freqs, i.e. k-gram frequency tables. The constructor from integer returns an empty 'kgram_freqs' of fixed order, with an optional predefined dictionary (which can be empty) and .preprocess and .tknz_sent functions to be used as defaults in other kgram_freqs methods. The constructor from kgram_freqs returns a copy of an existing object, and it is provided because, in general, kgram_freqs objects have reference semantics, as discussed below.

The following discussion focuses on process_sentences() generic, as well as on the character and connection methods of the constructor kgram_freqs(). These functions extract k-gram frequency counts from a text source, which may be either a character vector or a connection. The second option is useful if one wants to avoid loading the full text corpus in physical memory, allowing to process text from different sources such as files, compressed files or URLs.

The returned object is of class kgram_freqs (a thin wrapper around the internal C++ class where all k-gram computations take place). kgram_freqs objects have methods for querying bare k-gram frequencies (query) and maximum likelihood estimates of sentence probabilities or word continuation probabilities (see probability)) . More importantly kgram_freqs objects are used to create language_model objects, which support various probability smoothing techniques.

The function kgram_freqs() is used to construct a new kgram_freqs object, initializing it with the k-gram counts from the text input, whereas process_sentences() is used to add k-gram counts from a new text to an existing kgram_freqs object, freqs. In this second case, the initial object freqs can either be modified in place (for in_place == TRUE, the default) or by making a copy (in_place == FALSE), see the examples below. The final object is returned invisibly when modifying in place, visibly in the second case. It is worth to mention that modifying in place a kgram_freqs object freqs will also affect language_model objects created from freqs with language_model(), which will also be updated with the new information. If one wants to avoid this behaviour, one can make copies using either the kgram_freqs() copy constructor, or the in_place = FALSE argument.

The dict argument allows to provide an initial set of known words. Subsequently, one can either work with such a closed dictionary (open_dict == FALSE), or extended the dictionary with all new words encountered during k-gram processing (open_dict == TRUE) .

The .preprocess and .tknz_sent functions are applied before k-gram counting takes place, and are in principle arbitrary transformations of the original text. After preprocessing and sentence tokenization, each line of the transformed input is presented to the k-gram counting algorithm as a separate sentence (these sentences are implicitly padded with N - 1 Begin-Of-Sentence (BOS) and one End-Of-Sentence (EOS) tokens, respectively. This is illustrated in the examples). For basic usage, this package offers the utilities preprocess and tknz_sent. Notice that, strictly speaking, there is some redundancy in these two arguments, as the processed input to the k-gram counting algorithm is .tknz_sent(.preprocess(text)). They appear explicitly as separate arguments for two main reasons:

The presence of .tknz_sent is a reminder of the fact that sentences have to be explicitly separeted in different entries of the processed input, in order for kgram_freqs() to append the correct Begin-Of-Sentence and End-Of-Sentence paddings to each sentence.
At prediction time (e.g. with probability), by default only .preprocess is applied when computing conditional probabilities, whereas both .preprocess() and .tknz_sent() are applied when computing sentence absolute probabilities.

Examples

Run this code

# NOT RUN {
# Build a k-gram frequency table from a character vector

f <- kgram_freqs("a b b a a", 3)
f
summary(f)
query(f, c("a", "b")) # c(3, 2)
query(f, c("a b", "a" %+% EOS(), BOS() %+% "a b")) # c(1, 1, 1)
query(f, "a b b a") # NA (counts for k-grams of order k > 3 are not known)

process_sentences("b", f)
query(f, c("a", "b")) # c(3, 3): 'f' is updated in place

f1 <- process_sentences("b", f, in_place = FALSE)
query(f, c("a", "b")) # c(3, 3): 'f' is copied
query(f1, c("a", "b")) # c(3, 4): the new 'f1' stores the updated counts




# Build a k-gram frequency table from a file connection

# }
# NOT RUN {
f <- kgram_freqs(file("myfile.txt"), 3)
# }
# NOT RUN {

# Build a k-gram frequency table from an URL connection
# }
# NOT RUN {
### Shakespeare's "Much Ado About Nothing" (entire play)
con <- url("http://shakespeare.mit.edu/much_ado/full.html")

# Apply some basic preprocessing
.preprocess <- function(x) {
        # Remove character names and locations (boldfaced in original html)
        x <- gsub("<b>[A-z]+</b>", "", x)
        # Remove other html tags
        x <- gsub("<[^>]+>||<[^>]+$||^[^>]+>$", "", x)
        # Apply standard preprocessing including lower-case
        x <- kgrams::preprocess(x)
        return(x)
}

.tknz_sent <- function(x) {
        # Tokenize sentences keeping Shakespeare's punctuation
        x <- kgrams::tknz_sent(x, keep_first = TRUE)
        # Remove empty sentences
        x <- x[x != ""]
        return(x)
}

f <- kgram_freqs(con, 3, .preprocess, .tknz_sent, batch_size = 1000)
summary(f)

query(f, c("leonato", "thy", "smartphones")) # c(145, 52, 0)
# }

Run the code above in your browser using DataLab