dictionary: Word dictionaries

Description

Construct or coerce to and from a dictionary.

Usage

dictionary(text, ...)
# S3 method for character
dictionary(
  text,
  .preprocess = identity,
  size = NULL,
  cov = NULL,
  thresh = NULL,
  ...
)
# S3 method for connection
dictionary(
  text,
  .preprocess = identity,
  size = NULL,
  cov = NULL,
  thresh = NULL,
  batch_size = NULL,
  ...
)
as_dictionary(object)
# S3 method for kgrams_dictionary
as_dictionary(object)
# S3 method for character
as_dictionary(object)
# S3 method for kgrams_dictionary
as.character(x, ...)

Arguments

text

a character vector, a connection or missing/NULL. Source of text from which k-gram frequencies are to be extracted.

...

further arguments passed to or from other methods.

.preprocess

a function taking a character vector as input and returning a character vector as output. Optional preprocessing transformation applied to text before creating the dictionary.

size

either NULL or a positive integer. Predefined size of the required dictionary (the top size most frequent words are retained).

cov

either NULL or a number between 0 and 1. Predefined text coverage fraction of the dictionary (the most frequent words providing the required coverage are retained).

thresh

either NULL or a positive integer. Minimum word count threshold for including a word in the dictionary (only words occurring at least this many times are retained).

batch_size

a length one positive integer or NULL. Size of text batches when reading text from a file or a generic connection. If NULL, all input text is processed in a single batch.

object

an object to be coerced to dictionary.

x

a dictionary.

Value

A dictionary for dictionary() and as_dictionary(), a character vector for the as.character() method.

Details

These generic functions are used to build dictionaries from a text source, to coerce other formats to dictionary, and to coerce a dictionary to a character vector. Currently, the only non-trivial type coercible to dictionary is character, in which case each entry of the input vector is treated as a single word. Coercion from dictionary to character returns the words included in the dictionary as a regular character vector.

Dictionaries can be built from text coming either directly from a character vector or from a connection. The second option is useful for avoiding loading the full text corpus into physical memory, and it allows text to be processed from different sources such as files, compressed files or URLs.
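As a minimal sketch of the connection route, the snippet below writes a few lines to a temporary file and builds a dictionary from a file() connection, reading two lines per batch. The sample text and the batch size are made up for illustration, and the sketch assumes that the connection method opens and closes the connection itself, so an unopened connection is passed.

```r
library(kgrams)

# Hypothetical sample text written to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("the cat sat on the mat",
             "the dog sat on the log",
             "a bird sang"), path)

# Pass an unopened connection (assumption: dictionary() opens and closes it).
# batch_size = 2L reads the file two lines at a time.
dict <- dictionary(file(path), batch_size = 2L)

# Check membership of a word from the file and of a word absent from it.
in_dict <- query(dict, c("cat", "smartphones"))
in_dict

unlink(path)
```

With a real corpus, the same call should work with other connection types, for example gzfile() for compressed files or url() for remote text.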

A single preprocessing transformation can be applied before the text is processed for unique words. After preprocessing, anything delimited by one or more white space characters in the transformed text is counted as a word and may be added to the dictionary, subject to the additional constraints described below.
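The following sketch shows one way the .preprocess argument might be used. The helper clean() is hypothetical (it is not part of the package): it lower-cases the input and strips punctuation with base R functions, so that "The", "THE" and "the" collapse into a single word, and "dog." is not stored with its trailing period.

```r
library(kgrams)

text <- c("The cat saw THE dog.", "the dog ran")

# Hypothetical preprocessing function: lower-case, then remove punctuation.
clean <- function(x) gsub("[[:punct:]]", "", tolower(x))

dict <- dictionary(text, .preprocess = clean)

# Words are whatever remains between white space after preprocessing.
words <- sort(as.character(dict))
words

found <- query(dict, c("the", "dog", "cat"))
found
```

Because only a single transformation is accepted, several cleaning steps have to be composed into one function, as clean() does here.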

The possible constraints for including a word in the dictionary are of three types: (i) fixed dictionary size, set by the size argument; (ii) fixed text coverage fraction, set by the cov argument; or (iii) minimum word count threshold, set by the thresh argument. Only one of these constraints can be applied at a time: specifying more than one of size, cov or thresh raises an error.
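The three constraints can be compared on the much_ado text used in the Examples. This is only a sketch: the values 100, 0.5 and 5 are arbitrary choices, and the resulting dictionary sizes depend on the corpus, so no output is shown.

```r
library(kgrams)

# Each call applies exactly one constraint.
dict_size   <- dictionary(much_ado, size = 100L)  # the 100 most frequent words
dict_cov    <- dictionary(much_ado, cov = 0.5)    # most frequent words covering 50% of the text
dict_thresh <- dictionary(much_ado, thresh = 5L)  # word count threshold of 5

# Compare the resulting dictionary sizes.
sizes <- sapply(
  list(size = dict_size, cov = dict_cov, thresh = dict_thresh),
  length
)
sizes

# Specifying more than one constraint is documented to raise an error.
res <- tryCatch(
  dictionary(much_ado, size = 100L, cov = 0.5),
  error = function(e) "error"
)
res
```

Catching the error with tryCatch() keeps the script running, which is convenient when constraint values come from user input.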

Examples

# NOT RUN {
# Building a dictionary from Shakespeare's "Much Ado About Nothing"

dict <- dictionary(much_ado)
length(dict)
query(dict, "leonato") # TRUE
query(dict, c("thy", "thou")) # c(TRUE, TRUE)
query(dict, "smartphones") # FALSE

# Getting list of words as regular character vector
words <- as.character(dict)
head(words)

# Building a dictionary from a list of words
dict <- as_dictionary(c("i", "the", "a"))

# }