kgrams (version 0.1.0)

sample_sentences: Random Text Generation

Description

Sample sentences from a language model's probability distribution.

Usage

sample_sentences(model, n, max_length, t = 1)

Arguments

model

either an object of class language_model, or a kgram_freqs object. The language model from which probabilities are computed.

n

an integer. Number of sentences to sample.

max_length

an integer. Maximum length of sampled sentences.

t

a positive number. Sampling temperature (optional); see Details.

Value

a character vector of length n. Random sentences generated from the language model's distribution.

Details

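This function samples sentences according to the prescribed language model's probability distribution, with an optional temperature parameter. The temperature transform of a probability distribution p is defined by p(t) = exp(log(p) / t) / Z(t), where Z(t) is the partition function, fixed by the normalization condition sum(p(t)) = 1. At t = 1 the distribution is unchanged; higher temperatures flatten it toward the uniform distribution, while lower temperatures concentrate probability on the most likely words.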
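The temperature transform can be illustrated in a few lines of base R. This is a minimal sketch for a single discrete distribution; the helper name temper() is hypothetical and is not part of kgrams, whose internal implementation may differ.

```r
# Temperature transform of a discrete probability vector.
# temper() is an illustrative helper, not a kgrams function.
temper <- function(p, t = 1) {
  stopifnot(t > 0, all(p >= 0), sum(p) > 0)
  logw <- log(p) / t          # log(p) / t; log(0) = -Inf keeps zero mass at zero
  logw <- logw - max(logw)    # shift for numerical stability; cancels in Z(t)
  w <- exp(logw)
  w / sum(w)                  # divide by the partition function Z(t)
}

p <- c(a = 0.7, b = 0.2, c = 0.1)
temper(p, t = 1)     # same as p
temper(p, t = 100)   # close to uniform
temper(p, t = 0.01)  # nearly all mass on "a"
```

Working with log(p) / t and subtracting the maximum before exponentiating avoids overflow or underflow at very small t; the shift cancels out in the normalization.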

Sampling is performed word by word, using the already sampled string as context, starting from the Begin-Of-Sentence context (i.e. N - 1 BOS tokens, where N is the order of the k-gram model). Sampling stops either when an End-Of-Sentence token is encountered, or when the string exceeds max_length, in which case a truncated output is returned.
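The word-by-word procedure can be sketched with a toy bigram model (N = 2) in base R. Everything here is an assumption made for illustration: the probability table, the token strings "<BOS>" and "<EOS>", and the helper sample_one() are made up, and the way kgrams marks truncated output may differ from this sketch.

```r
# Toy bigram continuation probabilities: rows are contexts, columns are
# candidate next words. All values are invented for illustration.
BOS <- "<BOS>"
EOS <- "<EOS>"
probs <- rbind(
  "<BOS>" = c(the = 0.8, cat = 0.1, sleeps = 0.1, "<EOS>" = 0.0),
  the     = c(the = 0.0, cat = 0.9, sleeps = 0.1, "<EOS>" = 0.0),
  cat     = c(the = 0.1, cat = 0.0, sleeps = 0.6, "<EOS>" = 0.3),
  sleeps  = c(the = 0.1, cat = 0.0, sleeps = 0.0, "<EOS>" = 0.9)
)

# sample_one() is a hypothetical helper, not a kgrams function.
sample_one <- function(probs, max_length) {
  context <- BOS                 # N - 1 = 1 BOS token for a bigram model
  words <- character(0)
  while (length(words) < max_length) {
    nxt <- sample(colnames(probs), size = 1, prob = probs[context, ])
    if (nxt == EOS) break        # stop at End-Of-Sentence
    words <- c(words, nxt)
    context <- nxt               # the sampled word becomes the new context
  }
  paste(words, collapse = " ")   # cut off here if max_length was reached
}

set.seed(1)
replicate(3, sample_one(probs, max_length = 10))
```

For a higher-order model the context would hold the last N - 1 sampled words instead of a single word, padded with BOS tokens at the start of the sentence.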

A word of caution on some special smoothers. The 'sbo' smoother (Stupid Backoff) does not produce normalized continuation probabilities, but rather continuation scores; sampling is performed here by assuming that Stupid Backoff scores are proportional to actual probabilities. The 'ml' smoother (Maximum Likelihood) does not assign probabilities when the k-gram count of the context is zero; when this happens, the next word is chosen uniformly at random from the model's dictionary.
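Both special cases amount to a small change in how the next word is drawn. The following base R sketch shows the idea under stated assumptions: next_word() is a hypothetical helper, and representing an undefined distribution by NA or all-zero scores is a choice made here for illustration, not a description of the kgrams internals.

```r
# Draw the next word from possibly unnormalized scores, falling back to a
# uniform draw when no distribution is available for the context.
# next_word() is an illustrative helper, not a kgrams function.
next_word <- function(scores, dictionary) {
  if (anyNA(scores) || sum(scores) <= 0) {
    return(sample(dictionary, size = 1))        # uniform over the dictionary
  }
  # sample() treats prob as weights, so scores need not sum to one
  sample(dictionary, size = 1, prob = scores)
}

dictionary <- c("a", "b", "c")
set.seed(42)
next_word(c(2, 1, 1), dictionary)       # weights proportional to the scores
next_word(c(NA, NA, NA), dictionary)    # undefined scores: uniform fallback
```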

Examples

# Sample sentences from an 8-gram Kneser-Ney model trained on Shakespeare's
# "Much Ado About Nothing"

# NOT RUN {
### Load the package, prepare the model and set seed
library(kgrams)
freqs <- kgram_freqs(much_ado, 8, .tknz_sent = tknz_sent)
model <- language_model(freqs, "kn", D = 0.75)
set.seed(840)

sample_sentences(model, n = 3, max_length = 10)

### Sampling at high temperature
sample_sentences(model, n = 3, max_length = 10, t = 100)

### Sampling at low temperature
sample_sentences(model, n = 3, max_length = 10, t = 0.01)

# }
