kgrams

kgrams provides tools for training and evaluating k-gram language models, including several probability smoothing methods, perplexity computations, random text generation and more. It is based on a C++ backend (which can itself be used as a standalone library for k-gram based NLP), making kgrams fast, coupled with an accessible R API that aims to streamline the process of model building. It is suitable for small- and medium-sized NLP experiments, baseline model building, and pedagogical purposes.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("vgherard/kgrams")

Example

This example shows how to train a modified Kneser-Ney 4-gram model on Shakespeare’s “Much Ado About Nothing” using kgrams.

library(kgrams)
# Get k-gram frequency counts from text, for k = 1:4
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)

We can now use this language_model to compute sentence and word continuation probabilities:

# Compute sentence probabilities
probability(c("did he break out into tears ?",
              "we are predicting sentence probabilities ."
              ), 
            model = mkn
            )
#> [1] 2.466856e-04 1.184963e-20
# Compute word continuation probabilities
probability(c("tears", "pieces") %|% "did he break out into", model = mkn)
#> [1] 9.389238e-01 3.834498e-07

Here are some sentences sampled from the language model’s distribution at temperatures t = c(1, 0.1, 10):

# Sample sentences at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
#> [1] "i have studied eight or nine truly by your office [...] (truncated output)"
#> [2] "ere you go : <EOS>"                                                        
#> [3] "don pedro welcome signior : <EOS>"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
#> [1] "i will not be sworn but love may transform me [...] (truncated output)" 
#> [2] "i will not fail . <EOS>"                                                
#> [3] "i will go to benedick and counsel him to fight [...] (truncated output)"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
#> [1] "july cham's incite start ancientry effect torture tore pains endings [...] (truncated output)"   
#> [2] "lastly gallants happiness publish margaret what by spots commodity wake [...] (truncated output)"
#> [3] "born all's 'fool' nest praise hurt messina build afar dancing [...] (truncated output)"
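The package also supports perplexity computations for model evaluation, as mentioned in the introduction. A minimal sketch, assuming the `mkn` model trained above and that `perplexity()` accepts raw text input, as its listing in the package reference suggests (the use of `kgrams::midsummer` as an out-of-domain test set is illustrative):

```r
library(kgrams)
# Train the same modified Kneser-Ney 4-gram model as above
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)
# Evaluate on held-out text: lower perplexity means the model
# assigns higher probability to the test corpus.
perplexity(kgrams::midsummer, model = mkn)
```

Comparing perplexities across smoothers (see `smoothers()`) is a common way to pick a baseline model.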

Getting Help

For further help, you can consult the reference page on the kgrams website or open an issue on the kgrams GitHub repository. A vignette on the website illustrates the process of building language models in depth.

Development

This project is at an early stage of development: thorough tests of the algorithms and unit tests still need to be implemented, many computations leave room for optimization, and the API may change. If you feel like contributing to kgrams, here is some useful information.

Development of kgrams takes place on its GitHub repository. If you find a bug, please let me know by opening an issue, and if you have any ideas or proposals for improvement, please feel welcome to send a pull request, or simply an e-mail at vgherard@sissa.it.

Install

install.packages('kgrams')

Monthly Downloads: 502

Version: 0.1.0

License: GPL (>= 3)

Maintainer: Valerio Gherardi

Last Published: February 15th, 2021

Functions in kgrams (0.1.0)

%+%: String concatenation

much_ado: Much Ado About Nothing

language_model: k-gram Language Models

parameters: Language Model Parameters

dictionary: Word dictionaries

kgram_freqs: k-gram Frequency Tables

kgrams-package: kgrams: Classical k-gram Language Models

midsummer: A Midsummer Night's Dream

perplexity: Language Model Perplexities

sample_sentences: Random Text Generation

probability: Language Model Probabilities

word_context: Word-context conditional expression

query: Query k-gram frequency tables or dictionaries

EOS: Special Tokens

tknz_sent: Sentence tokenizer

preprocess: Text preprocessing

smoothers: k-gram Probability Smoothers