Learn R Programming

qlcMatrix (version 0.9.8)

bibles: A selection of bible-texts

Description

A selection of six bible texts as prepared by the paralleltext project.

Usage

data(bibles)

Arguments

Format

A list of five elements

verses

a character vector with all 43904 verse numbers as occurring throughout all translations as collected in the paralleltext project. This vector is used to align all texts to each other. The verse-numbers are treated as characters so ordering and matching works as expected.

eng

The English `Darby' Bible translation from 1890 by John Nelson Darby.

deu

The Bible in German. Schlachter Version von 1951. Genfer Bibelgesellschaft 1951.

tgl

The New Testament in Tagalog. Philippine Bible Society 1996.

aak

The New Testament in the Ankave language of Papua New Guinea. Wycliffe Bible Translators, Inc. 1990.

Details

Basically, all verse-numbering is harmonized, the text is unicode normalized, translations that capture multiple verses are included in the first of those verses, with the others left empty. Empty verses are thus a sign of combined translations. Verses that are not translated simply do not occur in the original files. Most importantly, the text are tokenized as to wordform, i.e. all punctuation and other non-word-based symbols are separated by spaces. In this way, space can be used for a quick wordform-based tokenization. The addition of spaces has been manually corrected to achieve a high precision of language-specific wordform tokenization.

The Bible texts are provided as named vectors of strings, each containing one verse. The names of the vector are codes for the verses. See Mayer & Cysouw (2014) for more information about the verse IDs and other formatting issues.

References

Mayer, Thomas and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. Proceedings of LREC 2014.

Examples

Run this code
# ----- load data -----

data(bibles)

# ----- separate into sparse matrices -----

# use splitText to turn a bible into a sparse matrix of wordforms x verses
E <- splitText(bibles$eng, simplify = TRUE, lowercase = FALSE)

# all wordforms from the first verse
# (internally using pure Unicode collation, i.e. ordering is determined by Unicode numbering)
which(E[,1] > 0)

# ----- co-occurrence across text -----

# how often do 'father' and 'mother' co-occur in one verse?
# (ignore warnings of chisq.test, because we are not interested in p-values here)

( cooc <- table(E["father",] > 0, E["mother",] > 0) )
suppressWarnings( chisq.test(cooc)$residuals )

# the function 'sim.words' does such computations efficiently 
# for all 15000 x 15000 pairs of words at the same time

system.time( sim <- sim.words(bibles$eng, lowercase = FALSE) )
sim["father", "mother"]

Run the code above in your browser using DataLab