The corpora package provides a collection of functions for statistical inference from corpus frequency data, as well as some convenience functions and example data sets.
It is a companion package to the open-source course Statistical Inference: a Gentle Introduction for Linguists and similar creatures (SIGIL), originally developed by Marco Baroni and Stephanie Evert. The statistical methods implemented in the package are described and illustrated in the units of this course.
Starting with version 0.6, the package also includes best-practice implementations of various corpus-linguistic analysis techniques.
Stephanie Evert (https://purl.org/stephanie.evert)
An overview of some important functions and data sets included in the corpora package. See the package index for a complete listing.
keyness() provides reference implementations for best-practice keyness measures, including the recommended LRC measure (Evert 2022)
binom.pval() is a vectorised function that computes p-values of the binomial test more efficiently than binom.test (using central p-values in the two-sided case)
fisher.pval() is a vectorised function that efficiently computes p-values of Fisher's exact test on \(2\times 2\) contingency tables for large samples (using central p-values in the two-sided case)
prop.cint() is a vectorised function that computes multiple binomial confidence intervals much more efficiently than binom.test
z.score() and z.score.pval() can be used to carry out a z-test for a single proportion (as an approximation to binom.test)
chisq() and chisq.pval() are vectorised functions that compute the test statistic and p-value of a chi-squared test for \(2\times 2\) contingency tables more efficiently than chisq.test
cont.table() creates \(2\times 2\) contingency tables for frequency comparison tests that can be passed to chisq.test and fisher.test
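
The vectorised test functions listed above can be compared with their base R counterparts along the following lines. This is only a rough sketch: the frequency counts are invented for illustration, and the positional argument order (k, n, p for the one-sample functions; k1, n1, k2, n2 for the two-sample functions) is an assumption that should be checked against the help pages, e.g. ?binom.pval and ?chisq.pval.

```r
library(corpora)

# Hypothetical counts: k occurrences of some feature in samples of n tokens
k <- c(19, 27, 45)
n <- c(100, 100, 200)

# One call handles all samples (null hypothesis: p = 0.2) ...
p.vec <- binom.pval(k, n, 0.2)

# ... whereas binom.test() has to be applied to one sample at a time
p.base <- mapply(function(k, n) binom.test(k, n, p = 0.2)$p.value, k, n)

# Two-sided values may differ, since binom.pval() uses central p-values
print(cbind(p.vec, p.base))

# z-test approximation for the same samples
p.z <- z.score.pval(k, n, 0.2)
print(p.z)

# Binomial confidence intervals for all samples at once
ci <- prop.cint(k, n)
print(ci)

# Two-sample comparison: k1 out of n1 vs. k2 out of n2 (again invented counts)
ct <- cont.table(42, 1000, 17, 1000)
print(chisq.test(ct)$p.value)           # base R, one table at a time
print(chisq.pval(42, 1000, 17, 1000))   # vectorised counterpart
print(fisher.pval(42, 1000, 17, 1000))  # vectorised Fisher's exact test
```

With scalar arguments, cont.table() is expected to return a single \(2\times 2\) matrix, which is why it can be handed directly to chisq.test here.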
sample.df() extracts random samples of rows from a data frame
qw() splits a string on whitespace or a user-specified regular expression (similar to Perl's qw// construct)
corpora.palette() provides some nice colour palettes (better than R's default colours)
rowVector() and colVector() convert a vector into a single-row or single-column matrix
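
A minimal sketch of how the convenience functions above might be used. The input strings and the use of R's built-in mtcars data set are purely illustrative, and the argument name sep for the user-specified regular expression is an assumption; see ?qw and ?sample.df for the actual signatures.

```r
library(corpora)

# qw() splits on whitespace by default
words <- qw("alpha beta gamma")
print(words)

# Splitting on a user-specified regular expression
# (argument name `sep` is assumed here, see ?qw)
items <- qw("alpha, beta, gamma", sep = ",\\s*")
print(items)

# rowVector() / colVector() turn a plain vector into a 1 x n or n x 1 matrix
x <- c(1, 2, 3)
row.x <- rowVector(x)
col.x <- colVector(x)
print(dim(row.x))
print(dim(col.x))

# sample.df() draws a random subset of rows; mtcars ships with base R
set.seed(42)
smp <- sample.df(mtcars, 5)
print(smp)
```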
Several data sets based on the British National Corpus, including complete metadata for all 4048 text files (BNCmeta), per-text frequency counts for a number of linguistic corpus queries (BNCqueries), and relative frequencies of 65 lexico-grammatical features for each text (BNCbiber)
Frequency counts of passive constructions in all texts of the Brown and LOB corpora (BrownLOBPassives) for frequency comparison with regression models, complemented by distributional features (DistFeatBrownFam) as additional predictors
A small text corpus of Very Short Stories in the form of a data frame VSS, with one row for each token in the corpus
Small example tables to illustrate frequency comparison of lexical items (BNCcomparison) and collocation analysis (BNCInChargeOf)
KrennPPV is a data set of German verb-preposition-noun collocation candidates with manual annotation of true positives and pre-computed association scores
Three functions for generating large synthetic data sets used in the SIGIL course: simulated.census(), simulated.language.course() and simulated.wikipedia()
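
The data sets can be explored with standard R tools. The sketch below uses only base R's data() and str() and makes no assumptions about column names; it assumes that VSS can be loaded by name, as is usual for package data sets.

```r
library(corpora)

# List the names of all data sets shipped with the package
ds <- data(package = "corpora")$results[, "Item"]
print(ds)

# Load one data set explicitly and inspect its structure
data("VSS", package = "corpora")
str(VSS)
```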
The official homepage of the corpora package and the SIGIL course is http://SIGIL.R-Forge.R-Project.org/.