corpora (version 0.6)

simulated.wikipedia: Simulated type and token counts for Wikipedia articles (corpora)

Description

This function generates type and token counts, token-type ratios (TTR) and average word length for simulated articles from the English Wikipedia. Simulation parameters are based on data from the Wackypedia corpus.

The generated data set is usually named WackypediaStats (see code examples below) and is used for various exercises and illustrations in the SIGIL course.

Usage

simulated.wikipedia(N=1429649, length=c(100,1000), seed.rng=42)

Value

A data frame with N rows corresponding to Wikipedia articles and the following columns:

tokens:

number of word tokens in the article

types:

number of distinct word types in the article

ttr:

token-type ratio (TTR) for the article

avglen:

average word length in characters (averaged across tokens)
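
As a rough illustration of what the four columns measure, the sketch below computes them for a toy token vector. The vector `words` is made up for this example, and the sketch assumes that the token-type ratio is the number of tokens divided by the number of types, as the name suggests. It is not the simulation code used by the package.

```r
# hypothetical toy "article", already tokenised (not Wikipedia data)
words <- c("the", "cat", "sat", "on", "the", "mat")

tokens <- length(words)           # number of word tokens
types  <- length(unique(words))   # number of distinct word types
ttr    <- tokens / types          # assumption: TTR = tokens / types
avglen <- mean(nchar(words))      # average word length, averaged across tokens

# one-row data frame with the same column names as described above
toy_stats <- data.frame(tokens = tokens, types = types, ttr = ttr, avglen = avglen)
print(toy_stats)
```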

Arguments

N

population size, i.e. total number of Wikipedia articles

length

a numeric vector of length 2, specifying the typical range of Wikipedia article lengths

seed.rng

seed for the random number generator, so data sets with the same parameters (N and length) are reproducible
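
A short sketch of how the three arguments might be combined, assuming the corpora package is installed. The values `N = 1000` and `seed.rng = 1` are arbitrary choices for illustration, and the check simply tests the reproducibility described for `seed.rng`.

```r
library(corpora)

# a small population keeps the example fast; N = 1000 is an arbitrary choice
a <- simulated.wikipedia(N = 1000, length = c(100, 1000), seed.rng = 1)
b <- simulated.wikipedia(N = 1000, length = c(100, 1000), seed.rng = 1)

nrow(a)          # one row per simulated article
identical(a, b)  # should be TRUE if the seed fully determines the simulation
```

Changing `seed.rng` while keeping `N` and `length` fixed should give a different data set of the same size.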

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Details

The default population size corresponds to the subset of the Wackypedia corpus from which the simulation parameters were obtained. This subset excludes all articles with extreme type-token statistics (very short or very long articles, articles with extremely long words, etc.).

Article lengths are sampled from a lognormal distribution which is scaled so that the central 95% of the values fall into the range specified by the length argument.
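
One way such a scaling can be worked out is sketched below. It assumes that the central 95% means equal tails of 2.5% on each side, so that the two limits sit at the mean plus or minus `qnorm(0.975)` standard deviations on the log scale. The variable names are made up for this sketch, and the package's own implementation may differ in detail (for example in how lengths are rounded).

```r
length_range <- c(100, 1000)   # same as the default of the length argument
z <- qnorm(0.975)              # about 1.96: central 95% of a normal distribution

# on the log scale the interval is meanlog +/- z * sdlog
meanlog <- mean(log(length_range))
sdlog   <- diff(log(length_range)) / (2 * z)

# the 2.5% and 97.5% quantiles recover the requested range
qlnorm(c(0.025, 0.975), meanlog = meanlog, sdlog = sdlog)

# sample some lengths and check the share that falls inside the range
set.seed(42)
lens <- rlnorm(100000, meanlog = meanlog, sdlog = sdlog)
inside <- mean(lens >= length_range[1] & lens <= length_range[2])
inside   # should be close to 0.95
```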

The simulated data are surprisingly close to the original Wackypedia statistics.

References

The Wackypedia corpus can be obtained from https://wacky.sslmit.unibo.it/doku.php?id=corpora.

Examples

WackypediaStats <- simulated.wikipedia()
summary(WackypediaStats)

# consistency check: the default call should return one row per article
stopifnot(nrow(WackypediaStats) == 1429649)