BrownBigrams: Bigrams of adjacent words from the Brown corpus

Description

This data set contains bigrams of adjacent word forms from the Brown corpus of written American English (Francis & Kucera 1964). Co-occurrence frequencies are specified in the form of an observed contingency table, using the notation suggested by Evert (2008).

Only bigrams that occur at least 5 times in the corpus are included.

Usage

BrownBigrams

Format

A data frame with 24167 rows and the following columns:

id:: unique ID of the bigram entry
word1:: the first word form in the bigram (character)
pos1:: part-of-speech category of the first word (factor)
word2:: the second word form in the bigram (character)
pos2:: part-of-speech category of the second word (factor)
O11:: co-occurrence frequency of the bigram (numeric)
O12:: occurrences of the first word without the second (numeric)
O21:: occurrences of the second word without the first (numeric)
O22:: number of bigram tokens containing neither the first nor the second word (numeric)
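As an illustration of how the four cells combine, the following sketch derives the row total, column total, sample size and expected co-occurrence frequency in the notation of Evert (2008), and then a simple pointwise MI score. The data frame toy and all of its counts are made up for this example and are not taken from BrownBigrams; the same expressions should apply to the real data frame, since they rely only on the columns O11, O12, O21 and O22 listed above.

```r
# Toy rows with made-up counts (NOT real Brown corpus figures);
# only the column names follow the Format list above.
toy <- data.frame(
  word1 = c("a", "b"),
  word2 = c("x", "y"),
  O11 = c(10, 5),
  O12 = c(90, 45),
  O21 = c(40, 15),
  O22 = c(860, 935),
  stringsAsFactors = FALSE
)

# Marginal frequencies and sample size of each 2x2 contingency table
toy <- transform(toy,
  R1 = O11 + O12,             # all bigrams with word1 in first position
  C1 = O11 + O21,             # all bigrams with word2 in second position
  N  = O11 + O12 + O21 + O22  # total number of bigram tokens
)

# Expected co-occurrence frequency under independence
toy$E11 <- toy$R1 * toy$C1 / toy$N

# Pointwise mutual information (MI), one simple association measure
toy$MI <- log2(toy$O11 / toy$E11)

print(toy[, c("word1", "word2", "O11", "E11", "MI")])
```

Because N is the sum of all four cells, it should come out identical for every row of the real data set; checking this is a quick sanity test after loading the data.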

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Details

Part-of-speech categories are identified by single-letter codes, each corresponding to the first character of the tag in the Penn tagset.

Some important POS codes are N (noun), V (verb), J (adjective), R (adverb or particle), I (preposition), D (determiner), W (wh-word) and M (modal).
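The POS codes make it easy to restrict the data to a particular syntactic pattern. The sketch below selects adjective-noun bigrams (J followed by N) and cross-tabulates the two POS columns. It runs on a small hypothetical data frame toy whose words and counts are invented for illustration; with the real data, BrownBigrams would take the place of toy.

```r
# Hypothetical example rows (words and counts are invented, not Brown data);
# pos1 and pos2 are factors, as described in the Format section.
toy <- data.frame(
  word1 = c("old", "of", "would", "young"),
  pos1  = factor(c("J", "I", "M", "J")),
  word2 = c("man", "the", "be", "people"),
  pos2  = factor(c("N", "D", "V", "N")),
  O11   = c(12, 9000, 300, 8),
  stringsAsFactors = FALSE
)

# Keep only adjective + noun bigrams
adj_noun <- subset(toy, pos1 == "J" & pos2 == "N")

# Sort by co-occurrence frequency, most frequent first
adj_noun <- adj_noun[order(adj_noun$O11, decreasing = TRUE), ]
print(adj_noun)

# Number of bigram types for each POS combination
pos_table <- table(toy$pos1, toy$pos2)
print(pos_table)
```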

References

Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.

Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.