This data set contains bigrams of adjacent word forms from the Brown corpus of written American English (Francis & Kucera 1964). Co-occurrence frequencies are specified in the form of an observed contingency table, using the notation suggested by Evert (2008).
Only bigrams that occur at least 5 times in the corpus are included.
BrownBigrams
A data frame with 24167 rows and the following columns:
id
:unique ID of the bigram entry
word1
:the first word form in the bigram (character)
pos1
:part-of-speech category of the first word (factor)
word2
:the second word form in the bigram (character)
pos2
:part-of-speech category of the second word (factor)
O11
:co-occurrence frequency of the bigram (numeric)
O12
:occurrences of the first word without the second (numeric)
O21
:occurrences of the second word without the first (numeric)
O22
:number of bigram tokens containing neither the first nor the second word (numeric)
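The four observed cells are enough to recover the row and column totals and the frequencies expected under independence, which most association measures build on. The following is a minimal Python sketch of that arithmetic, not part of the data set or its package; the function names and the example counts are hypothetical and chosen only so the numbers are easy to check by hand.

```python
from math import log2

def expected_frequencies(o11, o12, o21, o22):
    """Expected cell counts under independence, from an observed 2x2 table."""
    r1 = o11 + o12          # all bigrams whose first word is word1
    r2 = o21 + o22          # all bigrams whose first word is not word1
    c1 = o11 + o21          # all bigrams whose second word is word2
    c2 = o12 + o22          # all bigrams whose second word is not word2
    n = r1 + r2             # sample size (total number of bigram tokens)
    return {
        "E11": r1 * c1 / n,
        "E12": r1 * c2 / n,
        "E21": r2 * c1 / n,
        "E22": r2 * c2 / n,
        "N": n,
    }

def pointwise_mi(o11, o12, o21, o22):
    """Pointwise mutual information: log2 of observed over expected co-occurrence."""
    e11 = expected_frequencies(o11, o12, o21, o22)["E11"]
    return log2(o11 / e11)

# Hypothetical counts, not taken from the Brown corpus.
example = expected_frequencies(10, 90, 40, 860)
score = pointwise_mi(10, 90, 40, 860)
```

With these made-up counts the bigram is observed 10 times against an expected 5, so its pointwise MI score is 1 bit.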
Stephanie Evert (https://purl.org/stephanie.evert)
Part-of-speech categories are identified by single-letter codes, corresponding to the first character of the Penn tagset.
Some important POS codes are N (noun), V (verb), J (adjective), R (adverb or particle), I (preposition), D (determiner), W (wh-word) and M (modal).
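As an illustration of how the single-letter codes relate to full Penn tags, the sketch below reduces a tag to its first character and looks up a description. It is an assumption about how such a mapping could be written in Python, not the procedure used to build the data set; `POS_LABELS`, `pos_code` and `describe` are hypothetical names.

```python
# Descriptions for the codes listed above; other codes are reported as "other".
POS_LABELS = {
    "N": "noun",
    "V": "verb",
    "J": "adjective",
    "R": "adverb or particle",
    "I": "preposition",
    "D": "determiner",
    "W": "wh-word",
    "M": "modal",
}

def pos_code(penn_tag):
    """Reduce a Penn tag such as 'NNS' to its single-letter category code."""
    return penn_tag[:1].upper()

def describe(penn_tag):
    """Return the description of a tag's category, or 'other' if it is not listed."""
    return POS_LABELS.get(pos_code(penn_tag), "other")

# A few standard Penn tags and the category each one falls into.
examples = {tag: describe(tag) for tag in ["NN", "VBD", "JJ", "RP", "IN", "MD"]}
```

Here `RP` (particle) and `RB` (adverb) both reduce to `R`, which is why that code covers two categories.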
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.