BrownSubsets: Brown Corpus Subset Frequency Data (zipfR)

Description

Objects of classes spc and vgc that contain frequency data for various subsets of words from the Brown corpus (see Kucera and Francis 1967).

Details

BrownAdj.spc, BrownNoun.spc and BrownVer.spc are frequency spectra of all the Brown corpus words tagged as adjectives, nouns and verbs, respectively. BrownAdj.emp.vgc, BrownNoun.emp.vgc and BrownVer.emp.vgc are the corresponding observed vocabulary growth curves. Like all the objects with suffix .emp.vgc described below, they track the development of the vocabulary size V and the number of hapax legomena V(1).
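As a minimal sketch of how these objects might be inspected, the following uses the zipfR accessor functions N(), V() and Vm() on the adjective data. It assumes that the zipfR package is installed; the choice of the adjective subset is arbitrary, and the noun and verb objects can be handled in the same way.

```r
library(zipfR)

data(BrownAdj.spc)
data(BrownAdj.emp.vgc)

# Totals taken from the frequency spectrum of adjectives
N(BrownAdj.spc)       # sample size: number of adjective tokens
V(BrownAdj.spc)       # vocabulary size: number of distinct adjective types
Vm(BrownAdj.spc, 1)   # number of hapax legomena, V(1)

# The observed growth curve stores V and V(1) at increasing sample sizes
head(N(BrownAdj.emp.vgc))       # sample sizes at which the curve was recorded
head(V(BrownAdj.emp.vgc))       # vocabulary size at those sample sizes
head(Vm(BrownAdj.emp.vgc, 1))   # hapax counts at those sample sizes
```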

BrownImag.spc and BrownInform.spc are frequency spectra of the Brown corpus words subdivided into the two main stylistic partitions of the corpus, i.e., imaginative and informative prose, respectively. BrownImag.emp.vgc and BrownInform.emp.vgc are the corresponding observed vocabulary growth curves.

Brown100k.spc is the spectrum of the first 100,000 tokens in the Brown corpus (useful, e.g., for extrapolation experiments in which an LNRE model is estimated on a subset of the available data). The corresponding observed growth curve can easily be obtained from the one for the whole Brown corpus (Brown.emp.vgc).
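The following sketch illustrates one way such an extrapolation experiment could be set up. It assumes that the zipfR package is installed and that the points of Brown.emp.vgc follow the same token order as Brown100k.spc; the object names Brown100k.emp.vgc, zm and zm.vgc are introduced here for illustration only, and the Zipf-Mandelbrot model is just one of the available LNRE model types.

```r
library(zipfR)

data(Brown.emp.vgc)
data(Brown100k.spc)

# Keep only the growth-curve points up to 100,000 tokens
# (Brown100k.emp.vgc is an illustrative name, not a data set shipped with zipfR)
keep <- N(Brown.emp.vgc) <= 100000
Brown100k.emp.vgc <- vgc(
  N  = N(Brown.emp.vgc)[keep],
  V  = V(Brown.emp.vgc)[keep],
  Vm = list(Vm(Brown.emp.vgc, 1)[keep])
)
summary(Brown100k.emp.vgc)

# Estimate a Zipf-Mandelbrot LNRE model on the 100k-token spectrum
zm <- lnre("zm", Brown100k.spc)

# Expected vocabulary growth at the sample sizes of the full corpus curve
zm.vgc <- lnre.vgc(zm, N(Brown.emp.vgc))

# Compare expected and observed vocabulary size at the largest sample size
tail(V(zm.vgc), 1)
tail(V(Brown.emp.vgc), 1)
```

Comparing the two final values gives a rough impression of how well a model estimated on the first 100,000 tokens extrapolates to the full corpus.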

Note that numbers and other forms of non-linguistic material were removed before any data were collected from the Brown corpus.

References

Kucera, H. and Francis, W.N. (1967). Computational analysis of present-day American English. Brown University Press, Providence.

Examples

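library(zipfR)

# Frequency spectrum and observed growth curve for adjectives
data(BrownAdj.spc)
summary(BrownAdj.spc)

data(BrownAdj.emp.vgc)
summary(BrownAdj.emp.vgc)

# Frequency spectrum and observed growth curve for informative prose
data(BrownInform.spc)
summary(BrownInform.spc)

data(BrownInform.emp.vgc)
summary(BrownInform.emp.vgc)

# Frequency spectrum of the first 100,000 tokens
data(Brown100k.spc)
summary(Brown100k.spc)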
