corpora (version 0.6)

DistFeatBrownFam: Latent dimension scores from a distributional analysis of the Brown Family corpora

Description

This data frame provides unsupervised distributional features for each text in the extended Brown Family of corpora (Brown, LOB, Frown, FLOB, BLOB), covering edited written American and British English from the 1930s, 1960s and 1990s (see Xiao 2008, 395--397).

Latent topic dimensions were obtained by a method similar to Latent Semantic Indexing (Deerwester et al. 1990), applying singular value decomposition to bag-of-words vectors for the 2500 texts in the extended Brown Family. Register dimensions were obtained with the same methodology, using vectors of part-of-speech frequencies (separately for all verb-related tags and all other tags).
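
The general idea of deriving latent dimension scores by singular value decomposition can be sketched on a toy example. This is only an illustration under stated assumptions, not the procedure used to build this data set: the count matrix is made up, and the log-damping and column-centring steps are assumed preprocessing choices that the description above does not specify.

```r
# Toy bag-of-words counts: rows are texts, columns are words (made-up data)
counts <- matrix(
  c(4, 0, 1, 0,
    3, 1, 0, 0,
    0, 5, 0, 2,
    0, 4, 1, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("text", 1:4), c("court", "love", "tax", "heart"))
)

# Dampen raw frequencies and centre each column (assumed preprocessing)
X <- scale(log1p(counts), center = TRUE, scale = FALSE)

# Singular value decomposition: X = U D V'
dec <- svd(X)

# Scores of each text on the first two latent dimensions (U scaled by D)
n_dim <- 2
scores <- dec$u[, 1:n_dim] %*% diag(dec$d[1:n_dim])
dimnames(scores) <- list(rownames(counts), paste0("dim", 1:n_dim))
scores
```

Each row of `scores` plays the same role as a row of `top1`, `top2`, ... in the data frame: the coordinates of one text on the leading latent dimensions. Replacing the word counts with part-of-speech tag frequencies would give the analogue of the register dimensions.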

Usage

DistFeatBrownFam

Format

A data frame with 2500 rows and the following 23 columns:

id:

A unique ID for each text (also used as row name)

top1, top2, top3, top4, top5, top6, top7, top8, top9:

Latent dimension scores for the first 9 topic dimensions

reg1, reg2, reg3, reg4, reg5, reg6, reg7, reg8, reg9:

Latent dimension scores for the first 9 register dimensions (excluding verb-related tags)

vreg1, vreg2, vreg3, vreg4:

Latent dimension scores for the first 4 register dimensions based only on verb-related tags
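
The columns listed above can be selected by name. The following is a minimal sketch that assumes the corpora package is installed. The helper variables `topic_cols`, `reg_cols` and `vreg_cols` are illustrative names introduced here and are not part of the package.

```r
# Column groups as documented above (helper names are illustrative only)
topic_cols <- paste0("top", 1:9)
reg_cols   <- paste0("reg", 1:9)
vreg_cols  <- paste0("vreg", 1:4)

# Only run the data access if the corpora package is available
if (requireNamespace("corpora", quietly = TRUE)) {
  data("DistFeatBrownFam", package = "corpora")

  # Compare with the documented shape (2500 rows, 23 columns)
  print(dim(DistFeatBrownFam))

  # Summarise the first two topic dimensions
  print(summary(DistFeatBrownFam[, topic_cols[1:2]]))

  # Extract the verb-related register scores as a numeric matrix
  vreg_scores <- as.matrix(DistFeatBrownFam[, vreg_cols])
  print(head(vreg_scores))
}
```

Building the column names with `paste0()` avoids typing out all 22 score columns and keeps the three feature groups separate for later modelling or plotting.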

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Details

TODO

References

Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391--407.

Xiao, Richard (2008). Well-known and influential corpora. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 20, pages 383--457. Mouton de Gruyter, Berlin.