This data frame provides unsupervised distributional features for each text in the extended Brown Family of corpora (Brown, LOB, Frown, FLOB, BLOB), covering edited written American and British English from the 1930s, 1960s and 1990s (see Xiao 2008, 395--397).
Latent topic dimensions were obtained by a method similar to Latent Semantic Indexing (Deerwester et al. 1990), applying singular value decomposition to bag-of-words vectors for the 2500 texts in the extended Brown Family. Register dimensions were obtained with the same methodology, using vectors of part-of-speech frequencies (separately for all verb-related tags and all other tags).
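The general idea can be illustrated with a minimal sketch in base R. It is not the procedure actually used to build this data set: the toy counts, the log scaling and the column centring are all assumptions made for illustration, since the weighting scheme is not stated here.

```r
# Toy document-term matrix: 4 hypothetical texts x 5 hypothetical terms
# (made-up counts, not taken from any Brown Family corpus)
counts <- matrix(
  c(4, 0, 1, 0, 2,
    3, 1, 0, 0, 1,
    0, 5, 0, 3, 0,
    0, 4, 1, 2, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("text", 1:4), paste0("term", 1:5))
)

# Assumed preprocessing: log-scaled counts, centred per column
X <- scale(log1p(counts), center = TRUE, scale = FALSE)

# Singular value decomposition X = U D V'
decomp <- svd(X)

# Scores of each text on the first k latent dimensions: U_k D_k
k <- 2
scores <- decomp$u[, 1:k, drop = FALSE] %*% diag(decomp$d[1:k], nrow = k)
dimnames(scores) <- list(rownames(counts), paste0("top", 1:k))
scores
```

Each row of `scores` holds the coordinates of one text in the reduced space, analogous to the `top*` columns described below. Replacing the term counts with part-of-speech tag frequencies would give the analogue of the register dimensions.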
DistFeatBrownFam
A data frame with 2500 rows and the following 23 columns:
id
:A unique ID for each text (also used as row name)
top1, top2, top3, top4, top5, top6, top7, top8, top9
:latent dimension scores for the first 9 topic dimensions
reg1, reg2, reg3, reg4, reg5, reg6, reg7, reg8, reg9
:latent dimension scores for the first 9 register dimensions (excluding verb-related tags)
vreg1, vreg2, vreg3, vreg4
:latent dimension scores for the first 4 register dimensions based only on verb-related tags
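Because the three groups of dimensions are distinguished only by their column-name prefix, they can be separated with a pattern match on the names. The sketch below uses a small stand-in data frame with the documented column layout and random placeholder values, so that it runs without the real data. The helper `feature_blocks()` is hypothetical and not part of any package.

```r
# Stand-in data frame with the documented 23-column layout.
# Values are random placeholders, NOT the real dimension scores.
set.seed(1)
cols <- c(paste0("top", 1:9), paste0("reg", 1:9), paste0("vreg", 1:4))
toy <- data.frame(
  id = paste0("text", 1:5),
  matrix(rnorm(5 * length(cols)), nrow = 5, dimnames = list(NULL, cols)),
  stringsAsFactors = FALSE
)
rownames(toy) <- toy$id

# Hypothetical helper: one numeric matrix per block of dimensions.
# The anchored pattern "^reg" keeps the vreg* columns out of the reg* block.
feature_blocks <- function(df) {
  patterns <- c(topic = "^top[0-9]+$",
                register = "^reg[0-9]+$",
                verb_register = "^vreg[0-9]+$")
  lapply(patterns, function(p) {
    as.matrix(df[, grepl(p, names(df)), drop = FALSE])
  })
}

blocks <- feature_blocks(toy)
sapply(blocks, ncol)  # number of columns in each block
```

With the real data loaded, the same call would be `feature_blocks(DistFeatBrownFam)`.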
Stephanie Evert (https://purl.org/stephanie.evert)
TODO
Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391--407.
Xiao, Richard (2008). Well-known and influential corpora. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 20, pages 383--457. Mouton de Gruyter, Berlin.