This data set provides complete metadata for all 4048 texts of the British National Corpus (XML edition). See Aston & Burnard (1998) for more information about the BNC, or go to http://www.natcorp.ox.ac.uk/.
The data were extracted automatically from the original BNC source files. Some transformations were applied so that all attribute names and their values are given in human-readable form. The Perl scripts used in the extraction procedure are available from https://cwb.sourceforge.io/install.php#other.
BNCmeta
A data frame with 4048 rows and the columns listed below. Unless specified otherwise, columns are coded as factors.
id
:BNC document ID; character vector
title
:Title of the document; character vector
n_words
:Number of words in the document; integer vector
n_tokens
:Total number of tokens (including punctuation and deleted material); integer vector
n_w
:Number of w-units (words); integer vector
n_c
:Number of c-units (punctuation); integer vector
n_s
:Number of s-units (sentences); integer vector
publication_date
:Publication date
text_type
:Text type
context
:Spoken context
respondent_age
:Age-group of respondent
respondent_class
:Social class of respondent (NRS social grades)
respondent_sex
:Sex of respondent
interaction_type
:Interaction type
region
:Region
author_age
:Author age-group
author_domicile
:Domicile of author
author_sex
:Sex of author
author_type
:Author type
audience_age
:Audience age
domain
:Written domain
difficulty
:Written difficulty
medium
:Written medium
publication_place
:Publication place
sampling_type
:Sampling type
circulation
:Estimated circulation size
audience_sex
:Audience sex
availability
:Availability
mode
:Text mode (written/spoken)
derived_type
:Text class
genre
:David Lee's genre classification
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
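A minimal sketch of how the data frame might be inspected. It assumes that BNCmeta is distributed with the corpora package and that this package is installed; the package name is an assumption, as it is not stated above. Only base R functions are used otherwise.

```r
# Assumption: BNCmeta ships with the 'corpora' package (not stated in this page).
library(corpora)
data(BNCmeta, package = "corpora")

# Check the dimensions and the types of a few documented columns
dim(BNCmeta)
str(BNCmeta[, c("id", "title", "n_words", "mode", "genre")])

# Number of texts per text mode (written/spoken); 'mode' is a factor
table(BNCmeta$mode)

# Total number of words per text mode
aggregate(n_words ~ mode, data = BNCmeta, FUN = sum)

# Select the spoken texts only and summarise their length in words
spoken <- subset(BNCmeta, mode == "spoken")
summary(spoken$n_words)
```

Since most columns are factors, table() and subset() work on them directly; the level name "spoken" used in the last step is likewise an assumption and can be checked with levels(BNCmeta$mode).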