A multilingual text corpus of speeches from a 2010 European Parliament debate on coal subsidies, with individual crowd codings as the unit of observation. The sentences are drawn from officially translated speeches in a debate concerning a Commission report proposing an extension to a regulation permitting state aid to uncompetitive coal mines.
Each speech is available in six languages: English, German, Greek, Italian, Polish and Spanish. The unit of observation is the individual crowd coding of each natural sentence. For more information on the coding approach see Benoit et al. (2016).
data_corpus_EPcoaldebate
The corpus consists of 16,806 documents (i.e. codings of a sentence) and includes the following document-level variables:
character; a unique identifier for each sentence
factor; whether a coder labelled the sentence as "Pro-Subsidy", "Anti-Subsidy" or "Neutral or inapplicable"
factor; the language (translation) of the speech
character; speaker's last name
character; speaker's first name
factor; abbreviation of the EP party group of the speaker
factor; the speaker's country of origin
factor; the speaker's vote on the proposal (For/Against/Abstain/NA)
character; a unique identifier for each crowd coder
numeric; the coder's "trust score" from the CrowdFlower platform used to code the sentences, which can theoretically range between 0 and 1. Only coders with trust scores above 0.8 are included in the corpus.
A corpus object.
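As a minimal sketch of how the document count and the document-level variables described above might be inspected, the following assumes that the corpus ships with the quanteda.corpora package and that quanteda is installed; neither assumption is stated in this page. The helper `docvar_classes()` is hypothetical and introduced only for illustration.

```r
# Hypothetical helper (not part of any package): report the first storage
# class of each column in a data frame of document variables.
docvar_classes <- function(dv) {
  vapply(dv, function(x) class(x)[1], character(1))
}

# Assumption: the corpus is distributed in the quanteda.corpora package.
# The guard lets the sketch run without error when the packages are absent.
if (requireNamespace("quanteda", quietly = TRUE) &&
    requireNamespace("quanteda.corpora", quietly = TRUE)) {
  data("data_corpus_EPcoaldebate", package = "quanteda.corpora")

  # Number of documents, i.e. individual codings of sentences
  print(quanteda::ndoc(data_corpus_EPcoaldebate))

  # Class of each document-level variable (character, factor, numeric)
  print(docvar_classes(quanteda::docvars(data_corpus_EPcoaldebate)))
}
```

Because each document is one coding rather than one sentence, any per-sentence analysis would first need to aggregate codings by the sentence identifier.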
Benoit, K., Conway, D., Lauderdale, B.E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data. American Political Science Review, 110(2), 278-295. doi:10.1017/S0003055416000058