corpora (version 0.6)

VSS: A small corpus of very short stories with linguistic annotations

Description

This data set contains a small corpus (8043 tokens) of short stories from the collection Very Short Stories (VSS, see http://www.schtepf.de/History/pages/stories.html). The text was automatically segmented (tokenised) and annotated with part-of-speech tags (from the Penn tagset) and lemmas (base forms), using the IMS TreeTagger (Schmid 1994) and a custom lemmatizer.

Usage

VSS

Format

A data set with 8043 rows corresponding to tokens and the following columns:

word:

the word form (or surface form) of the token

pos:

the part-of-speech tag of the token (Penn tagset)

lemma:

the lemma (or base form) of the token

sentence:

number of the sentence in which the token occurs (integer)

story:

title of the story to which the token belongs (factor)
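
As a quick orientation, the columns described above can be inspected as in the following minimal sketch. It assumes the corpora package is installed and that VSS becomes available once the package is attached (as the Usage section suggests); the variable names tokens_per_story and sentences_per_story are illustrative only.

```r
# Assumption: the corpora package is installed, e.g. via install.packages("corpora")
library(corpora)

# One row per token; show the column types and the first few tokens
str(VSS)
head(VSS)

# Number of tokens in each story, largest first
tokens_per_story <- table(VSS$story)
sort(tokens_per_story, decreasing = TRUE)

# Number of distinct sentence numbers within each story
sentences_per_story <- tapply(VSS$sentence, VSS$story,
                              function(s) length(unique(s)))
sentences_per_story
```

Counting distinct sentence numbers per story gives the same result whether the numbering restarts with each story or runs through the whole corpus.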

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Details

The Penn tagset defines the following part-of-speech tags:

CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NP    Proper noun, singular
NPS   Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PP    Personal pronoun
PP$   Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
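
The tags listed above can be tabulated and, if a coarser view is useful, collapsed by prefix. The sketch below is one possible approach, not part of the package: the grouping into "noun", "verb", "adjective", "adverb" and "other" is an assumption made for illustration, and the names pos_freq and coarse are hypothetical.

```r
# Assumption: the corpora package is installed and provides the VSS data set
library(corpora)

# Frequency of each part-of-speech tag, most frequent first
pos <- as.character(VSS$pos)
pos_freq <- sort(table(pos), decreasing = TRUE)
head(pos_freq, 10)

# Collapse fine-grained tags into coarse classes by tag prefix.
# This grouping is an illustrative assumption, not defined by the package.
coarse <- ifelse(grepl("^N",  pos), "noun",
          ifelse(grepl("^V",  pos), "verb",
          ifelse(grepl("^JJ", pos), "adjective",
          ifelse(grepl("^RB", pos), "adverb", "other"))))
table(coarse)

# Example: the most frequent noun lemmas in the corpus
head(sort(table(VSS$lemma[coarse == "noun"]), decreasing = TRUE), 10)
```

Matching on prefixes keeps the mapping short: "^N" covers NN, NNS, NP and NPS, while "^RB" covers RB, RBR and RBS without catching the particle tag RP. Punctuation and any tags not in the list above fall into "other".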

References

Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49.