This data set lists 5102 frequent combinations of verbs and prepositional phrases (PP) extracted from a German newspaper corpus. The collocational status of each PP-verb combination was manually annotated by Brigitte Krenn (2000). In addition, pre-computed scores of several standard association measures are provided.
The KrennPPV candidate set forms part of the data used in the evaluation study of Evert & Krenn (2005).
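An evaluation of this kind can be illustrated with n-best precision: rank the candidates by an association measure and check what share of the top n are annotated as true collocations. The following is a minimal sketch, not the exact procedure of the cited study. It assumes that the data set is available from the corpora package, that higher scores indicate stronger association for every measure, and that the cutoff of 500 candidates is an arbitrary choice; nbest_precision is a hypothetical helper name.

```r
# n-best precision: proportion of true positives among the n highest-scoring
# candidates. `scores` and `is_tp` are parallel vectors; higher scores are
# assumed to indicate stronger association.
nbest_precision <- function(scores, is_tp, n) {
  stopifnot(length(scores) == length(is_tp), n >= 1, n <= length(scores))
  top <- order(scores, decreasing = TRUE)[seq_len(n)]
  mean(is_tp[top])
}

# Usage with the real data (assumption: the data set ships with the
# 'corpora' package; the block is skipped if that package is not installed).
if (requireNamespace("corpora", quietly = TRUE)) {
  data("KrennPPV", package = "corpora")
  measures <- c("freq", "MI", "Dice", "z.score", "t.score",
                "chisq", "chisq.corr", "log.like", "Fisher")
  # precision of each measure among its 500 top-ranked candidates
  print(sapply(measures, function(m) {
    nbest_precision(KrennPPV[[m]], KrennPPV$is.colloc, 500)
  }))
  # baseline: proportion of true collocations in the full candidate set
  print(mean(KrennPPV$is.colloc))
}
```

Comparing each measure's precision against the baseline proportion shows whether the ranking does better than picking candidates at random.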
KrennPPV
A data frame with 5102 rows and the following columns:
PP
:the prepositional phrase, represented by preposition and lemma of the nominal head (character).
Preposition-article fusion is indicated by a + sign. For example, the prepositional phrase im letzten Jahr would appear as in:Jahr in the data set.
verb
:the verb lemma (character). Separated particle verbs have been recombined.
is.colloc
:whether the PP-verb combination is a lexical collocation (logical)
is.SVC
:whether a PP-verb collocation is a support verb construction (logical)
is.figur
:whether a PP-verb collocation is a figurative expression (logical)
freq
:co-occurrence frequency of the PP-verb combination within clauses (integer)
MI
:Mutual Information association measure
Dice
:Dice coefficient association measure
z.score
:z-score association measure
t.score
:t-score association measure
chisq
:chi-squared association measure (without Yates' continuity correction)
chisq.corr
:chi-squared association measure (with Yates' continuity correction)
log.like
:log-likelihood association measure
Fisher
:Fisher's exact test as an association measure (negative logarithm of one-sided p-value)
See Evert (2008) and http://www.collocations.de/AM/ for details on these association measures.
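The PP encoding described above (preposition and noun lemma joined by a colon) can be unpacked into separate columns. This is a small sketch under the assumption that every PP string contains at least one colon and that the first colon is the separator. Any + fusion marker is left untouched, and split_pp is a hypothetical helper name, not part of the data set.

```r
# Split PP strings of the form "preposition:noun" at the first colon.
# Any "+" marking preposition-article fusion is kept as it appears.
split_pp <- function(pp) {
  pos <- as.integer(regexpr(":", pp, fixed = TRUE))
  stopifnot(all(pos > 0))  # every entry must contain a colon
  data.frame(
    prep = substr(pp, 1, pos - 1),
    noun = substr(pp, pos + 1, nchar(pp)),
    stringsAsFactors = FALSE
  )
}

# Example with the PP string quoted in the column description
split_pp("in:Jahr")
```

Applied to the PP column of the data frame, the result can be bound to the original rows with cbind() to group candidates by preposition or by noun.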
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
Evert, Stefan and Krenn, Brigitte (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4), 450–466.
Krenn, Brigitte (2000). The Usual Suspects: Data-Oriented Models for the Identification and Representation of Lexical Collocations, volume 7 of Saarbrücken Dissertations in Computational Linguistics and Language Technology. DFKI & Universität des Saarlandes, Saarbrücken, Germany.