The vocabulary of this DSM covers several basic evaluation tasks, including RG65, WordSim353 and ESSLLI08_Nouns, as well as the target nouns bank and vessel from SemCorWSD. In addition, 40 nearest neighbours each of the words white_J, apple_N, kindness_N and walk_V are included.
Co-occurrence frequency data were extracted from a collection of Web corpora with a total size of ca. 9 billion words, using an L4/R4 surface window and 30,000 lexical words as feature terms. They were scored with sparse simple log-likelihood with an additional log transformation, normalized to Euclidean unit length, and projected into 1000 latent dimensions using randomized SVD (see rsvd). For size reasons, the vectors have been compressed into 50 latent dimensions and renormalized.
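The scoring and renormalization steps described above can be illustrated with a minimal sketch in plain Python. It is not the code used to build this data set. The function names are hypothetical, the simple log-likelihood formula 2 * (O * log(O / E) - (O - E)) and the log(x + 1) transformation are stated as commonly defined, and reading "compressed into 50 latent dimensions" as keeping the leading dimensions of the projection is an assumption.

```python
import math

def sparse_simple_ll(observed, expected):
    """Sparse simple log-likelihood for one co-occurrence cell.

    Assumes expected > 0. Cells where the observed frequency does not
    exceed the expected frequency are set to 0 (the "sparse" variant).
    """
    if observed <= expected:
        return 0.0
    return 2.0 * (observed * math.log(observed / expected) - (observed - expected))

def log_transform(score):
    """Additional log transformation for non-negative scores."""
    return math.log(score + 1.0)

def normalize(vec):
    """Scale a vector to Euclidean unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def compress(vec, k):
    """Assumed compression step: keep the first k dimensions, then renormalize."""
    return normalize(vec[:k])

def cosine(u, v):
    """Cosine similarity of two unit-length vectors is their dot product."""
    return sum(a * b for a, b in zip(u, v))

# Toy (observed, expected) frequency pairs for one target word.
cells = [(10, 5.0), (2, 4.0), (8, 1.0)]
row = normalize([log_transform(sparse_simple_ll(o, e)) for o, e in cells])

# Toy "latent" vectors, compressed from 3 to 2 dimensions and compared.
u = compress([3.0, 4.0, 12.0], 2)
v = compress([4.0, 3.0, -7.0], 2)
similarity = cosine(u, v)
```

Because every vector is renormalized after compression, the dot product of two rows can be used directly as their cosine similarity.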