Performs evaluation on a word clustering task by comparing a flat clustering solution based on semantic distances with a gold classification.
eval.clustering(task, M, dist.fnc = pair.distances, …,
details = FALSE, format = NA, taskname = NA,
scale.entropy = FALSE, n.clusters = NA,
word.name = "word", class.name = "class")
task: a data frame listing words and their classes, usually in columns named word and class
M: a scored DSM matrix, passed to dist.fnc
dist.fnc: a callback function used to compute distances between word pairs. It is invoked with character vectors containing the first and second components of the word pairs as its first and second arguments, the DSM matrix M as its third argument, plus any additional arguments (…) passed to eval.clustering. The return value must be a numeric vector of appropriate length. If one of the words in a pair is not represented in the DSM, the corresponding distance value should be set to Inf.
…: any further arguments are passed to dist.fnc and can be used e.g. to select a distance measure
details: if TRUE, a detailed report with information on each task item is returned (see "Value" below for details)
format: if the task definition specifies POS-disambiguated lemmas in CWB/Penn format, they can automatically be transformed into other notation conventions; see convert.lemma for details
taskname: optional row label for the short report (details=FALSE)
scale.entropy: whether to scale cluster entropy values to the range [0, 1]
n.clusters: number of clusters. The (very sensible) default is to generate as many clusters as there are classes in the gold standard.
word.name: the name of the column of task containing the words
class.name: the name of the column of task containing the gold standard classes
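As an illustration of the dist.fnc interface described above, a minimal custom callback might look as follows. This is a hypothetical sketch (the name my.dist.fnc is not part of the package) computing angular cosine distances directly from the matrix rows:

```r
## Hypothetical example of a custom dist.fnc callback.
## w1, w2: character vectors with the two components of each word pair;
## M: the scored DSM matrix; any further arguments are ignored here.
my.dist.fnc <- function(w1, w2, M, ...) {
  d <- numeric(length(w1))
  for (i in seq_along(w1)) {
    if (w1[i] %in% rownames(M) && w2[i] %in% rownames(M)) {
      x <- M[w1[i], ]; y <- M[w2[i], ]
      ## cosine distance between the two row vectors
      d[i] <- 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))
    } else {
      d[i] <- Inf  # word not represented in the DSM
    }
  }
  d
}
```

Note that the callback returns Inf for any pair whose words are missing from the matrix, as required by the interface.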
The default short report (details=FALSE) is a data frame with a single row and the columns purity (clustering purity as a percentage), entropy (scaled or unscaled clustering entropy) and missing (number of words not found in the DSM).
The detailed report (details=TRUE) is a data frame with one row for each test word and the following columns:
word: the test word (character)
cluster: the cluster to which the word has been assigned; all unknown words are collected in an additional cluster "n/a"
label: majority label of this cluster (factor with the same levels as gold)
gold: gold standard class of the test word (factor)
correct: whether the majority class assignment is correct (logical)
missing: whether the word was not found in the DSM (logical)
The test words are clustered using the "partitioning around medoids" (PAM) algorithm (Kaufman & Rousseeuw 1990, Ch. 2) based on their semantic distances. The PAM algorithm is used because it works with arbitrary distance measures (including neighbour rank), produces a stable solution (unlike most iterative algorithms) and has been shown to be on par with state-of-the-art spherical k-means clustering (CLUTO) in evaluation studies.
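For illustration only, the PAM step can be reproduced directly with the cluster package (a sketch on toy data, assuming cluster::pam; this is not the evaluation code itself):

```r
library(cluster)  # provides pam(), the PAM implementation

## toy data: six items in two well-separated groups
set.seed(42)
x <- rbind(matrix(rnorm(9, mean = 0), ncol = 3),
           matrix(rnorm(9, mean = 5), ncol = 3))
rownames(x) <- paste0("w", 1:6)

d <- dist(x)          # Euclidean distance matrix between the items
res <- pam(d, k = 2)  # partition the items around 2 medoids
res$clustering        # cluster index assigned to each item
```

Because pam() accepts an arbitrary precomputed dissimilarity object, any symmetric distance measure can be plugged in here.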
Each cluster is automatically assigned a majority label, i.e. the gold standard class occurring most frequently in the cluster. This represents the best possible classification that can be derived from the clustering.
As evaluation metrics, clustering purity (accuracy of the majority classification) and entropy are computed.
The latter is defined as a weighted average over the entropy of the class distribution within each cluster, expressed in bits.
If scale.entropy=TRUE, the value is divided by the overall entropy of the class distribution in the gold standard, scaling it to the range [0, 1].
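These definitions can be spelled out in a few lines of R. The helper below (the name clustering.metrics is illustrative, not part of the package) computes purity and weighted cluster entropy from two parallel vectors of cluster assignments and gold classes:

```r
## purity and (optionally scaled) entropy of a clustering
## against a gold standard classification
clustering.metrics <- function(clusters, gold, scale.entropy = FALSE) {
  tab <- table(clusters, gold)   # contingency table: clusters x classes
  n <- sum(tab)
  ## purity: accuracy of the majority-label classification, in percent
  purity <- sum(apply(tab, 1, max)) / n * 100
  ## entropy: weighted average of within-cluster class entropies (bits)
  H <- sum(apply(tab, 1, function(row) {
    p <- row[row > 0] / sum(row)
    (sum(row) / n) * -sum(p * log2(p))
  }))
  if (scale.entropy) {
    q <- colSums(tab) / n        # gold class distribution
    H <- H / -sum(q[q > 0] * log2(q[q > 0]))
  }
  list(purity = purity, entropy = H)
}
```

For example, clustering.metrics(c(1, 1, 2, 2), c("A", "A", "B", "A")) yields a purity of 75% (majority labels A and A cover three of four words) and an entropy of 0.5 bits.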
NB: The semantic distance measure selected with the extra arguments (…) should be symmetric. In particular, it is not very sensible to specify rank="fwd" or rank="bwd".
NB: Similarity measures are not supported by the current clustering algorithm. Make sure not to call dist.matrix (from dist.fnc) with convert=FALSE!
Suitable gold standard data sets in this package: ESSLLI08_Nouns
Support functions: pair.distances, convert.lemma
eval.clustering(ESSLLI08_Nouns, DSM_Vectors, class.name="class2")