wordspace (version 0.2-0)

eval.clustering: Evaluate DSM on Clustering Task (wordspace)

Description

Performs evaluation on a word clustering task by comparing a flat clustering solution based on semantic distances with a gold classification.

Usage

eval.clustering(task, M, dist.fnc = pair.distances, ...,
                details = FALSE, format = NA, taskname = NA,
                scale.entropy = FALSE, n.clusters = NA,
                word.name = "word", class.name = "class")

Arguments

task

a data frame listing words and their classes, usually in columns named word and class

M

a scored DSM matrix, passed to dist.fnc

dist.fnc

a callback function used to compute distances between word pairs. It will be invoked with character vectors containing the components of the word pairs as first and second argument, the DSM matrix M as third argument, plus any additional arguments (...) passed to eval.clustering. The return value must be a numeric vector of appropriate length. If one of the words in a pair is not represented in the DSM, the corresponding distance value should be set to Inf.

...

any further arguments are passed to dist.fnc and can be used e.g. to select a distance measure

details

if TRUE, a detailed report with information on each task item is returned (see “Value” below for details)

format

if the task definition specifies POS-disambiguated lemmas in CWB/Penn format, they can automatically be transformed into one of several other notation conventions; see convert.lemma for details

taskname

optional row label for the short report (details=FALSE)

scale.entropy

whether to scale cluster entropy values to the range \([0, 1]\)

n.clusters

number of clusters. The (very sensible) default is to generate as many clusters as there are classes in the gold standard.

word.name

the name of the column of task containing words

class.name

the name of the column of task containing gold standard classes
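The contract described for dist.fnc above can be illustrated with a minimal sketch. The toy matrix M and the function name euclid.pairs are hypothetical and not part of the package; the sketch only assumes that words are the row names of the DSM matrix and uses Euclidean distance as an arbitrary symmetric measure.

```r
# Hypothetical toy DSM: rows are words, columns are feature dimensions.
M <- rbind(cat = c(1, 0, 1),
           dog = c(1, 1, 0),
           car = c(0, 3, 1))

# Hypothetical callback following the documented contract:
# w1 and w2 are character vectors of equal length, M is the DSM matrix,
# and any further arguments arrive via "...".
euclid.pairs <- function(w1, w2, M, ...) {
  stopifnot(length(w1) == length(w2))
  known <- w1 %in% rownames(M) & w2 %in% rownames(M)
  d <- rep(Inf, length(w1))            # Inf for pairs with a missing word
  if (any(known)) {
    delta <- M[w1[known], , drop = FALSE] - M[w2[known], , drop = FALSE]
    d[known] <- sqrt(rowSums(delta^2)) # Euclidean distance per pair
  }
  d
}

# "bus" is not in the toy DSM, so the third distance is Inf.
d <- euclid.pairs(c("cat", "cat", "dog"), c("dog", "car", "bus"), M)
```

A function of this shape could presumably be supplied as dist.fnc in place of the default pair.distances, provided the measure it computes is a symmetric distance rather than a similarity.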

Value

The default short report (details=FALSE) is a data frame with a single row and the columns purity (clustering purity as a percentage), entropy (scaled or unscaled clustering entropy) and missing (number of words not found in the DSM).

The detailed report (details=TRUE) is a data frame with one row for each test word and the following columns:

word

the test word (character)

cluster

cluster to which the word has been assigned; all unknown words are collected in an additional cluster "n/a"

label

majority label of this cluster (factor with same levels as gold)

gold

gold standard class of the test word (factor)

correct

whether majority class assignment is correct (logical)

missing

whether word was not found in the DSM (logical)

Details

The test words are clustered using the “partitioning around medoids” (PAM) algorithm (Kaufman & Rousseeuw 1990, Ch. 2) based on their semantic distances. The PAM algorithm is used because it works with arbitrary distance measures (including neighbour rank), produces a stable solution (unlike most iterative algorithms) and has been shown to be on par with state-of-the-art spherical k-means clustering (CLUTO) in evaluation studies.
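To illustrate how PAM operates directly on a distance matrix, here is a small sketch using pam() from the cluster package. The toy vectors and the choice of Manhattan distance are assumptions made for the example; this is not the internal code of eval.clustering.

```r
library(cluster)  # recommended package shipped with standard R distributions

# Hypothetical toy vectors forming two well-separated groups.
X <- rbind(apple  = c(1.0, 0.1),
           pear   = c(0.9, 0.2),
           plum   = c(1.1, 0.0),
           hammer = c(0.1, 1.0),
           saw    = c(0.0, 0.9),
           drill  = c(0.2, 1.1))

D <- dist(X, method = "manhattan")   # any symmetric distance would do
fit <- pam(D, k = 2, diss = TRUE)    # cluster directly from dissimilarities

fit$clustering   # named integer vector: cluster id for each word
fit$medoids      # the words chosen as cluster medoids
```

Because PAM only needs pairwise dissimilarities, the same call works for measures that have no underlying vector representation, which is the property the paragraph above relies on.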

Each cluster is automatically assigned a majority label, i.e. the gold standard class occurring most frequently in the cluster. This represents the best possible classification that can be derived from the clustering.

As evaluation metrics, clustering purity (accuracy of the majority classification) and entropy are computed. The latter is defined as a weighted average over the entropy of the class distribution within each cluster, expressed in bits. If scale.entropy=TRUE, the value is divided by the overall entropy of the class distribution in the gold standard, scaling it to the range \([0, 1]\).
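The majority labelling and the two metrics can be worked through by hand on a toy example. The sketch below uses base R only; the cluster and gold vectors and the helper entropy.bits are hypothetical, and the computation follows the definitions given above rather than the package's actual implementation.

```r
# Hypothetical toy result: cluster ids and gold classes for eight words.
cluster <- c(1, 1, 1, 1, 2, 2, 2, 2)
gold    <- c("a", "a", "a", "b", "b", "b", "b", "b")

tab <- table(cluster, gold)   # clusters in rows, gold classes in columns

# Majority label of each cluster: the most frequent gold class in its row.
majority <- colnames(tab)[apply(tab, 1, which.max)]

# Purity: share of words whose gold class equals the majority label, in percent.
purity <- 100 * sum(apply(tab, 1, max)) / sum(tab)   # 87.5 for this toy data

# Entropy (in bits) of a vector of counts, ignoring empty cells.
entropy.bits <- function(n) {
  p <- n[n > 0] / sum(n)
  -sum(p * log2(p))
}

# Weighted average of the within-cluster entropies, weighted by cluster size.
H.cluster <- sum(rowSums(tab) / sum(tab) * apply(tab, 1, entropy.bits))

# Scaled variant: divide by the entropy of the gold class distribution.
H.scaled <- H.cluster / entropy.bits(table(gold))
```

Here cluster 1 contains three words of class a and one of class b, so it is labelled a and contributes one error, while cluster 2 is pure and contributes zero entropy.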

NB: The semantic distance measure selected with the extra arguments (...) should be symmetric. In particular, it is not very sensible to specify rank="fwd" or rank="bwd".

NB: Similarity measures are not supported by the current clustering algorithm. Make sure not to call dist.matrix (from dist.fnc) with convert=FALSE!

See Also

Suitable gold standard data sets in this package: ESSLLI08_Nouns

Support functions: pair.distances, convert.lemma

Examples

# NOT RUN {
eval.clustering(ESSLLI08_Nouns, DSM_Vectors, class.name="class2")

# }
