The Goodall 4 similarity measure was first introduced in Boriah et al. (2008).
The measure assigns higher similarity when frequent categories match.
When similarity is measured on a single variable, this measure gives the complement to one of the Goodall 3 result.
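The behaviour described above can be illustrated with a minimal sketch. It assumes the per-variable definition given in Boriah et al. (2008): a match on category x scores f(x)(f(x) - 1) / (n(n - 1)), where f(x) is the frequency of x and n the number of cases, and a mismatch scores 0. The helper `goodall4_var` is hypothetical and is not part of the package; the package's own implementation may differ in detail.

```r
# Hypothetical helper (not a nomclust function): per-variable Goodall 4
# similarity between cases i and j of a single nominal variable x,
# assuming the definition in Boriah et al. (2008).
goodall4_var <- function(x, i, j) {
  n <- length(x)
  if (x[i] != x[j]) return(0)      # mismatch: similarity is 0
  f <- sum(x == x[i])              # frequency of the matching category
  f * (f - 1) / (n * (n - 1))      # estimated squared probability of the category
}

x <- c(1, 1, 1, 1, 2, 2, 3)        # toy variable: category 1 is the most frequent

goodall4_var(x, 1, 2)              # match on the frequent category: 12/42, about 0.286
goodall4_var(x, 5, 6)              # match on a rarer category: 2/42, about 0.048
goodall4_var(x, 1, 5)              # mismatch: 0

# Under the same assumption, the Goodall 3 value for a match is the complement
1 - goodall4_var(x, 1, 2)          # 30/42, about 0.714
```

The match on the frequent category receives the larger value, which is the opposite of what Goodall 3 does.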
Hierarchical clustering methods require a proximity (dissimilarity) matrix rather than a similarity matrix as input; therefore, the dissimilarity D is computed from the similarity S according to the equation D = 1/S - 1.
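As a small worked illustration of the transformation D = 1/S - 1, the sketch below applies it to a toy similarity matrix. The values in `S` are made up for illustration and are not output of `good4`.

```r
# Toy similarity matrix with illustrative values (not produced by good4)
S <- matrix(c(1.00, 0.50, 0.25,
              0.50, 1.00, 0.20,
              0.25, 0.20, 1.00),
            nrow = 3, byrow = TRUE)

# Element-wise conversion of similarity to dissimilarity
D <- 1 / S - 1

# S = 1 maps to D = 0, and smaller similarities map to larger dissimilarities
D
```

Note that in plain R arithmetic a similarity of exactly 0 would give `1 / 0 - 1`, which is `Inf`.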
The use and evaluation of clustering with this measure can be found, e.g., in Sulc (2015).
good4(data)
A data frame with cases in rows and variables in columns. Cases are characterized by nominal (categorical) variables coded as numbers.
The function returns a matrix of size n x n, where n is the number of objects in the original data. The matrix contains proximities between all pairs of objects. It can be used in hierarchical cluster analyses (HCA), e.g. in agnes.
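A minimal sketch of passing such a proximity matrix to `agnes` from the cluster package follows. The matrix `D` is a made-up stand-in for the output of `good4`, and the choice of average linkage and of two clusters is an assumption for illustration only. Converting with `as.dist()` before the call is a conservative choice, so that `agnes` receives the dissimilarities in the compact form it documents.

```r
library(cluster)

# Toy 4 x 4 dissimilarity matrix standing in for the output of good4
# (illustrative values only)
D <- matrix(c(0.0, 0.5, 3.0, 3.2,
              0.5, 0.0, 2.8, 3.0,
              3.0, 2.8, 0.0, 0.4,
              3.2, 3.0, 0.4, 0.0),
            nrow = 4, byrow = TRUE)

# Agglomerative clustering on the dissimilarities (average linkage assumed)
hca <- agnes(as.dist(D), diss = TRUE, method = "average")

# Cut the tree into two clusters
groups <- cutree(as.hclust(hca), k = 2)
groups
```

With these toy values, objects 1 and 2 fall into one cluster and objects 3 and 4 into the other.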
Boriah, S., Chandola, V. and Kumar, V. (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254. Available at: http://www-users.cs.umn.edu/~sboriah/PDFs/BoriahBCK2008.pdf.
Goodall, D.W. (1966). A new similarity index based on probability. Biometrics. Vol. 22, No. 4, p. 882.
Sulc, Z. (2015). Application of Goodall's and Lin's similarity measures in hierarchical clustering. In Sbornik praci vedeckeho seminare doktorskeho studia FIS VSE. Praha: Oeconomica, 2015, p. 112-118. Available at: http://fis.vse.cz/wp-content/uploads/2015/01/DD_FIS_2015_CELY_SBORNIK.pdf.
eskin, good1, good2, good3, iof, lin, lin1, morlini, of, sm, ve, vm.
# Load the package and the sample data
library(nomclust)
data(data20)
# Create the proximity (dissimilarity) matrix
prox_goodall_4 <- good4(data20)