The Eskin similarity measure was proposed by Eskin et al. (2002). It is constructed to assign
higher weights to mismatches on variables with more categories (Boriah et al., 2008).
Hierarchical clustering methods require a proximity (dissimilarity) matrix rather than a similarity matrix as
input; therefore, the dissimilarity D is computed from the similarity S according to the equation D = 1/S - 1.
The use and evaluation of clustering with this measure can be found e.g. in (Sulc and Rezankova, 2014).
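The computation described above can be sketched in base R. This is only an illustration, not the nomclust implementation: the helper name `eskin_sketch` and the toy data are hypothetical, and the per-variable rule (similarity 1 on a match, n_k^2 / (n_k^2 + 2) on a mismatch, averaged over variables with equal weights) is the formulation given in Boriah et al. (2008) as understood here. The number of categories n_k is assumed to be the number of distinct values observed in each column.

```r
# Hypothetical helper for illustration; not the function shipped in nomclust.
eskin_sketch <- function(data) {
  x <- as.matrix(data)
  n <- nrow(x)

  # number of distinct categories observed in each variable (assumption)
  n_cat <- apply(x, 2, function(col) length(unique(col)))

  # similarity assigned to a mismatch; closer to 1 for variables with more categories
  mismatch <- n_cat^2 / (n_cat^2 + 2)

  S <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      agree <- x[i, ] == x[j, ]
      # per-variable similarities averaged with equal weights
      S[i, j] <- mean(ifelse(agree, 1, mismatch))
    }
  }

  # convert similarity to dissimilarity: D = 1/S - 1
  1 / S - 1
}

# toy data: 3 objects, 2 nominal variables coded as numbers
toy <- data.frame(a = c(1, 1, 2), b = c(1, 2, 3))
D <- eskin_sketch(toy)
```

Because identical objects have S = 1, the diagonal of D is zero, and objects that differ only on a variable with many categories end up closer together than objects that differ on a variable with few categories.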
eskin(data)
A data frame with cases in rows and variables in columns. Cases are characterized by nominal (categorical) variables coded as numbers.
The function returns a matrix of size n x n, where n is the number of objects in the original data. The matrix contains proximities
between all pairs of objects. It can be used in hierarchical cluster analyses (HCA), e.g. in agnes.
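As a rough sketch of how such a proximity matrix might be passed to `agnes` from the cluster package: the small matrix `toy_prox` below holds made-up values that only stand in for the output of `eskin`, and the choice of average linkage and of two clusters is an assumption made for illustration.

```r
library(cluster)

# hypothetical 4 x 4 dissimilarity matrix standing in for the output of eskin()
toy_prox <- matrix(c(0.0, 0.2, 0.9, 1.0,
                     0.2, 0.0, 0.8, 0.9,
                     0.9, 0.8, 0.0, 0.1,
                     1.0, 0.9, 0.1, 0.0),
                   nrow = 4, byrow = TRUE)

# as.dist() marks the input as dissimilarities, so agnes() does not recompute distances
hca <- agnes(as.dist(toy_prox), method = "average")

# cut the resulting tree into two clusters
groups <- cutree(as.hclust(hca), k = 2)
```

With real data, `toy_prox` would be replaced by the matrix returned by `eskin`.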
Boriah, S., Chandola, V. and Kumar, V. (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Eskin, E., Arnold, A., Prerau, M., Portnoy, L. and Stolfo, S. (2002). A geometric framework for unsupervised anomaly detection. In D. Barbara and S. Jajodia (Eds): Applications of Data Mining in Computer Security, p. 78-100. Norwell: Kluwer Academic Publishers.
Sulc, Z. and Rezankova, H. (2014). Evaluation of recent similarity measures for categorical data. In: AMSE. Wroclaw: Wydawnictwo Uniwersytetu Ekonomicznego we Wroclawiu, p. 249-258. Available at: http://www.amse.ue.wroc.pl/papers/Sulc,Rezankova.pdf.
good1, good2, good3, good4, iof, lin, lin1, morlini, of, sm, ve, vm.
# NOT RUN {
library(nomclust)

# sample data
data(data20)

# creation of the proximity matrix
prox_eskin <- eskin(data20)
# }