of: Occurrence Frequency (OF) Measure

Description

The OF (Occurrence Frequency) measure was originally constructed for text mining, see (Sparck-Jones, 1972); it was later adjusted for categorical variables. It assigns lower similarity to mismatches on less frequent values and higher similarity to mismatches on more frequent values. Hierarchical clustering methods require a proximity (dissimilarity) matrix rather than a similarity matrix as input; therefore, the dissimilarity D is computed from the similarity S according to the equation D = 1/S - 1.
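For reference, the measure can be written out as follows. This is the form given in Boriah et al. (2008), with the natural logarithm assumed; the symbols m (number of variables), n (number of objects), and f(x) (absolute frequency of category x within its variable) are notation introduced here for illustration, not names used by the function.

```latex
% Similarity of objects x_i and x_j on the k-th variable
S_k(x_{ik}, x_{jk}) =
\begin{cases}
  1 & \text{if } x_{ik} = x_{jk}, \\[4pt]
  \dfrac{1}{1 + \ln\dfrac{n}{f(x_{ik})} \cdot \ln\dfrac{n}{f(x_{jk})}} & \text{otherwise}
\end{cases}

% Overall similarity (mean over the m variables) and the derived dissimilarity
S(x_i, x_j) = \frac{1}{m} \sum_{k=1}^{m} S_k(x_{ik}, x_{jk}),
\qquad
D(x_i, x_j) = \frac{1}{S(x_i, x_j)} - 1
```

Two identical objects have S = 1 and therefore D = 0. For a mismatch, a rarer category gives a larger logarithm and hence a smaller S_k.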

Usage

of(data)

Arguments

data

data frame or matrix with cases in rows and variables in columns. Cases are characterized by nominal (categorical) variables coded as numbers.

Value

The function returns a matrix of size n x n, where n is the number of objects in the original data. The matrix contains proximities between all pairs of objects. It can be used in hierarchical cluster analyses (HCA), e.g. with agnes.
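To make the shape of the result concrete, here is a minimal base-R sketch of how such an n x n matrix could be computed and passed to a hierarchical clustering routine. The helper of_sketch and the toy data are hypothetical and written for illustration only; this is not the package's own implementation, and stats::hclust is used in place of agnes so that the sketch needs no extra packages.

```r
# Hypothetical helper (not the package's of()): base-R sketch of the OF dissimilarity.
of_sketch <- function(data) {
  data <- as.matrix(data)
  n <- nrow(data)  # number of objects
  # freq[i, k]: how often object i's category occurs in variable k
  freq <- apply(data, 2, function(col) as.vector(table(col)[as.character(col)]))
  d <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      same <- data[i, ] == data[j, ]
      # Per-variable similarity: 1 on a match, frequency-weighted on a mismatch
      s_k <- ifelse(same, 1, 1 / (1 + log(n / freq[i, ]) * log(n / freq[j, ])))
      # Mean similarity over variables, converted to dissimilarity D = 1/S - 1
      d[i, j] <- d[j, i] <- 1 / mean(s_k) - 1
    }
  }
  d
}

# Made-up toy data: 4 objects, 2 nominal variables coded as numbers
toy <- data.frame(a = c(1, 1, 2, 3), b = c(1, 2, 2, 2))
prox <- of_sketch(toy)
dim(prox)  # one row and one column per object

# The matrix is symmetric with a zero diagonal, so it converts cleanly to a dist object
tree <- hclust(as.dist(prox), method = "average")
```

Converting with as.dist before clustering keeps only the lower triangle, which is the input form that most hierarchical clustering functions in R accept.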

References

Boriah, S., Chandola, V. and Kumar, V. (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254. Available at: http://www-users.cs.umn.edu/~sboriah/PDFs/BoriahBCK2008.pdf.

Sparck-Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. In: Journal of Documentation, 28(1), p. 11-21. Later: Journal of Documentation, 60(5) (2002), p. 493-502.

Sulc, Z. and Rezankova, H. (2014). Evaluation of recent similarity measures for categorical data. In: AMSE. Wroclaw: Wydawnictwo Uniwersytetu Ekonomicznego we Wroclawiu, p. 249-258. Available at: http://www.amse.ue.wroc.pl/papers/Sulc,Rezankova.pdf.

Examples


# Assumption: of() and data20 are provided by the nomclust package
library(nomclust)
# Sample data
data(data20)
# Creation of the proximity matrix
prox_of <- of(data20)
