smdc (version 0.0.2)

simDic: Document Similarity using Dictionary

Description

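This function calculates the similarity between two sets of documents by using a dictionary. Each feature is mapped to a number through the dictionary, the numbers are binned according to breaks, and documents are compared on the resulting frequency distributions.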

Usage

simDic(docMatrix1, docMatrix2, scoreDict, breaks = seq(-1, 1, length = 11), norm = FALSE, method = "cosine", scoreFunc = mean)

Arguments

docMatrix1
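Document matrix whose columns are the feature vectors of individual documents. This matrix must satisfy the following: rownames(docMatrix1) denote feature names, colnames(docMatrix1) denote document names, and every element is numeric. This orientation follows the function body and the Value section, which read feature names from rownames(docMatrix1) and document names from colnames(docMatrix1).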
docMatrix2
Document matrix whose columns are the feature vectors of individual documents. This matrix must satisfy the following: rownames(docMatrix2) denote feature names, colnames(docMatrix2) denote document names, and every element is numeric.
scoreDict
Dictionary matrix that maps features to numbers. It must be a k * 2 matrix: the 1st column holds the features and the 2nd column holds the corresponding numbers. Similarity is calculated from these numbers.
breaks
Vector of break points for the frequency distribution, passed to cut(). The elements must be in ascending order.
norm
Whether to normalize the similarity matrix.
method
Method used to calculate similarity. It is passed to simil() from the proxy package.
scoreFunc
Function that reduces the dictionary numbers matching one feature to a single score. It is called as scoreFunc(value, na.rm = TRUE).
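The argument requirements above can be made concrete with a small set of toy inputs. This is only a sketch: all words, scores, and document names are invented for illustration. The matrices are laid out the way the function body under Examples reads them (feature names from rownames, document names from colnames), and the dictionary is held in a data frame, which is an assumption made here so that the second column stays numeric (a plain character matrix would store the numbers as strings).

```r
# Toy inputs; every name and value here is illustrative, not taken from smdc.
# Rows are features (words), columns are documents.
docMatrix1 <- matrix(
  c(2, 0, 1,    # docA
    0, 3, 1),   # docB
  nrow = 3,
  dimnames = list(c("bad", "good", "okay"), c("docA", "docB"))
)
docMatrix2 <- matrix(
  c(1, 1, 0),   # docC
  nrow = 3,
  dimnames = list(c("awful", "good", "okay"), "docC")
)

# Assumption: a two-column data frame, so that column 2 remains numeric.
# "good" appears twice; scoreFunc decides how duplicates are combined.
scoreDict <- data.frame(
  word  = c("good", "bad", "okay", "awful", "good"),
  score = c(0.8, -0.7, 0.1, -0.9, 0.6),
  stringsAsFactors = FALSE
)

# Same shape as the default value of breaks
breaks <- seq(-1, 1, length.out = 11)

# Check the constraints stated in the argument descriptions
stopifnot(
  is.numeric(docMatrix1), is.numeric(docMatrix2),
  !is.null(rownames(docMatrix1)), !is.null(colnames(docMatrix1)),
  ncol(scoreDict) == 2, is.numeric(scoreDict[, 2]),
  all(diff(breaks) > 0)
)
```

The two matrices do not need identical feature sets; the function body takes the union of both sets of rownames.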

Value

Similarity matrix whose rows represent the documents of docMatrix1 and whose columns represent the documents of docMatrix2. It is an n * m matrix, where n = ncol(docMatrix1) and m = ncol(docMatrix2), and it satisfies the following: rownames(returnValue) = colnames(docMatrix1), colnames(returnValue) = colnames(docMatrix2).
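To illustrate the shape of the returned matrix, the following sketch computes a cosine similarity between every column of one matrix and every column of another. It is a base R stand-in, not the proxy::simil() call used by the function, and the frequency matrices freq1 and freq2 as well as the helper cosine_cross are hypothetical names introduced only for this example.

```r
# Hypothetical frequency matrices: rows are score classes, columns are documents.
freq1 <- matrix(c(2, 0, 1,
                  0, 3, 1), nrow = 3,
                dimnames = list(c("low", "mid", "high"), c("docA", "docB")))
freq2 <- matrix(c(1, 1, 0), nrow = 3,
                dimnames = list(c("low", "mid", "high"), "docC"))

# Cosine similarity between every column of a and every column of b.
cosine_cross <- function(a, b) {
  dots  <- crossprod(a, b)                            # t(a) %*% b, an n x m matrix
  norms <- sqrt(colSums(a^2)) %o% sqrt(colSums(b^2))  # products of column norms
  dots / norms
}

sim <- cosine_cross(freq1, freq2)
dim(sim)       # 2 1
rownames(sim)  # "docA" "docB"
colnames(sim)  # "docC"
```

The result has one row per document of the first matrix and one column per document of the second, matching the n * m layout described above.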

Examples

## The function is currently defined as
simDic <- function (docMatrix1, docMatrix2, scoreDict, breaks = seq(-1, 
    1, length = 11), norm = FALSE, method = "cosine", scoreFunc = mean) 
{
    # simil() is provided by the proxy package
    library("proxy")
    # Sorted union of the feature names (rownames) of both matrices
    words <- unique(rbind(matrix(rownames(docMatrix1)), matrix(rownames(docMatrix2))))
    words <- words[order(words)]
    # One score per word: scoreFunc over its dictionary entries, NA if absent
    wordScores <- rep(NA, length(words))
    for (i in seq_along(words)) {
        cond <- (scoreDict[, 1] == words[i])
        value <- scoreDict[cond, 2]
        if (length(value) != 0) {
            wordScores[i] <- scoreFunc(value, na.rm = TRUE)
        }
    }
    # Bin the word scores into the classes defined by breaks
    names(breaks) <- cut(breaks, breaks)
    wordClass <- cut(wordScores, breaks)
    names(wordClass) <- words
    # conv2Freq() and normalize() are not defined in this listing; they are
    # assumed to be the smdc helpers of the same name
    docFreq1 <- conv2Freq(docMatrix1, wordClass, breaks)
    docFreq2 <- conv2Freq(docMatrix2, wordClass, breaks)
    # Prefix the column names so documents from the two matrices cannot collide
    colnames(docFreq1) <- paste("r_", colnames(docMatrix1), sep = "")
    colnames(docFreq2) <- paste("c_", colnames(docMatrix2), sep = "")
    # Similarity among all documents, then keep the docMatrix1 x docMatrix2 block
    sim <- as.matrix(simil(t(cbind(docFreq1, docFreq2)), method = method))[colnames(docFreq1), 
        colnames(docFreq2)]
    rownames(sim) <- colnames(docMatrix1)
    colnames(sim) <- colnames(docMatrix2)
    if (norm) {
        sim <- normalize(sim)
    }
    return(sim)
}
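
The scoring and binning steps of the function can be tried on their own in base R. The sketch below is an illustration under assumptions, not part of the package: the words and scores are invented, the dictionary is a data frame so that its second column stays numeric, and vapply() replaces the explicit loop. It also shows a property of cut() with default arguments: intervals are of the form (a, b], so a score exactly equal to the smallest break is assigned NA.

```r
# Invented words and dictionary, for illustration only
words <- c("awful", "bad", "good", "okay", "unknown")
scoreDict <- data.frame(
  word  = c("good", "bad", "okay", "awful", "good"),
  score = c(0.8, -0.7, 0.1, -1, 0.6),
  stringsAsFactors = FALSE
)
breaks <- seq(-1, 1, length.out = 11)

# One score per word: the mean over all matching dictionary rows,
# NA when the word is not in the dictionary.
wordScores <- vapply(words, function(w) {
  value <- scoreDict[scoreDict[, 1] == w, 2]
  if (length(value) == 0) NA_real_ else mean(value, na.rm = TRUE)
}, numeric(1))

# Assign each score to a class. With the defaults of cut(), "awful"
# (score exactly -1, the smallest break) gets NA, as does "unknown".
wordClass <- cut(wordScores, breaks)
names(wordClass) <- words

is.na(wordClass)
```

If scores on the lower boundary should be kept, calling cut() with include.lowest = TRUE is the usual base R option; whether that suits smdc's intended behavior is not stated in the source.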
