quanteda (version 0.9.9-50)

similarity: compute similarities between documents and/or features

Description

Compute similarities between documents and/or features from a dfm, using the similarity measures defined in simil. See pr_DB for the available measures and for how to create your own.
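
As a rough illustration of what the two margins mean, the sketch below computes cosine and correlation similarity by hand in base R on a small made-up count matrix (rows as documents, columns as features). The matrix `counts` and the helper `cosine_sim` are hypothetical names invented for this illustration; they are not part of quanteda, and the function's own implementation may differ.

```r
# Toy document-feature counts (made-up data): 3 documents x 4 features
counts <- matrix(c(2, 0, 1, 3,
                   4, 0, 2, 6,
                   0, 5, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("tax", "peace", "work", "freedom")))

# Hypothetical helper: cosine similarity between the rows of a matrix
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))          # Euclidean length of each row
  tcrossprod(m) / outer(norms, norms)  # dot products scaled by the lengths
}

# Document margin: compare rows with each other
doc_cosine <- cosine_sim(counts)
doc_cor    <- cor(t(counts))  # cor() compares columns, so transpose first

# Feature margin: compare columns with each other
feat_cosine <- cosine_sim(t(counts))
```

In this toy data `doc2` is exactly twice `doc1`, so both measures treat the two documents as maximally similar even though their raw counts differ.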

Usage

similarity(x, selection = NULL, n = NULL, margin = c("documents",
  "features"), method = "correlation", sorted = TRUE, normalize = FALSE)

# S4 method for dfm
similarity(x, selection = NULL, n = NULL, margin = c("documents",
  "features"), method = "correlation", sorted = TRUE, normalize = FALSE)

# S3 method for similMatrix
as.matrix(x, ...)

Arguments

x
a dfm object
selection
character or character vector of document names or feature labels from the dfm
n
the top n most similar items will be returned, sorted in descending order. If n is NULL, return all items.
margin
identifies the margin of the dfm on which similarity will be computed: documents for documents or features for word/term features.
method
a valid method for computing similarity from pr_DB
sorted
sort results in descending order if TRUE
normalize
a deprecated argument retained (temporarily) for legacy reasons. To compute similarity on a "normalized" dfm object x, pass weight(x, "relFreq") in its place.
...
unused
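
The effect of sorted and n on a result can be pictured with a short base R sketch. The scores in `sims` are made-up values for a single selected item, and the variable names are hypothetical; this mimics the described behaviour and is not the function's own code.

```r
# Made-up similarity scores of one selected document against four others
sims <- c(docA = 0.42, docB = 0.91, docC = 0.17, docD = 0.66)

# sorted = TRUE: order from most to least similar
sims_sorted <- sort(sims, decreasing = TRUE)

# n = 2: keep only the two most similar items
top_n <- head(sims_sorted, 2)
names(top_n)
```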

Value

a named list with one element per selection label; each element is a sorted named vector of similarity measures.

Examples

# create a dfm from inaugural addresses from Reagan onwards
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), stem = TRUE,
               remove = stopwords("english"))

# compute some document similarities
(tmp <- similarity(presDfm, margin = "documents"))
# output as a matrix
as.matrix(tmp)
# for specific comparisons
similarity(presDfm, "1985-Reagan", n = 5, margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents", method = "cosine")
similarity(presDfm, "2005-Bush", margin = "documents", method = "eJaccard", sorted = FALSE)

# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), n = 20, margin = "features", method = "cosine")
