quanteda (version 0.9.9-65)

textstat_dist: Similarity and distance computation between documents or features

Description

These functions compute matrices of distances and similarities between documents or features from a dfm. They return a dist object, or a matrix if specific targets are selected.
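The difference between a dist object and a plain matrix can be illustrated with base R alone. The sketch below uses stats::dist() on a small made-up count matrix (the data and the names doc1 to doc3 are hypothetical); it is not quanteda's implementation, only an illustration of the two return types.

```r
# Toy document-feature counts: rows are documents, columns are features.
# (Hypothetical data, used only for illustration.)
m <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("tax", "health", "freedom")))

# A "dist" object stores only the lower triangle of the pairwise distances.
d <- dist(m, method = "euclidean")
class(d)

# Converting to a full symmetric matrix makes individual targets easy to pick.
full <- as.matrix(d)
full[, "doc2", drop = FALSE]  # distances from every document to doc2
```

For three documents the dist object holds three values (one per pair), while the converted matrix holds all nine cells, including the zero diagonal.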

Usage

textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = "euclidean", upper = FALSE, diag = FALSE, p = 2)

textstat_simil(x, selection = NULL, margin = c("documents", "features"),
  method = "correlation", upper = FALSE, diag = FALSE)

Arguments

x

a dfm object

selection

character vector of document names or feature labels from x. A "dist" object is returned if selection is NULL; otherwise, a matrix is returned.

margin

identifies the margin of the dfm on which similarity or distance will be computed: documents for documents or features for word/term features.

method

the similarity or distance measure to be used; see Details

upper

whether the upper triangle of the symmetric \(V \times V\) matrix is recorded

diag

whether the diagonal of the distance matrix should be recorded

p

The power of the Minkowski distance.
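As a rough illustration of what the power p controls, the following base R sketch computes a Minkowski distance by hand and with stats::dist(). The helper minkowski() and the toy matrix are hypothetical and are not part of quanteda.

```r
# Toy document-feature counts (hypothetical data).
m <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("tax", "health", "freedom")))

# Minkowski distance of power p between two numeric vectors
# (illustrative helper, not a quanteda function).
minkowski <- function(a, b, p = 2) sum(abs(a - b)^p)^(1 / p)

# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
d_p1 <- minkowski(m["doc1", ], m["doc2", ], p = 1)
d_p2 <- minkowski(m["doc1", ], m["doc2", ], p = 2)

# Base R's dist() computes the same measure for all pairs of rows.
d_all <- dist(m, method = "minkowski", p = 2)
print(d_all)
```

Larger values of p give more weight to the single largest coordinate difference, approaching the "maximum" distance as p grows.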

Details

textstat_dist options are: "euclidean" (default), "Chisquared", "Chisquared2", "hamming", "kullback", "manhattan", "maximum", "canberra", and "minkowski".

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "eJaccard", "dice", "eDice", "simple matching", "hamann", and "faith".
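To make two of these similarity measures concrete, here is a minimal base R sketch of cosine similarity and Pearson correlation between the rows of a small count matrix. It uses the textbook definitions and hypothetical data; whether quanteda computes them in exactly this way is an assumption.

```r
# Toy document-feature counts (hypothetical data).
m <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("tax", "health", "freedom")))

# Cosine similarity: dot product of each pair of rows divided by
# the product of their Euclidean norms.
norms <- sqrt(rowSums(m^2))
cos_sim <- (m %*% t(m)) / outer(norms, norms)

# Pearson correlation between documents. cor() correlates columns,
# so the matrix is transposed to put documents in columns.
cor_sim <- cor(t(m))

print(round(cos_sim, 3))
print(round(cor_sim, 3))
```

Applying the same steps to t(m) would give similarities between features rather than documents, which corresponds to the choice made by the margin argument.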

See Also

textstat_dist, as.list.dist, dist

Examples

# NOT RUN {
# create a dfm from inaugural addresses delivered after 1990
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990), 
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
               
# distances for documents 
(d1 <- textstat_dist(presDfm, margin = "documents"))
as.matrix(d1)

# distances for specific documents
textstat_dist(presDfm, "2017-Trump", margin = "documents")
textstat_dist(presDfm, "2005-Bush", margin = "documents", method = "manhattan")
(d2 <- textstat_dist(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents"))
as.list(d1)

# similarities for documents
(s1 <- textstat_simil(presDfm, method = "cosine", margin = "documents"))
as.matrix(s1)
as.list(s1)

# similarities for specific documents
textstat_simil(presDfm, "2017-Trump", margin = "documents")
textstat_simil(presDfm, "2017-Trump", method = "cosine", margin = "documents")
textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents")

# compute some term similarities
s2 <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine", 
                      margin = "features")
head(as.matrix(s2), 10)
as.list(s2, n = 8)

# }
