textstat_dist: Similarity and distance computation between documents or features

Description

These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected).

Usage

textstat_dist(x, selection = NULL, n = NULL, margin = c("documents",
  "features"), method = "euclidean", upper = FALSE, diag = FALSE, p = 2)
textstat_simil(x, selection = NULL, n = NULL, margin = c("documents",
  "features"), method = "correlation", upper = FALSE, diag = FALSE)

Arguments

x

a dfm object

selection

character vector of document names or feature labels from x. A "dist" object is returned if selection is NULL; otherwise, a matrix is returned.

n

the top n highest-ranking items will be returned. If n is NULL, all items are returned. Useful if the output object will be coerced into a list, for instance if the top n most similar features to a target feature are desired. (See examples.)

margin

identifies the margin of the dfm on which similarity or distance will be computed: "documents" for documents or "features" for word/term features.

method

the similarity or distance measure to be used; see Details

upper

whether the upper triangle of the symmetric \(V \times V\) matrix is recorded

diag

whether the diagonal of the distance matrix should be recorded

p

the power of the Minkowski distance
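The margin and n arguments can be pictured without the package. The following is a minimal base R sketch, not the textstat_simil implementation: the count matrix m is a hypothetical stand-in for a dfm, cor() stands in for the default "correlation" measure, and top_n() is a hypothetical helper that mimics keeping the n highest-ranking items for one target.

```r
# Hypothetical toy counts standing in for a dfm:
# rows are documents, columns are features
m <- matrix(c(2, 0, 1, 3,
              1, 1, 0, 2,
              0, 4, 2, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("fair", "health", "terror", "tax")))

# margin = "documents" compares rows; margin = "features" compares columns
sim_docs  <- cor(t(m))  # 3 x 3 matrix, one row/column per document
sim_feats <- cor(m)     # 4 x 4 matrix, one row/column per feature

# Hypothetical helper: the n items most similar to a target,
# excluding the target itself
top_n <- function(sim, target, n) {
  scores <- sim[target, colnames(sim) != target]
  head(sort(scores, decreasing = TRUE), n)
}

top_n(sim_feats, "fair", 2)
```

Here the two features ranked closest to "fair" are "tax" and "terror", in that order; the package output for a real dfm may be structured differently.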

Details

textstat_dist options are: "euclidean" (default), "Chisquared", "Chisquared2", "hamming", "kullback", "manhattan", "maximum", "canberra", and "minkowski".
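Several of these measures are also available in base R through stats::dist(), which makes it easy to see what they compute on a small case. The sketch below uses a hypothetical two-document count matrix and is only an illustration; it assumes, without confirming, that the package's versions of these measures follow the same textbook definitions.

```r
# Hypothetical toy counts: two documents over four features
m <- rbind(doc1 = c(2, 0, 1, 3),
           doc2 = c(1, 1, 0, 2))

# stats::dist() compares rows and returns a "dist" object
d_euclid    <- dist(m, method = "euclidean")  # sqrt(sum((x - y)^2))
d_manhattan <- dist(m, method = "manhattan")  # sum(abs(x - y))
d_maximum   <- dist(m, method = "maximum")    # max(abs(x - y))

# Minkowski distance with power p: (sum(abs(x - y)^p))^(1/p)
d_mink1 <- dist(m, method = "minkowski", p = 1)  # coincides with Manhattan
d_mink2 <- dist(m, method = "minkowski", p = 2)  # coincides with Euclidean

c(euclidean = as.numeric(d_euclid),
  manhattan = as.numeric(d_manhattan),
  maximum   = as.numeric(d_maximum))
```

With p = 2 the Minkowski distance reduces to the Euclidean distance, which is why p defaults to 2 in the usage shown above.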

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "eJaccard", "dice", "eDice", "simple matching", "hamann", and "faith".
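For intuition, three of these similarity measures can be written out by hand in base R. This is a sketch under stated assumptions: cosine_sim() and jaccard_sim() are hypothetical helper names, the vectors are made-up counts, and the Jaccard version shown is the standard set-based coefficient on presence/absence, which may not match the package's "jaccard" option in every detail.

```r
# Hypothetical feature counts for two documents
doc1 <- c(2, 0, 1, 3)
doc2 <- c(1, 1, 0, 2)

# Cosine similarity: dot product divided by the product of vector lengths
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Set-based Jaccard coefficient: shared features / features present in either
jaccard_sim <- function(x, y) {
  a <- x > 0
  b <- y > 0
  sum(a & b) / sum(a | b)
}

cosine_sim(doc1, doc2)   # 8 / sqrt(84)
cor(doc1, doc2)          # Pearson correlation
jaccard_sim(doc1, doc2)  # 2 shared features out of 4 present in either
```

Cosine similarity depends only on the angle between the count vectors, so it is unaffected by document length, whereas the set-based Jaccard coefficient ignores counts entirely.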

Examples

# create a dfm from inaugural addresses delivered after 1990
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990), 
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
               
# distances for documents 
(d1 <- textstat_dist(presDfm, margin = "documents"))
as.matrix(d1)

# distances for specific documents
textstat_dist(presDfm, "2017-Trump", margin = "documents")
textstat_dist(presDfm, "2005-Bush", margin = "documents", method = "manhattan")
(d2 <- textstat_dist(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents"))
as.list(d1)

# similarities for documents
(s1 <- textstat_simil(presDfm, method = "cosine", margin = "documents"))
as.matrix(s1)
as.list(s1)

# similarities for specific documents
textstat_simil(presDfm, "2017-Trump", margin = "documents")
textstat_simil(presDfm, "2017-Trump", method = "cosine", margin = "documents")
textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents")

# compute some term similarities
(s2 <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine", 
                      margin = "features", n = 8))
as.list(s2)
