textstat_simil: Similarity and distance computation between documents or features

Description

These functions compute matrixes of distances and similarities between documents or features from a dfm() and return a matrix of similarities or distances in a sparse format. These methods are fast and robust because they operate directly on the sparse dfm objects. The output can easily be coerced to an ordinary matrix, a data.frame of pairwise comparisons, or a dist format.

Usage

textstat_simil(
  x,
  y = NULL,
  selection = NULL,
  margin = c("documents", "features"),
  method = c("correlation", "cosine", "jaccard", "ejaccard", "dice", "edice", "hamman",
    "simple matching"),
  min_simil = NULL,
  ...
)
textstat_dist(
  x,
  y = NULL,
  selection = NULL,
  margin = c("documents", "features"),
  method = c("euclidean", "manhattan", "maximum", "canberra", "minkowski"),
  p = 2,
  ...
)
# S3 method for textstat_proxy
as.list(x, sorted = TRUE, n = NULL, diag = FALSE, ...)
# S3 method for textstat_proxy
as.data.frame(
  x,
  row.names = NULL,
  optional = FALSE,
  diag = FALSE,
  upper = FALSE,
  ...
)

Arguments

x, y

a dfm objects; y is an optional target matrix matching x in the margin on which the similarity or distance will be computed.

selection

(deprecated - use y instead).

margin

identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features.

method

character; the method identifying the similarity or distance measure to be used; see Details.

min_simil

numeric; a threshold for the similarity values below which similarity values will not be returned

...

unused

The power of the Minkowski distance.

sorted

sort results in descending order if TRUE

the top n highest-ranking items will be returned. If n is NULL, return all items.

diag

logical; if FALSE, exclude the item's comparison with itself

row.names

NULL or a character vector giving the row names for the data frame. Missing values are not allowed.

optional

logical. If TRUE, setting row names and converting column names (to syntactic names: see make.names) is optional. Note that all of R's base package as.data.frame() methods use optional only for column names treatment, basically with the meaning of data.frame(*, check.names = !optional). See also the make.names argument of the matrix method.

upper

logical; if TRUE, return pairs as both (A, B) and (B, A)

Value

A sparse matrix from the Matrix package that will be symmetric unless y is specified.

These can be transformed easily into a list format using as.list(), which returns a list for each unique element of the second of the pairs, as.dist() to be transformed into a dist object, or as.matrix() to convert it into an ordinary matrix.

as.data.list for a textstat_simil or textstat_dist object returns a list equal in length to the columns of the simil or dist object, with the rows and their values as named elements. By default, this list excludes same-time pairs (when diag = FALSE) and sorts the values in descending order (when sorted = TRUE).

as.data.frame for a textstat_simil or textstat_dist object returns a data.frame of pairwise combinations and the and their similarity or distance value.

Details

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", and "hamman".

textstat_dist options are: "euclidean" (default), "manhattan", "maximum", "canberra", and "minkowski".

Examples

Run this code

# NOT RUN {
# similarities for documents
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000),
             remove_punct = TRUE, remove = stopwords("english"))
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
as.list(tstat1)
as.list(tstat1, diag = TRUE)

# min_simil
(tstat2 <- textstat_simil(dfmat, method = "cosine", margin = "documents", min_simil = 0.6))
as.matrix(tstat2)

# similarities for for specific documents
textstat_simil(dfmat, dfmat["2017-Trump", ], margin = "documents")
textstat_simil(dfmat, dfmat["2017-Trump", ], method = "cosine", margin = "documents")
textstat_simil(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents")

# compute some term similarities
tstat3 <- textstat_simil(dfmat, dfmat[, c("fair", "health", "terror")], method = "cosine",
                         margin = "features")
head(as.matrix(tstat3), 10)
as.list(tstat3, n = 6)


# distances for documents
(tstat4 <- textstat_dist(dfmat, margin = "documents"))
as.matrix(tstat4)
as.list(tstat4)
as.dist(tstat4)

# distances for specific documents
textstat_dist(dfmat, dfmat["2017-Trump", ], margin = "documents")
(tstat5 <- textstat_dist(dfmat, dfmat[c("2009-Obama" , "2013-Obama"), ], margin = "documents"))
as.matrix(tstat5)
as.list(tstat5)

# }
# NOT RUN {
# plot a dendrogram after converting the object into distances
plot(hclust(as.dist(tstat4)))
# }