Learn R Programming

quanteda (version 2.1.2)

textstat_simil: Similarity and distance computation between documents or features

Description

These functions compute matrixes of distances and similarities between documents or features from a dfm() and return a matrix of similarities or distances in a sparse format. These methods are fast and robust because they operate directly on the sparse dfm objects. The output can easily be coerced to an ordinary matrix, a data.frame of pairwise comparisons, or a dist format.

Usage

textstat_simil(
  x,
  y = NULL,
  selection = NULL,
  margin = c("documents", "features"),
  method = c("correlation", "cosine", "jaccard", "ejaccard", "dice", "edice", "hamman",
    "simple matching"),
  min_simil = NULL,
  ...
)

textstat_dist( x, y = NULL, selection = NULL, margin = c("documents", "features"), method = c("euclidean", "manhattan", "maximum", "canberra", "minkowski"), p = 2, ... )

# S3 method for textstat_proxy as.list(x, sorted = TRUE, n = NULL, diag = FALSE, ...)

# S3 method for textstat_proxy as.data.frame( x, row.names = NULL, optional = FALSE, diag = FALSE, upper = FALSE, ... )

Arguments

x, y

a dfm objects; y is an optional target matrix matching x in the margin on which the similarity or distance will be computed.

selection

(deprecated - use y instead).

margin

identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features.

method

character; the method identifying the similarity or distance measure to be used; see Details.

min_simil

numeric; a threshold for the similarity values below which similarity values will not be returned

...

unused

p

The power of the Minkowski distance.

sorted

sort results in descending order if TRUE

n

the top n highest-ranking items will be returned. If n is NULL, return all items.

diag

logical; if FALSE, exclude the item's comparison with itself

row.names

NULL or a character vector giving the row names for the data frame. Missing values are not allowed.

optional

logical. If TRUE, setting row names and converting column names (to syntactic names: see make.names) is optional. Note that all of R's base package as.data.frame() methods use optional only for column names treatment, basically with the meaning of data.frame(*, check.names = !optional). See also the make.names argument of the matrix method.

upper

logical; if TRUE, return pairs as both (A, B) and (B, A)

Value

A sparse matrix from the Matrix package that will be symmetric unless y is specified.

These can be transformed easily into a list format using as.list(), which returns a list for each unique element of the second of the pairs, as.dist() to be transformed into a dist object, or as.matrix() to convert it into an ordinary matrix.

as.data.list for a textstat_simil or textstat_dist object returns a list equal in length to the columns of the simil or dist object, with the rows and their values as named elements. By default, this list excludes same-time pairs (when diag = FALSE) and sorts the values in descending order (when sorted = TRUE).

as.data.frame for a textstat_simil or textstat_dist object returns a data.frame of pairwise combinations and the and their similarity or distance value.

Details

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", and "hamman".

textstat_dist options are: "euclidean" (default), "manhattan", "maximum", "canberra", and "minkowski".

See Also

stats::as.dist()

Examples

Run this code
# NOT RUN {
# similarities for documents
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000),
             remove_punct = TRUE, remove = stopwords("english"))
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
as.list(tstat1)
as.list(tstat1, diag = TRUE)

# min_simil
(tstat2 <- textstat_simil(dfmat, method = "cosine", margin = "documents", min_simil = 0.6))
as.matrix(tstat2)

# similarities for for specific documents
textstat_simil(dfmat, dfmat["2017-Trump", ], margin = "documents")
textstat_simil(dfmat, dfmat["2017-Trump", ], method = "cosine", margin = "documents")
textstat_simil(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents")

# compute some term similarities
tstat3 <- textstat_simil(dfmat, dfmat[, c("fair", "health", "terror")], method = "cosine",
                         margin = "features")
head(as.matrix(tstat3), 10)
as.list(tstat3, n = 6)


# distances for documents
(tstat4 <- textstat_dist(dfmat, margin = "documents"))
as.matrix(tstat4)
as.list(tstat4)
as.dist(tstat4)

# distances for specific documents
textstat_dist(dfmat, dfmat["2017-Trump", ], margin = "documents")
(tstat5 <- textstat_dist(dfmat, dfmat[c("2009-Obama" , "2013-Obama"), ], margin = "documents"))
as.matrix(tstat5)
as.list(tstat5)

# }
# NOT RUN {
# plot a dendrogram after converting the object into distances
plot(hclust(as.dist(tstat4)))
# }

Run the code above in your browser using DataLab