These functions compute matrices of distances and similarities between
documents or features from a dfm and return a dist object (or a matrix if
specific targets are selected). They are fast and robust because they operate
directly on the sparse dfm objects.
textstat_dist_old(
x,
selection = NULL,
margin = c("documents", "features"),
method = "euclidean",
upper = FALSE,
diag = FALSE,
p = 2
)

textstat_simil_old(
x,
selection = NULL,
margin = c("documents", "features"),
method = "correlation",
upper = FALSE,
diag = FALSE
)
x: a dfm object
selection: a valid index for document or feature names from x, to be selected for comparison
margin: identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features
method: the similarity or distance measure to be used; see Details
upper: whether the upper triangle of the symmetric \(V \times V\) matrix is recorded
diag: whether the diagonal of the distance matrix should be recorded
p: the power of the Minkowski distance
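To illustrate what the power p controls, the Minkowski distance between two vectors is the p-th root of the sum of absolute differences raised to the power p, so p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. The sketch below uses base R's stats::dist() on a small dense matrix, not the functions documented here; the matrix m and its row names are made up for illustration.

```r
# Toy document-by-feature counts (hypothetical data, dense base R matrix)
m <- rbind(doc1 = c(1, 0, 2),
           doc2 = c(0, 3, 2))

# Minkowski distance with p = 1 equals the Manhattan distance: 1 + 3 + 0
d_p1 <- dist(m, method = "minkowski", p = 1)

# Minkowski distance with p = 2 equals the Euclidean distance: sqrt(1 + 9 + 0)
d_p2 <- dist(m, method = "minkowski", p = 2)

# Cross-check against the dedicated base R methods
all.equal(as.numeric(d_p1), as.numeric(dist(m, method = "manhattan")))
all.equal(as.numeric(d_p2), as.numeric(dist(m, method = "euclidean")))
```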
textstat_simil and textstat_dist return dist class objects if selection is
NULL; otherwise, a matrix is returned matching distances to the documents or
features identified in the selection.
textstat_dist options are: "euclidean" (default), "chisquared",
"chisquared2", "kullback", "manhattan", "maximum", "canberra", and
"minkowski".
textstat_simil options are: "correlation" (default), "cosine", "jaccard",
"ejaccard", "dice", "edice", "simple matching", "hamman", and "faith".
The "chisquared" metric is from Legendre, P., & Gallagher, E. D. (2001).
"Ecologically meaningful transformations for ordination of species data".
Oecologia, 129(2), 271-280. doi.org/10.1007/s004420100716
The "chisquared2" metric is the "Quadratic-Chi" measure from Pele, O., &
Werman, M. (2010). "The Quadratic-Chi Histogram Distance Family". In Computer
Vision - ECCV 2010 (Vol. 6312, pp. 749-762). Berlin, Heidelberg: Springer.
doi.org/10.1007/978-3-642-15552-9_54
"kullback"
is the Kullback-Leibler distance, which assumes that
\(P(x_i) = 0\) implies \(P(y_i) = 0\). When both \(P(x_i)\) and
\(P(y_i)\) equal zero, the term \(P(x_i) \log(P(x_i)/P(y_i))\) is
taken to be zero, its limit value. The formula is:
$$\sum_i P(x_i) \log\left(\frac{P(x_i)}{P(y_i)}\right)$$
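As a rough illustration of the formula, the following base R sketch computes the Kullback-Leibler distance between two count vectors. The helper kl_distance() is hypothetical and is not part of the package; it assumes that counts are first normalised to proportions and that terms with \(P(x_i) = 0\) are dropped because they contribute zero. The package's internal computation may differ in these details.

```r
# Hypothetical helper, not part of the package: Kullback-Leibler distance
# between two non-negative count vectors of equal length.
kl_distance <- function(x, y) {
  p <- x / sum(x)   # assumed normalisation of counts to P(x_i)
  q <- y / sum(y)   # assumed normalisation of counts to P(y_i)
  keep <- p > 0     # terms with P(x_i) = 0 are taken as zero and dropped
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Toy feature counts for two documents (made-up data)
x <- c(2, 1, 1, 0)
y <- c(1, 1, 1, 1)

kl_xy <- kl_distance(x, y)  # only the first term is non-zero: 0.5 * log(2)
kl_xx <- kl_distance(x, x)  # identical distributions: every log term is log(1)
```

Note that the measure is not symmetric: swapping x and y generally gives a different value, and here kl_distance(y, x) is infinite because x has a zero where y does not.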
All other measures are described in the proxy package.