Compute cross-distances between collections of \(n\)-gram profiles.
textcat_xdist(x, p = NULL, method = "CT", ..., options = list())
x: a textcat profile db (see textcat_profile_db), or an R object of text documents extractable via as.character.
p: NULL (default), or as for x. The default is equivalent to taking p as x (but more efficient).
method: a character string specifying a built-in method, or a user-defined function for computing distances between \(n\)-gram profiles, or NULL (corresponding to the current value of the textcat option xdist_method; see textcat_options). See Details for available built-in methods.
...: options to be passed to the method for computing distances.
options: a list of such options.
If x (or p) is not a profile db, the \(n\)-gram profiles of the individual text documents extracted from it are computed using the profile method and options in p if this is a profile db, and using the current textcat profile method and options otherwise.
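As a minimal sketch of the case described above, the call below passes plain character strings as x and a profile db as p, so the document profiles should be computed with the profile method and options stored in p. The two sample sentences and the variable names are made up for illustration; TC_byte_profiles is the profile db used in the example at the end of this page.

```r
library("textcat")

# Two hypothetical sample documents (plain ASCII text).
docs <- c("This is a short sentence written in English.",
          "Dies ist ein kurzer Satz in deutscher Sprache.")

# x is a character vector and p is a profile db: the profiles for docs
# are computed on the fly using the profile method and options of p.
d <- textcat_xdist(docs, TC_byte_profiles)

# One row per document, one column per profile in p.
dim(d)

# Name of the closest profile for each document.
colnames(d)[apply(d, 1L, which.min)]
```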
Currently, the following distance methods for \(n\)-gram profiles are available.
"CT": the out-of-place measure of Cavnar and Trenkle.
"ranks": a variant of the Cavnar/Trenkle measure based on the aggregated absolute difference of the ranks of the combined \(n\)-grams in the two profiles.
"ALPD": the sum of the absolute differences in \(n\)-gram log frequencies.
"KLI": the Kullback-Leibler I-divergence \(I(p, q) = \sum_i p_i \log(p_i/q_i)\) of the \(n\)-gram frequency distributions \(p\) and \(q\) of the two profiles.
"KLJ": the Kullback-Leibler J-divergence \(J(p, q) = \sum_i (p_i - q_i) \log(p_i/q_i)\), the symmetrized variant \(I(p, q) + I(q, p)\) of the I-divergences.
"JS": the Jensen-Shannon divergence between the \(n\)-gram frequency distributions.
"cosine": the cosine dissimilarity between the profiles, i.e., one minus the inner product of the frequency vectors normalized to Euclidean length one (and filled with zeros for entries missing in one of the vectors).
"Dice": the Dice dissimilarity, i.e., the fraction of \(n\)-grams present in one of the profiles only.
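The two set- and vector-based measures can be illustrated in base R on toy data. This is only a sketch of the definitions given above, not the package's own code: the profiles x and p, the helper align(), and the choice of denominator for the Dice measure (the sum of the two profile sizes, as in the standard Dice coefficient) are assumptions made for illustration.

```r
# Toy profiles as named vectors of n-gram counts (hypothetical data).
x <- c("_t" = 5, "th" = 4, "he" = 3, "e_" = 2)
p <- c("th" = 6, "he" = 2, "er" = 3)

# Hypothetical helper: put a profile on a common set of n-grams,
# with zeros for n-grams missing from that profile.
align <- function(v, grams) {
  out <- setNames(numeric(length(grams)), grams)
  out[names(v)] <- v
  out
}

grams <- union(names(x), names(p))
fx <- align(x, grams)
fp <- align(p, grams)

# Cosine dissimilarity: one minus the inner product of the
# frequency vectors scaled to Euclidean length one.
cosine_diss <- 1 - sum(fx * fp) / (sqrt(sum(fx^2)) * sqrt(sum(fp^2)))

# Dice dissimilarity (assumed standard form): one minus twice the
# number of shared n-grams over the sum of the profile sizes.
shared <- length(intersect(names(x), names(p)))
dice_diss <- 1 - 2 * shared / (length(x) + length(p))

cosine_diss
dice_diss
```

Here three of the seven profile entries ("_t", "e_", "er") occur in only one profile, which is what the Dice value reflects.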
For the measures based on distances of frequency distributions, \(n\)-grams of the two profiles are combined, and missing \(n\)-grams are given a small positive absolute frequency, which can be controlled by option eps and defaults to 1e-6.
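The role of eps can be sketched in base R as follows. The toy profiles, the helper to_dist(), the normalization to relative frequencies, and the equal-weight form of the Jensen-Shannon divergence are assumptions for illustration; the package may differ in these details.

```r
# Toy profiles as named vectors of n-gram counts (hypothetical data).
x <- c("_t" = 5, "th" = 4, "he" = 3, "e_" = 2)
y <- c("th" = 6, "he" = 2, "er" = 3)

# Hypothetical helper: combine the n-grams of both profiles, give
# missing n-grams the small absolute frequency eps, then normalize
# to a frequency distribution (normalization is an assumption).
to_dist <- function(v, grams, eps = 1e-6) {
  out <- setNames(rep(eps, length(grams)), grams)
  out[names(v)] <- v
  out / sum(out)
}

grams <- union(names(x), names(y))
p <- to_dist(x, grams)
q <- to_dist(y, grams)

# Kullback-Leibler I-divergence: sum_i p_i log(p_i / q_i).
kli <- function(p, q) sum(p * log(p / q))

# Kullback-Leibler J-divergence: sum_i (p_i - q_i) log(p_i / q_i).
klj <- function(p, q) sum((p - q) * log(p / q))

# Jensen-Shannon divergence in its usual equal-weight form (assumed).
js <- function(p, q) {
  m <- (p + q) / 2
  (kli(p, m) + kli(q, m)) / 2
}

kli(p, q)
klj(p, q)   # agrees with kli(p, q) + kli(q, p) up to rounding
js(p, q)
```

Because eps keeps every entry strictly positive, the logarithms stay finite even for n-grams that occur in only one of the profiles.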
Options given in ... and options are combined, and merged with the default xdist options specified by the textcat option xdist_options using exact name matching.
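As a hedged illustration of the two ways of supplying options, the sketch below passes eps once through ... and once through the options list. Since both are combined, the two calls would be expected to give the same distances. The sample sentences and the value 1e-4 are arbitrary choices for illustration.

```r
library("textcat")

# Two hypothetical sample documents.
docs <- c("This is a short sentence written in English.",
          "Dies ist ein kurzer Satz in deutscher Sprache.")

# eps passed through '...'; the name must be spelled out in full,
# because options are merged using exact name matching.
d1 <- textcat_xdist(docs, method = "KLI", eps = 1e-4)

# The same option passed through the 'options' list.
d2 <- textcat_xdist(docs, method = "KLI", options = list(eps = 1e-4))

all.equal(d1, d2)
```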
## Compute cross-distances between the TextCat byte profiles using the
## CT out-of-place measure.
d <- textcat_xdist(TC_byte_profiles)
## Visualize results of hierarchical cluster analysis on the distances.
plot(hclust(as.dist(d)), cex = 0.7)