Fast similarity/distance computation function for large sparse matrices. You
can floor small similarity values to save computation time and storage
space by an arbitrary threshold (min_simil) or rank (rank). You
can specify the number of threads for parallel computing via
options(proxyC.threads).
simil(
  x,
  y = NULL,
  margin = 1,
  method = c("cosine", "correlation", "jaccard", "ejaccard", "fjaccard", "dice", "edice",
    "hamann", "faith", "simple matching"),
  min_simil = NULL,
  rank = NULL,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  digits = 14
)

dist(
  x,
  y = NULL,
  margin = 1,
  method = c("euclidean", "chisquared", "kullback", "jeffreys", "jensen", "manhattan",
    "maximum", "canberra", "minkowski", "hamming"),
  p = 2,
  smooth = 0,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  digits = 14
)
x: matrix or Matrix object. Dense matrices are converted to the CsparseMatrix-class internally.
y: if a matrix or Matrix object is provided, proximity between documents or features in x and y is computed.
margin: integer indicating the margin of similarity/distance computation: 1 indicates rows and 2 indicates columns.
method: method to compute similarity or distance.
min_simil: the minimum similarity value to be recorded.
rank: an integer value specifying the top-n highest similarity values to be recorded.
drop0: if TRUE, zero values are removed regardless of min_simil or rank.
diag: if TRUE, only compute diagonal elements of the similarity/distance matrix; useful when comparing corresponding rows or columns of x and y.
use_nan: if TRUE, return NaN if the standard deviation of a vector is zero when method is "correlation", or if all the values are zero in a vector when method is "cosine", "chisquared", "kullback", "jeffreys" or "jensen". Note that use of NaN makes the similarity/distance matrix denser and therefore larger in RAM. If FALSE, return zero in the same situations as above. If NULL (default), also return zero but generate a warning.
digits: determines rounding of small values towards zero. Use primarily to correct rounding errors in C++. See zapsmall.
p: the power (exponent) for Minkowski distance.
smooth: adds a fixed value to all the cells to avoid division by zero. Only used when method is "chisquared", "kullback", "jeffreys" or "jensen".
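As a rough illustration of how min_simil and rank reduce what is stored, the sketch below compares the number of non-zero entries with and without them. The matrix size, density, threshold of 0.3 and rank of 3 are arbitrary values chosen for the example, and use_nan = FALSE is set only to avoid the warning described above.

```r
library(proxyC)
library(Matrix)

set.seed(1)
# Random sparse matrix; rows play the role of documents (illustrative data)
mt <- rsparsematrix(50, 20, density = 0.3)

# All cosine similarities between rows
s_full <- simil(mt, method = "cosine", use_nan = FALSE)

# Record only similarities of at least 0.3
s_min <- simil(mt, method = "cosine", min_simil = 0.3, use_nan = FALSE)

# Record only the 3 highest similarities
s_top <- simil(mt, method = "cosine", rank = 3, use_nan = FALSE)

# Thresholding should leave fewer stored values than the full result
c(full = nnzero(s_full), min_simil = nnzero(s_min), rank = nnzero(s_top))
```

Which of the two arguments is more convenient depends on whether an absolute cut-off or a fixed number of neighbours is wanted.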
Similarity:
cosine: cosine similarity
correlation: Pearson's correlation
jaccard: Jaccard coefficient
ejaccard: the real value version of jaccard
fjaccard: Fuzzy Jaccard coefficient
dice: Dice coefficient
edice: the real value version of dice
hamann: Hamann similarity
faith: Faith similarity
simple matching: the percentage of common elements
Distance:
euclidean: Euclidean distance
chisquared: chi-squared distance
kullback: Kullback–Leibler divergence
jeffreys: Jeffreys divergence
jensen: Jensen–Shannon divergence
manhattan: Manhattan distance
maximum: the largest difference between values
canberra: Canberra distance
minkowski: Minkowski distance
hamming: Hamming distance
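To make two of the measures above concrete, the following sketch checks the "cosine" and "euclidean" results against the textbook formulas computed in base R. The 3 x 3 matrix is made up for illustration and has no all-zero rows; this is a sanity check on a toy input, not a description of how the package computes the values internally.

```r
library(proxyC)
library(Matrix)

# Toy data: three rows, none of them all-zero (values are illustrative)
m <- matrix(c(1, 0, 2,
              0, 3, 1,
              4, 1, 0), nrow = 3, byrow = TRUE)
sm <- Matrix(m, sparse = TRUE)

# Cosine similarity between rows, computed by hand
norms <- sqrt(rowSums(m^2))
cos_manual <- (m %*% t(m)) / outer(norms, norms)
cos_proxyc <- as.matrix(proxyC::simil(sm, method = "cosine", margin = 1))

# Euclidean distance between rows, compared with stats::dist()
euc_manual <- as.matrix(stats::dist(m, method = "euclidean"))
euc_proxyc <- as.matrix(proxyC::dist(sm, method = "euclidean", margin = 1))

all.equal(cos_proxyc, cos_manual, check.attributes = FALSE)
all.equal(euc_proxyc, euc_manual, check.attributes = FALSE)
```

The explicit proxyC:: and stats:: prefixes are used because both packages export a function named dist.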
See the vignette for how the similarity and distance are computed:
vignette("measures", package = "proxyC")
Parallel computing is performed using Intel oneAPI Threading Building Blocks.
The number of threads for parallel computing should be specified via
options(proxyC.threads) before calling the functions. If the value is -1,
all the available threads will be used. Unless the option is set, the
number of threads is limited by the environment variables
(OMP_THREAD_LIMIT or RCPP_PARALLEL_NUM_THREADS) to comply with CRAN
policy and offer backward compatibility.
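A minimal sketch of setting the thread option before a call is shown below. The value 2 and the matrix dimensions are arbitrary choices for illustration; the previous option value is saved and restored so the setting does not leak into the rest of the session.

```r
library(proxyC)
library(Matrix)

# Set the option before calling simil()/dist(); options() returns the old value
old <- options(proxyC.threads = 2)  # 2 is an illustrative value; -1 uses all threads

set.seed(1)
mt <- rsparsematrix(1000, 200, density = 0.05)
s <- simil(mt, method = "cosine", use_nan = FALSE)
dim(s)

# Restore whatever was set before
options(old)
```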
zapsmall
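library(proxyC)
# cosine similarity between rows; show the top-left 5 x 5 block
mt <- Matrix::rsparsematrix(100, 100, 0.01)
simil(mt, method = "cosine")[1:5, 1:5]
# Euclidean distance between rows; show the top-left 5 x 5 block
mt <- Matrix::rsparsematrix(100, 100, 0.01)
dist(mt, method = "euclidean")[1:5, 1:5]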
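A further sketch, not part of the original examples, shows the diag argument for comparing corresponding rows of two matrices. The matrices x2 and y2 are hypothetical data of equal dimensions, and the by-hand check assumes no row is entirely zero, which is very likely but not guaranteed for random input.

```r
library(proxyC)
library(Matrix)

set.seed(1)
# Two matrices with the same shape; row i of x2 is compared with row i of y2
x2 <- rsparsematrix(10, 20, density = 0.5)
y2 <- rsparsematrix(10, 20, density = 0.5)

# Only the diagonal of the 10 x 10 similarity matrix is computed
s_diag <- simil(x2, y2, margin = 1, method = "cosine", diag = TRUE, use_nan = FALSE)
pairwise <- diag(as.matrix(s_diag))

# Row-wise cosine similarity computed by hand for comparison
xm <- as.matrix(x2)
ym <- as.matrix(y2)
manual <- rowSums(xm * ym) / sqrt(rowSums(xm^2) * rowSums(ym^2))

all.equal(pairwise, manual)
```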