Find the nearest neighbours of a term vector in a DSM, given either as a scored cooccurrence matrix or a pre-computed distance matrix. The target term can be selected by name (in which case the cooccurrence or distance matrix must be labelled appropriately) or specified as a vector (if the DSM is given as a matrix).
nearest.neighbours(M, term, n = 10, M2 = NULL, byrow = TRUE,
drop = TRUE, skip.missing = FALSE, dist.matrix = FALSE,
..., batchsize=50e6, verbose=FALSE)
A list with one entry for each target term found in M, giving
dist.matrix=FALSE (default): the nearest neighbours as a numeric vector of distances or similarities labelled with the corresponding terms and ordered by distance
dist.matrix=TRUE: a full distance or similarity matrix for the target term and its nearest neighbours (as an object of class dist.matrix). An additional attribute selected contains a logical vector indicating the position of the target term in the matrix.
If drop=TRUE, a list containing only a single target term will be simplified to a plain vector or distance matrix.
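The different return shapes can be inspected directly. The following is a minimal sketch, assuming the wordspace package is installed and using its bundled DSM_Vectors matrix (the same data as in the examples further down); the variable names are purely illustrative.

```r
library(wordspace)

# Single target term with drop=TRUE (the default): a plain named numeric vector
nn <- nearest.neighbours(DSM_Vectors, "apple_N", n = 5)
is.numeric(nn)   # distances to the neighbours
names(nn)        # the neighbour terms, nearest first

# drop=FALSE keeps the list wrapper even for a single target term
nn.list <- nearest.neighbours(DSM_Vectors, "apple_N", n = 5, drop = FALSE)
is.list(nn.list)
names(nn.list)

# dist.matrix=TRUE returns a matrix covering the target term and its neighbours;
# the "selected" attribute marks the position of the target term
dm <- nearest.neighbours(DSM_Vectors, "apple_N", n = 5, dist.matrix = TRUE)
dim(dm)
attr(dm, "selected")
```

Passing drop=FALSE is usually the safer choice in scripts that loop over a variable number of target terms, since the result then always has the same type.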
M: either a dense or sparse matrix representing a scored DSM (or an object of class dsm), or a pre-computed distance matrix returned by dist.matrix (as an object of class dist.matrix). Note that the compact representation produced by the dist function (class dist) is not accepted.
term: either a character vector specifying one or more target terms for which nearest neighbours will be found, or a matrix specifying the target vectors directly. A plain vector is interpreted as a single-row matrix.
n: an integer giving the number of nearest neighbours to be returned for each target term
M2: an optional dense or sparse matrix (or object of class dsm). If specified, nearest neighbours are found among the rows (default) or columns (byrow=FALSE) of M2, allowing for NN search in a cross-distance setting.
byrow: whether target terms are looked up in rows (default) or columns (byrow=FALSE) of M. NB: Target vectors in the term argument are always given as row vectors, even if byrow=FALSE.
drop: if TRUE, the return value is simplified to a vector (or distance matrix) if it contains nearest neighbours for exactly one target term (default). Set drop=FALSE to ensure that nearest.neighbours always returns a list.
skip.missing: if TRUE, silently ignore target terms not found in the DSM or distance matrix. By default (skip.missing=FALSE) an error is raised in this case.
dist.matrix: if TRUE, return a full distance matrix between the target term and its nearest neighbours (instead of a vector of neighbours). Note that a pre-computed distance matrix M must be symmetric in this case.
...: additional arguments are passed to dist.matrix if M is a scored DSM matrix. See the manpage of dist.matrix for details on available parameters and settings.
batchsize: if term is a long list of lookup terms, it will automatically be processed in batches. The number of terms per batch is chosen in such a way that approximately batchsize intermediate similarity values have to be computed and stored at a time (not used if M is a pre-computed distance matrix).
verbose: if TRUE, display some progress messages indicating how data are split into batches
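Two of the arguments above are easiest to understand by trying them out: a target given as a vector, and the handling of unknown terms. This is a hedged sketch, assuming the wordspace package and its DSM_Vectors data are available; the centroid target and the label "no_such_term_X" are made up for illustration.

```r
library(wordspace)

# Target given as a vector: here the centroid of two existing term vectors
# (a hypothetical target, chosen only for illustration)
centroid <- (DSM_Vectors["apple_N", ] + DSM_Vectors["walk_V", ]) / 2
nn.centroid <- nearest.neighbours(DSM_Vectors, centroid, n = 5)
nn.centroid

# Unknown target terms: the second label below does not exist in the DSM
terms <- c("apple_N", "no_such_term_X")

# Default (skip.missing=FALSE): the lookup fails with an error
res <- tryCatch(
  nearest.neighbours(DSM_Vectors, terms, n = 3),
  error = function(e) "error"
)
res

# skip.missing=TRUE: the unknown term is dropped, results are returned for the rest
nn <- nearest.neighbours(DSM_Vectors, terms, n = 3,
                         skip.missing = TRUE, drop = FALSE)
names(nn)
```

Combining skip.missing=TRUE with drop=FALSE keeps the result a list even if only one of the requested terms is found.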
Stephanie Evert (https://purl.org/stephanie.evert)
In most cases, the target term itself is automatically excluded from the list of neighbours. There are two exceptions:
The target term is given as a vector rather than by name.
Nearest neighbours are determined in a cross-distance setting. This is the case if (i) M2 is specified or (ii) M is a pre-computed distance matrix and not marked to be symmetric.
With dist.matrix=TRUE, the returned distance matrix always includes the target term.
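The exclusion rules described above can be checked with a short experiment. This sketch assumes the wordspace package and its DSM_Vectors data; it looks up the same term once by name and once as a vector, and the expectations in the comments follow from the rules stated here rather than from verified output.

```r
library(wordspace)

# Lookup by name: the target term itself is excluded from its neighbours
by.name <- nearest.neighbours(DSM_Vectors, "apple_N", n = 5)
"apple_N" %in% names(by.name)

# The same vector passed directly: it is treated as an anonymous target,
# so the row for apple_N is expected to show up among the neighbours
by.vector <- nearest.neighbours(DSM_Vectors, DSM_Vectors["apple_N", ], n = 5)
"apple_N" %in% names(by.vector)

# Cross-distance setting: neighbours are searched in M2 and, according to the
# rules above, the target term is not removed even though M2 equals M here
cross <- nearest.neighbours(DSM_Vectors, "apple_N", M2 = DSM_Vectors, n = 5)
names(cross)
```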
M can also be a pre-computed distance or similarity matrix from an external source, which must be marked with as.distmat. If M is a sparse similarity matrix, only non-zero cells will be considered when looking for the nearest neighbours. Keep in mind that dist.matrix=TRUE is only valid if M is a symmetric matrix and marked as such.
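As an illustration of the external-matrix case, the sketch below builds a small cosine similarity matrix in base R and marks it with as.distmat before searching it. It assumes the wordspace package is installed and that as.distmat accepts the similarity and symmetric flags for a plain matrix; the 20-row subset is arbitrary and stands in for data from an external source.

```r
library(wordspace)

# Stand-in for an external similarity matrix: cosine similarities between
# the first 20 rows of DSM_Vectors, computed with base R only
V <- DSM_Vectors[1:20, ]
V <- V / sqrt(rowSums(V^2))   # normalise rows to unit length
S <- tcrossprod(V)            # V %*% t(V): symmetric, labelled by term

# Mark the matrix as a symmetric similarity matrix
S <- as.distmat(S, similarity = TRUE, symmetric = TRUE)

target <- rownames(S)[1]

# Neighbours as a named vector of similarities
nn <- nearest.neighbours(S, target, n = 5)
nn

# Full matrix for the target and its neighbours; only valid because S is
# symmetric and marked as such
nn.mat <- nearest.neighbours(S, target, n = 5, dist.matrix = TRUE)
dim(nn.mat)
```

tcrossprod is used instead of an explicit matrix product so that the result is exactly symmetric rather than symmetric up to rounding error.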
dist.matrix for more information on available distance metrics and similarity measures
nearest.neighbours(DSM_Vectors, c("apple_N", "walk_V"), n=10)
nearest.neighbours(DSM_Vectors, "apple_N", n=10, method="maximum")
as.dist(nearest.neighbours(DSM_Vectors, "apple_N", n=10, dist.matrix=TRUE))