nearest.neighbours: Find Nearest Neighbours in DSM Space (wordspace)

Description

Find the nearest neighbours of a term vector in a DSM, given either as a scored cooccurrence matrix or a pre-computed distance matrix. The target term can be selected by name (in which case the cooccurrence or distance matrix must be labelled appropriately) or specified as a vector (if the DSM is given as a matrix).

Usage

nearest.neighbours(M, term, n = 10, M2 = NULL, byrow = TRUE,
                   drop = TRUE, skip.missing = FALSE, dist.matrix = FALSE,
                   ..., batchsize=50e6, verbose=FALSE)

Value

A list with one entry for each target term found in M, giving

dist.matrix=FALSE (default): the nearest neighbours as a numeric vector of distances or similarities, labelled with the corresponding terms and ordered from nearest to farthest
dist.matrix=TRUE: a full distance or similarity matrix for the target term and its nearest neighbours (as an object of class dist.matrix). An additional attribute selected contains a logical vector indicating the position of the target term in the matrix.

If drop=TRUE, a list containing only a single target term will be simplified to a plain vector or distance matrix.
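The return shapes described above can be inspected with a short sketch. It assumes the wordspace package is installed and uses its bundled DSM_Vectors sample data; the variable names (nn_vec, nn_list, nn_dm) are illustrative only.

```r
library(wordspace)

# Single target with drop=TRUE (default): a plain labelled numeric vector
nn_vec <- nearest.neighbours(DSM_Vectors, "apple_N", n = 5)
is.numeric(nn_vec)
names(nn_vec)

# Same query with drop=FALSE: a list with one entry for the target
nn_list <- nearest.neighbours(DSM_Vectors, "apple_N", n = 5, drop = FALSE)
is.list(nn_list)

# dist.matrix=TRUE: an object of class dist.matrix covering target + neighbours
nn_dm <- nearest.neighbours(DSM_Vectors, "apple_N", n = 5, dist.matrix = TRUE)
class(nn_dm)
attr(nn_dm, "selected")  # logical vector marking the target's position
```

Setting drop=FALSE is the safer choice in code that loops over results, since the return type then does not depend on how many targets were found.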

Arguments

M: either a dense or sparse matrix representing a scored DSM (or an object of class dsm), or a pre-computed distance matrix returned by dist.matrix (as an object of class dist.matrix). Note that the compact representation produced by the dist function (class dist) is not accepted.
term: either a character vector specifying one or more target terms for which nearest neighbours will be found, or a matrix specifying the target vectors directly. A plain numeric vector is interpreted as a single-row matrix.
n: an integer giving the number of nearest neighbours to be returned for each target term
M2: an optional dense or sparse matrix (or object of class dsm). If specified, nearest neighbours are found among the rows (default) or columns (byrow=FALSE) of M2, allowing for NN search in a cross-distance setting.
byrow: whether target terms are looked up in rows (default) or columns (byrow=FALSE) of M. NB: Target vectors in the term argument are always given as row vectors, even if byrow=FALSE.
drop: if TRUE, the return value is simplified to a vector (or distance matrix) if it contains nearest neighbours for exactly one target term (default). Set drop=FALSE to ensure that nearest.neighbours always returns a list.
skip.missing: if TRUE, silently ignores target terms not found in the DSM or distance matrix. By default (skip.missing=FALSE) an error is raised in this case.
dist.matrix: if TRUE, return a full distance matrix between the target term and its nearest neighbours (instead of a vector of neighbours). Note that a pre-computed distance matrix M must be symmetric in this case.
...: additional arguments are passed to dist.matrix if M is a scored DSM matrix. See the manpage of dist.matrix for details on available parameters and settings.
batchsize: if term is a long list of lookup terms, it will automatically be processed in batches. The number of terms per batch is chosen in such a way that approximately batchsize intermediate similarity values have to be computed and stored at a time (not used if M is a pre-computed distance matrix).
verbose: if TRUE, display some progress messages indicating how data are split into batches
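As a hedged illustration of the M2 and skip.missing arguments, the sketch below restricts the search space to nouns and tolerates an unknown target. It assumes the DSM_Vectors sample data shipped with wordspace and its "_N" / "_V" part-of-speech suffixes as seen in the examples on this page; "no_such_term_X" is a made-up label that is assumed not to occur in the data.

```r
library(wordspace)

# Candidate neighbours: only rows whose label ends in "_N" (nouns)
is_noun <- grepl("_N$", rownames(DSM_Vectors))
nouns <- DSM_Vectors[is_noun, ]

# Target is looked up in M, neighbours are searched among the rows of M2
nn_nouns <- nearest.neighbours(DSM_Vectors, "walk_V", n = 5, M2 = nouns)
names(nn_nouns)

# skip.missing=TRUE drops the unknown term instead of raising an error;
# drop=FALSE keeps the result a list even though one target remains
nn_safe <- nearest.neighbours(DSM_Vectors, c("apple_N", "no_such_term_X"),
                              n = 5, skip.missing = TRUE, drop = FALSE)
length(nn_safe)
```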

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Details

In most cases, the target term itself is automatically excluded from the list of neighbours. There are two exceptions:

The target term is given as a vector rather than by name.
Nearest neighbours are determined in a cross-distance setting. This is the case if (i) M2 is specified or (ii) M is a pre-computed distance matrix that is not marked as symmetric.

With dist.matrix=TRUE, the returned distance matrix always includes the target term.
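The first exception can be checked with a small sketch that looks up the same target once by name and once as a vector. It assumes the DSM_Vectors sample data from wordspace; by_name and by_vec are illustrative names.

```r
library(wordspace)

# Lookup by name: the target itself should be excluded from its neighbours
by_name <- nearest.neighbours(DSM_Vectors, "apple_N", n = 3)
"apple_N" %in% names(by_name)

# Lookup by vector: the function cannot know which row the vector came from,
# so the row for "apple_N" is expected to show up as its own nearest neighbour
by_vec <- nearest.neighbours(DSM_Vectors, DSM_Vectors["apple_N", ], n = 3)
"apple_N" %in% names(by_vec)
```

When a vector target is taken from the DSM itself, one way to compensate is to request n + 1 neighbours and discard the matching entry afterwards.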

M can also be a pre-computed distance or similarity matrix from an external source, which must be marked with as.distmat. If M is a sparse similarity matrix, only non-zero cells will be considered when looking for the nearest neighbours. Keep in mind that dist.matrix=TRUE is only valid if M is a symmetric matrix and marked as such.
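A minimal sketch of searching a pre-computed distance matrix, under stated assumptions: the wordspace package is installed, the first 50 rows of DSM_Vectors serve as a toy sample, and base R's dist() stands in for an "external" source of distances. Because objects of class dist are not accepted, the external distances are expanded with as.matrix() and then marked with as.distmat(); the symmetric=TRUE argument is used here on the assumption that this is how symmetry is declared for a plain matrix.

```r
library(wordspace)

# Small labelled sample so the full distance matrix stays tiny
M <- DSM_Vectors[1:50, ]
target <- rownames(M)[1]

# (a) distance matrix computed by wordspace itself
D <- dist.matrix(M, method = "cosine")
nn_a <- nearest.neighbours(D, target, n = 3)

# (b) "external" distances: base R Euclidean distances as a full square matrix,
#     marked as a symmetric distance matrix for wordspace
D_ext <- as.distmat(as.matrix(dist(M)), symmetric = TRUE)
nn_b <- nearest.neighbours(D_ext, target, n = 3)

# dist.matrix=TRUE requires a matrix that is marked as symmetric
nn_b_full <- nearest.neighbours(D_ext, target, n = 3, dist.matrix = TRUE)
```

Note that the additional arguments in ... (such as method) are only passed on when M is a scored DSM matrix, so the distance measure has to be fixed when the matrix is computed.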

Examples

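library(wordspace)  # provides nearest.neighbours() and the DSM_Vectors sample data

# ten nearest neighbours for each of two target terms (returns a list)
nearest.neighbours(DSM_Vectors, c("apple_N", "walk_V"), n=10)

# method="maximum" is passed on to dist.matrix() via ...
nearest.neighbours(DSM_Vectors, "apple_N", n=10, method="maximum")

# full distance matrix for the target and its neighbours, converted to class dist
as.dist(nearest.neighbours(DSM_Vectors, "apple_N", n=10, dist.matrix=TRUE))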
