Learn R Programming

largeVis (version 0.2.1.1)

projectKNNs: Project a distance matrix into a lower-dimensional space.

Description

Takes as input a sparse matrix of the edge weights connecting each node to its nearest neighbors, and outputs a matrix of coordinates embedding the inputs in a lower-dimensional space.

Usage

projectKNNs(wij, dim = 2, sgd_batches = NULL, M = 5, gamma = 7,
  alpha = 1, rho = 1, coords = NULL, useDegree = FALSE,
  momentum = NULL, seed = NULL, threads = NULL,
  verbose = getOption("verbose", TRUE))

Arguments

wij

A symmetric sparse matrix of edge weights, in C-compressed format, as created with the Matrix package.

dim

The number of dimensions for the projection space.

sgd_batches

The number of edges to process during SGD. Defaults to a value set based on the size of the dataset. If the parameter given is between 0 and 1, the default value will be multiplied by the parameter.

M

The number of negative edges to sample for each positive edge.

gamma

The strength of the force pushing non-neighbor nodes apart.

alpha

Hyperparameter used in the default distance function, \(1 / (1 + \alpha \dot ||y_i - y_j||^2)\). The function relates the distance between points in the low-dimensional projection to the likelihood that the two points are nearest neighbors. Increasing \(\alpha\) tends to push nodes and their neighbors closer together; decreasing \(\alpha\) produces a broader distribution. Setting \(\alpha\) to zero enables the alternative distance function. \(\alpha\) below zero is meaningless.

rho

Initial learning rate.

coords

An initialized coordinate matrix.

useDegree

Whether to use vertex degree to determine weights in negative sampling (if TRUE), or the sum of the vertex's edges (the default). See Notes.

momentum

If not NULL (the default), SGD with momentum is used, with this multiplier, which must be between 0 and 1. Note that momentum can drastically speed-up training time, at the cost of additional memory consumed.

seed

Random seed to be passed to the C++ functions; sampled from hardware entropy pool if NULL (the default). Note that if the seed is not NULL (the default), the maximum number of threads will be set to 1 in phases of the algorithm that would otherwise be non-deterministic.

threads

The maximum number of threads to spawn. Determined automatically if NULL (the default).

verbose

Verbosity

Value

A dense [N,D] matrix of the coordinates projecting the w_ij matrix into the lower-dimensional space.

Details

The algorithm attempts to estimate a dim-dimensional embedding using stochastic gradient descent and negative sampling.

The objective function is: $$ O = \sum_{(i,j)\in E} w_{ij} (\log f(||p(e_{ij} = 1||) + \sum_{k=1}^{M} E_{jk~P_{n}(j)} \gamma \log(1 - f(||p(e_{ij_k} - 1||)))$$ where \(f()\) is a probabilistic function relating the distance between two points in the low-dimensional projection space, and the probability that they are nearest neighbors.

The default probabilistic function is \(1 / (1 + \alpha \dot ||x||^2)\). If \(\alpha\) is set to zero, an alternative probabilistic function, \(1 / (1 + \exp(x^2))\) will be used instead.

Note that the input matrix should be symmetric. If any columns in the matrix are empty, the function will fail.

Examples

Run this code
# NOT RUN {
data(CO2)
CO2$Plant <- as.integer(CO2$Plant)
CO2$Type <- as.integer(CO2$Type)
CO2$Treatment <- as.integer(CO2$Treatment)
co <- scale(as.matrix(CO2))
# Very small datasets often produce a warning regarding the alias table.  This is safely ignored.
suppressWarnings(vis <- largeVis(t(co), K = 20, sgd_batches = 1, threads = 2))
suppressWarnings(coords <- projectKNNs(vis$wij, threads = 2))
plot(t(coords))
# }

Run the code above in your browser using DataLab