spe: Implements the stochastic proximity embedding algorithm

Description

Embeds an N-dimensional dataset in M dimensions such that distances (or similarities) in the original N dimensions are preserved as closely as possible in the final M dimensions.

Usage

spe(coord, rcutpercent = 1, maxdist = 0, nobs = 0, ndim = 0, edim, lambda0 = 2.0, lambda1 = 0.01, nstep = 1e6, ncycle = 100, evalstress = FALSE, sampledist = TRUE, samplesize = 1e6)

Arguments

coord

A matrix with one row per observation and one column per input dimension. A data.frame may also be supplied; it will be converted to a matrix, so all names will be lost.

rcutpercent

The fraction of the maximum distance (as determined by probability sampling) that will be used as the neighborhood radius. Despite the name, the value is on a 0 to 1 scale: setting rcutpercent to a value greater than 1 effectively sets the radius to infinity.

maxdist

If you have already calculated a maximum distance, you can supply it here. The default (0) is to estimate the maximum distance by probability sampling. Setting maxdist to a nonzero value skips sampling, even if sampledist=TRUE.

nobs

The number of observations. If not specified, nobs will be taken as nrow(coord).

ndim

The number of input dimensions. If not specified, ndim will be taken as ncol(coord).

edim

The number of dimensions to embed in

lambda0

The starting value of the learning parameter

lambda1

The ending value of the learning parameter

nstep

The number of refinement steps

ncycle

The number of cycles to carry out refinement for

evalstress

If TRUE the function will evaluate the Sammon stress on the final embedding

sampledist

If TRUE an approximation to the maximum distance in the input dimensions will be obtained via probability sampling

samplesize

The number of iterations for probability sampling. For example, a dataset of 6070 observations has 6070 x 6069 / 2 = 18,419,415 pairwise distances. The default value gives a close approximation and runs quickly. For a better approximation, 1e7 is a good value, though results will vary with the dataset.
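The probability sampling described for maxdist, sampledist and samplesize can be illustrated with a short sketch. The helper sample_max_dist below is hypothetical and is not part of the package. It assumes Euclidean distances and pairs drawn uniformly with replacement, neither of which is confirmed by this page. The last line shows how a neighborhood radius would follow from an assumed rcutpercent of 0.5.

```r
# Hypothetical helper, not part of the spe package: estimate the maximum
# pairwise Euclidean distance by sampling random pairs of rows.
sample_max_dist <- function(coord, samplesize = 1e4) {
  coord <- as.matrix(coord)
  n <- nrow(coord)
  i <- sample.int(n, samplesize, replace = TRUE)
  j <- sample.int(n, samplesize, replace = TRUE)
  # Row-wise distance for each sampled pair (i == j gives 0, harmless for a max)
  d <- sqrt(rowSums((coord[i, , drop = FALSE] - coord[j, , drop = FALSE])^2))
  max(d)
}

set.seed(1)
coord <- matrix(rnorm(500 * 3), ncol = 3)
approx_max <- sample_max_dist(coord, samplesize = 1e4)  # sampled estimate
exact_max <- max(dist(coord))                           # exhaustive value, for comparison
rcut <- 0.5 * approx_max                                # radius for an assumed rcutpercent of 0.5
```

A sampled maximum can never exceed the exhaustive one, so the estimate approaches the true value from below as samplesize grows.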

Value

If evalstress is TRUE, the return value is a list with two components named x and stress: x is the matrix of the final embedding and stress is the final stress.
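To make the stress component concrete, the following sketch computes the Sammon stress in its standard textbook form over all pairs of points. The function sammon_stress is a hypothetical name introduced here for illustration. The package's own routine is eval.stress, and its exact definition, or whether it samples pairs instead of enumerating them, is not stated on this page.

```r
# Hypothetical helper, for illustration only.
# Sammon stress: sum((r - d)^2 / r) / sum(r) over all pairs i < j, where
# r is the input-space distance and d the embedded distance.
sammon_stress <- function(x, coord) {
  r <- as.vector(dist(as.matrix(coord)))  # input-space distances
  d <- as.vector(dist(as.matrix(x)))      # embedded distances
  keep <- r > 0                           # skip duplicate points to avoid 0/0
  sum((r[keep] - d[keep])^2 / r[keep]) / sum(r[keep])
}

set.seed(1)
coord <- matrix(rnorm(50 * 3), ncol = 3)
s_same <- sammon_stress(coord, coord)         # identical configuration
s_proj <- sammon_stress(coord[, 1:2], coord)  # dropping a dimension distorts distances
```

A stress of zero means every pairwise distance is reproduced exactly. Larger values indicate greater distortion.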

Details

Efficient determination of rcut (using the connected component method) is yet to be implemented. As a result, you will have to determine a value of rcutpercent by trial and error. The pivot SPE method (J. Mol. Graph. Model., 2003, 22, 133-140) is not yet implemented.
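For orientation, the pairwise refinement that lambda0, lambda1, nstep, ncycle and the neighborhood radius control can be sketched as follows. This is a minimal pure-R illustration of the update rule from the stochastic proximity embedding literature, not the package's implementation. The name spe_sketch, the eps guard, the random starting configuration and the linear decrease of the learning parameter are all assumptions made for the example.

```r
# Illustrative sketch only; spe_sketch is a hypothetical name.
spe_sketch <- function(coord, edim, rcut = Inf, lambda0 = 2.0, lambda1 = 0.01,
                       nstep = 1000, ncycle = 100, eps = 1e-10) {
  coord <- as.matrix(coord)
  n <- nrow(coord)
  x <- matrix(runif(n * edim), nrow = n)     # random starting configuration
  lambda <- lambda0
  dlambda <- (lambda0 - lambda1) / ncycle    # assumed linear decrement per cycle
  for (cycle in seq_len(ncycle)) {
    for (step in seq_len(nstep)) {
      ij <- sample.int(n, 2)                 # two distinct points
      i <- ij[1]
      j <- ij[2]
      r <- sqrt(sum((coord[i, ] - coord[j, ])^2))  # input-space distance
      d <- sqrt(sum((x[i, ] - x[j, ])^2))          # embedded distance
      # Always adjust neighbors; adjust non-neighbors only when they are too close
      if (r <= rcut || d < r) {
        delta <- lambda * 0.5 * (r - d) / (d + eps) * (x[i, ] - x[j, ])
        x[i, ] <- x[i, ] + delta
        x[j, ] <- x[j, ] - delta
      }
    }
    lambda <- lambda - dlambda               # shrink the learning parameter toward lambda1
  }
  x
}

set.seed(42)
coord <- matrix(rnorm(60 * 3), ncol = 3)
emb <- spe_sketch(coord, edim = 3, nstep = 500, ncycle = 50)
```

With rcut left at Inf every pair is treated as a neighbor, which matches the behavior described for rcutpercent values greater than 1.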

References

A Self Organizing Principle for Learning Nonlinear Manifolds, Proc. Nat. Acad. Sci., 2002, 99, 15869-15872

Stochastic Proximity Embedding, J. Comput. Chem., 2003, 24, 1215-1221

A Modified Rule for Stochastic Proximity Embedding, J. Mol. Graph. Model., 2003, 22, 133-140

A Geodesic Framework for Analyzing Molecular Similarities, J. Chem. Inf. Comput. Sci., 2003, 43, 475-484

Examples

## load the package and the phone dataset
library(spe)
data(phone)

## run SPE, embed$stress should be 0 or very close to it
## You can plot the embedding using the scatterplot3d package
## (This will take a few minutes to run)
embed <- spe(phone, edim=3, evalstress=TRUE)

## evaluate the Sammon stress
stress <- eval.stress(embed$x, phone)

## embed the Swiss Roll dataset in 2D
data(swissroll)
embed <- spe(swissroll, edim=2, evalstress=TRUE)