lle(X, m, k, reg = 2, ss = FALSE, p = 0.5, id = FALSE,
nnk = TRUE, eps = 1, iLLE = FALSE, v = 0.99)
calc_k
(ss==TRUE) index vector of the data kept during subset selection
(id==TRUE) vector of intrinsic dimension for every data point
plot_lle
If id is true, the intrinsic dimension of the data is calculated automatically while the function runs. Because the intrinsic dimension is calculated for every data point $x_i$, the result is a vector of length $N$. The mean and the mode of this vector are then used to represent the overall intrinsic dimension of the data.
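As a minimal sketch of that summary step, the mean and mode of a per-point vector can be computed in base R as follows. The vector `id_vec` is made up for illustration and merely stands in for the `id` element of the returned list; the helper names are not part of the package.

```r
# Hypothetical per-point intrinsic dimension estimates (stand-in for results$id)
id_vec <- c(2, 2, 3, 2, 1, 2, 3, 2)

# Mean of the per-point estimates
id_mean <- mean(id_vec)

# Mode: the most frequent value (the smallest value wins on ties)
id_counts <- table(id_vec)
id_mode <- as.numeric(names(id_counts)[which.max(id_counts)])

cat("mean:", id_mean, " mode:", id_mode, "\n")
```

The mode is often the more useful of the two, since it is always an integer and is not pulled away by a few outlying points.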
The reg parameter selects the regularisation method. As one step of the LLE algorithm, the inverse of the Gram matrix $G\in\mathbb{R}^{k\times k}$ has to be calculated. The rank of $G$ equals $m$, which is usually smaller than $k$, so $G$ is singular and a regularisation $G^{(i)}+r\cdot I$ should be applied. The regularisation parameter $r$ can be calculated using different methods:
reg=1: standardized sum of the eigenvalues of $G$, see Ref. 1, Ch. 3.2
reg=2: trace of the Gram matrix divided by $k$, see Ref. 2, Ch. 5.2
reg=3: constant value 3*10e-3
There is no theoretical evidence as to which method is best, but several empirical analyses have shown that method 2 works most reliably.
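The effect of the regularisation can be illustrated with a small base-R sketch. It assumes made-up data and takes $r$ literally as the trace of $G$ divided by $k$, as in the description of reg=2; the package itself may scale $r$ differently, and all variable names here are hypothetical.

```r
set.seed(1)
k <- 5                                   # number of neighbours
n <- 3                                   # input dimension (illustrative)
x_i <- rnorm(n)                          # one data point
nbrs <- matrix(rnorm(k * n), nrow = k)   # its k neighbours, one per row (made up)

# Local Gram matrix of the neighbour differences (k x k)
Z <- sweep(nbrs, 2, x_i)                 # subtract x_i from every neighbour
G <- Z %*% t(Z)

# reg = 2 as described above: r is the trace of G divided by k
r <- sum(diag(G)) / k
G_reg <- G + r * diag(k)

# Reconstruction weights: solve G_reg w = 1, then normalise to sum to one
w <- solve(G_reg, rep(1, k))
w <- w / sum(w)
```

Because the neighbour differences span at most `n = 3` dimensions here, `G` has rank below `k` and cannot be inverted directly. Adding `r * diag(k)` makes the system solvable.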
The most time-consuming step of LLE is the calculation of the eigenvalues and eigenvectors of the matrix $M\in\mathbb{R}^{N\times N}$ in the find_coords function. To reduce the dimension of $M$, that is, to reduce the number of samples $N$ in a reliable way, Ref. 1 proposes a subset selection algorithm, which is integrated into the lle function. The amount of data that is kept is set by the parameter p.
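To show why this step scales with $N$, the following sketch performs the eigen-decomposition used in the standard LLE formulation, $M = (I-W)^T(I-W)$. The weight matrix `W` is random and purely illustrative, and the code is not claimed to match the internals of find_coords.

```r
set.seed(1)
N <- 6   # number of samples
m <- 2   # target dimension

# Made-up weight matrix: zero diagonal, every row sums to one
W <- matrix(runif(N * N), N, N)
diag(W) <- 0
W <- W / rowSums(W)

I_N <- diag(N)
M <- t(I_N - W) %*% (I_N - W)   # N x N, symmetric

# eigen() returns the eigenvalues in decreasing order
e <- eigen(M, symmetric = TRUE)

# Skip the smallest eigenvalue (close to zero, constant eigenvector)
# and keep the eigenvectors of the next m smallest eigenvalues
Y <- e$vectors[, (N - 1):(N - m), drop = FALSE]
```

The decomposition works on a full $N\times N$ matrix, so keeping only a fraction of the samples shrinks both the memory footprint and the run time of this step.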
Improved LLE (iLLE) is an extension of the LLE algorithm described in Ref. 3. It increases the required memory and run time, but makes the algorithm less dependent on the number of neighbours.
The calculation of the intrinsic dimension depends strongly on a threshold value $v$. The best value for this parameter depends on the origin of the data. For very accurate data a value above 0.99 is proposed; for very raw data a value of 0.9 is proposed. This parameter should be varied if a specific intrinsic dimension is expected but other results are calculated. Higher values of $v$ lead to higher calculated intrinsic dimensions.
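The text does not spell out how $v$ enters the calculation. One common reading, assumed here purely for illustration, is that $v$ is a threshold on the cumulative share of the local eigenvalues. The function `estimate_id` and the eigenvalues below are hypothetical and not part of the package.

```r
# Hypothetical eigenvalues of a local Gram matrix, sorted in decreasing order
ev <- c(5, 3, 1.5, 0.3, 0.2)

# Assumed rule: the smallest number of leading eigenvalues whose
# share of the total exceeds the threshold v
estimate_id <- function(ev, v) {
  share <- cumsum(ev) / sum(ev)
  which(share > v)[1]
}

id_raw      <- estimate_id(ev, 0.90)   # threshold suggested for very raw data
id_accurate <- estimate_id(ev, 0.99)   # threshold suggested for very accurate data
cat("v = 0.90:", id_raw, " v = 0.99:", id_accurate, "\n")
```

Under this assumed rule, raising $v$ from 0.9 to 0.99 raises the estimate for these made-up eigenvalues from 3 to 5, which matches the behaviour described above.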
# perform LLE
data( lle_scurve_data )
X <- lle_scurve_data
results <- lle( X=X, m=2, k=12, reg=2, ss=FALSE, id=TRUE, v=0.9 )
str( results )
# plot results and intrinsic dimension (manually)
split.screen( c(2,1) )
screen(1)
plot( results$Y, main="embedded data", xlab=expression(y[1]), ylab=expression(y[2]) )
screen(2)
plot( results$id, main="intrinsic dimension", type="l", xlab=expression(x[i]), ylab="id", lwd=2 )