It estimates the percentage of correct classification via an m-fold cross validation.
dirknn.tune(ina, x, k = 2:10, mesos = TRUE, nfolds = 10, folds = NULL,
parallel = FALSE, stratified = TRUE, seed = NULL, rann = FALSE, graph = FALSE)
A list including:
The average percent of correct classification across the neighbours.
The estimated (optimal) percent of correct classification.
The run time of the algorithm. A numeric vector. The first element is the user time, the second element is the system time and the third element is the elapsed time.
The data, a numeric matrix with unit vectors.
A variable indicating the groups of the data x.
How many folds to create?
A vector with the number of nearest neighbours to consider.
A boolean variable used only in the case of the non standard algorithm (type="NS"). Should the average of the distances be calculated (TRUE) or not (FALSE)? If it is FALSE, the harmonic mean is calculated.
Do you already have a list with the folds? If not, leave this NULL.
If you want the standard -NN algorithm to take place in parallel set this equal to TRUE.
Should the folds be created in a stratified way? i.e. keeping the distribution of the groups similar through all folds?
If seed is TRUE, the results will always be the same.
If you have large scale datasets and want a faster k-NN search, you can use kd-trees implemented in the R package "RANN". In this case you must set this argument equal to TRUE.
If this is TRUE a graph with the results will appear.
Michail Tsagris.
R implementation and documentation: Michail Tsagris mtsagris@uoc.gr.
The standard algorithm is to keep the k nearest observations and see the groups of these observations. The new observation is allocated to the most frequent seen group. The non standard algorithm is to calculate the classical mean or the harmonic mean of the k nearest observations for each group. The new observation is allocated to the group with the smallest mean distance.
We have made an eficient (not very much efficient though) memory allocation. Even if you have hundreds of thousands of observations, the computer will not clush, it will only take longer. Instead of calculate the distance matrix once in the beginning we calcualte the distances of the out-of-sample observations from the rest. If we calculated the distance matrix in the beginning, once, the resulting matrix could have dimensions thousands by thousands. This would not fit into the memory. If you have a few hundres of observations, the runtime is about the same (maybe less, maybe more) as calculating the distance matrix in the first place.
Tsagris M. and Alenazi A. (2019). Comparison of discriminant analysis methods on the sphere. Communications in Statistics: Case Studies, Data Analysis and Applications, 5(4), 467--491.
dirknn, dirda, mixvmf.mle
k <- runif(4, 4, 20)
prob <- c(0.2, 0.4, 0.3, 0.1)
mu <- matrix(rnorm(16), ncol = 4)
mu <- mu / sqrt( rowSums(mu^2) )
da <- rmixvmf(200, prob, mu, k)
x <- da$x
ina <- da$id
dirknn.tune(ina, x, k = 2:6, nfolds = 5, mesos = TRUE)
dirknn.tune(ina, x, k = 2:6, nfolds = 10, mesos = TRUE)
Run the code above in your browser using DataLab