k-Nearest Neighbour Imputation based on a variation of the Gower Distance for numerical, categorical, ordered and semi-continuous variables.
kNN(
data,
variable = colnames(data),
metric = NULL,
k = 5,
dist_var = colnames(data),
weights = NULL,
numFun = median,
catFun = maxCat,
makeNA = NULL,
NAcond = NULL,
impNA = TRUE,
donorcond = NULL,
mixed = vector(),
mixed.constant = NULL,
trace = FALSE,
imp_var = TRUE,
imp_suffix = "imp",
addRF = FALSE,
onlyRF = FALSE,
addRandom = FALSE,
useImputedDist = TRUE,
weightDist = FALSE,
methodStand = "range",
ordFun = medianSamp
)
Returns the imputed data set.
data: data.frame or matrix
variable: variables where missing values should be imputed
metric: metric to be used for calculating the distances between observations
k: number of Nearest Neighbours used
dist_var: names of the variables to be used for distance calculation
weights: weights for the variables for distance calculation. If weights = "auto", weights will be selected based on variable importance from random forest regression, using function ranger::ranger(). Weights are calculated for each variable separately.
numFun: function for aggregating the k Nearest Neighbours in the case of a numerical variable
catFun: function for aggregating the k Nearest Neighbours in the case of a categorical variable
makeNA: list of length equal to the number of variables, with values that should be converted to NA for each variable
NAcond: list of length equal to the number of variables, with a condition for imputing an NA
impNA: TRUE/FALSE whether NA should be imputed
donorcond: list of length equal to the number of variables, with a donor condition as character string, e.g. a list element can be ">5" or c(">5","<10"). If the list element for a variable is NULL, no condition will be applied for this variable.
mixed: names of mixed variables
mixed.constant: vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability
trace: TRUE/FALSE if additional information about the imputation process should be printed
imp_var: TRUE/FALSE if a TRUE/FALSE variable should be created for each imputed variable to show the imputation status
imp_suffix: suffix for the TRUE/FALSE variables showing the imputation status
addRF: TRUE/FALSE if each variable should be modelled using random forest regression (ranger::ranger()) and used as an additional distance variable
onlyRF: TRUE/FALSE; if TRUE, only the additional distance variables created from random forest regression will be used as distance variables
addRandom: TRUE/FALSE if an additional random variable should be added for distance calculation
useImputedDist: TRUE/FALSE if an imputed value should be used for distance calculation when imputing another variable. Be aware that this results in a dependency on the ordering of the variables.
weightDist: TRUE/FALSE if the distances of the k nearest neighbours should be used as weights in the aggregation step
methodStand: either "range" or "iqr", to be used in the standardization of numeric variables in the Gower distance
ordFun: function for aggregating the k Nearest Neighbours in the case of an ordered factor variable
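To make the role of the distance arguments more concrete, the following is a minimal base R sketch of a Gower-style distance between one observation and a set of candidate donors. It is an illustration only and not the code VIM uses internally: the helper `gower_sketch()` and its arguments `x`, `donors` and `method_stand` are hypothetical names, it handles only numeric and unordered categorical columns (no ordered or semi-continuous variables, no missing values), and it assumes the `weights` and `methodStand` arguments act roughly as per-variable weights and as the choice of scaling for numeric differences.

```r
# Hypothetical helper, not part of VIM: Gower-style distance from one row `x`
# to every row of `donors` (numeric and unordered categorical columns only).
gower_sketch <- function(x, donors, weights = NULL,
                         method_stand = c("range", "iqr")) {
  method_stand <- match.arg(method_stand)
  p <- ncol(donors)
  if (is.null(weights)) weights <- rep(1, p)
  contrib <- matrix(0, nrow = nrow(donors), ncol = p)
  for (j in seq_len(p)) {
    col <- donors[[j]]
    if (is.numeric(col)) {
      # Scale absolute differences by the range or the interquartile range
      scale_j <- if (method_stand == "range") {
        diff(range(col, na.rm = TRUE))
      } else {
        IQR(col, na.rm = TRUE)
      }
      if (scale_j == 0) scale_j <- 1  # guard against constant columns
      contrib[, j] <- abs(col - x[[j]]) / scale_j
    } else {
      # Simple matching for categorical columns: 0 if equal, 1 otherwise
      contrib[, j] <- as.numeric(as.character(col) != as.character(x[[j]]))
    }
  }
  # Weighted average of the per-variable contributions
  as.vector(contrib %*% weights) / sum(weights)
}

donors <- data.frame(
  height = c(150, 160, 170, 180),
  group  = factor(c("a", "b", "a", "b"))
)
target <- data.frame(height = 165, group = factor("a", levels = c("a", "b")))

d <- gower_sketch(target, donors)
nearest <- order(d)[1:2]  # indices of the two nearest donors (k = 2)
```

In this toy data the two nearest donors are the rows sharing the target's group, because a category mismatch contributes the maximum per-variable distance of 1.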
Alexander Kowarik, Statistik Austria
A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.
Other imputation methods: hotdeck(), impPCA(), irmi(), matchImpute(), medianSamp(), rangerImpute(), regressionImp(), sampleCat()
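library(VIM)
data(sleep, package = "VIM")  # base R also ships a different `sleep` data set
# Impute all variables with the default settings (k = 5)
kNN(sleep)
# Aggregate numeric donors with a distance-weighted mean from laeken
library(laeken)
kNN(sleep, numFun = weightedMean, weightDist = TRUE)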
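As a further illustration of the `variable`, `dist_var` and `k` arguments, the sketch below imputes only two columns of the VIM `sleep` data and restricts the distance calculation to three others. The choice of columns and of `k = 3` is arbitrary and only meant to show the call pattern; it assumes the VIM package is installed.

```r
library(VIM)
data(sleep, package = "VIM")

# Impute only NonD and Dream, using three other columns to find neighbours
imp <- kNN(
  sleep,
  variable = c("NonD", "Dream"),
  dist_var = c("BodyWgt", "BrainWgt", "Span"),
  k = 3
)

# Count remaining missing values in the two imputed variables
colSums(is.na(imp[, c("NonD", "Dream")]))

# With imp_var = TRUE (the default), logical indicator columns named
# <variable>_<imp_suffix> flag which values were imputed
table(imp$NonD_imp)
```

Setting `imp_var = FALSE` would return the data without the indicator columns.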