kNN: k-Nearest Neighbour Imputation

Description

k-Nearest Neighbour Imputation based on a variation of the Gower distance for numerical, categorical, ordered and semi-continuous variables.

Usage

kNN(
  data,
  variable = colnames(data),
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE,
  methodStand = "range",
  ordFun = medianSamp
)

Value

The imputed data set.

Arguments

data: data.frame or matrix
variable: variables where missing values should be imputed
k: number of Nearest Neighbours used
dist_var: names of variables to be used for distance calculation
weights: weights for the variables used in the distance calculation. If weights = "auto", weights are selected based on variable importance from random forest regression, using the function ranger::ranger(). Weights are calculated for each variable separately.
numFun: function for aggregating the k Nearest Neighbours in the case of a numerical variable
catFun: function for aggregating the k Nearest Neighbours in the case of a categorical variable
makeNA: list of length equal to the number of variables, with values that should be converted to NA for each variable
NAcond: list of length equal to the number of variables, with a condition for imputing an NA
impNA: TRUE/FALSE whether NA should be imputed
donorcond: list of length equal to the number of variables, with a donor condition as a character string, e.g. a list element can be ">5" or c(">5","<10"). If the list element for a variable is NULL, no condition will be applied for this variable.
mixed: names of mixed variables
mixed.constant: vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability
trace: TRUE/FALSE if additional information about the imputation process should be printed
imp_var: TRUE/FALSE if a TRUE/FALSE variable should be created for each imputed variable to show the imputation status
imp_suffix: suffix for the TRUE/FALSE variables showing the imputation status
addRF: TRUE/FALSE if each variable should be modelled using random forest regression (ranger::ranger()) and the result used as an additional distance variable
onlyRF: TRUE/FALSE if only the additional distance variables created from random forest regression should be used as distance variables
addRandom: TRUE/FALSE if an additional random variable should be added for distance calculation
useImputedDist: TRUE/FALSE if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables.
weightDist: TRUE/FALSE if the distances of the k nearest neighbours should be used as weights in the aggregation step
methodStand: either "range" or "iqr", the method used to standardize numeric variables in the Gower distance
ordFun: function for aggregating the k Nearest Neighbours in the case of an ordered factor variable
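
To illustrate the idea behind dist_var, methodStand = "range", k and numFun, here is a minimal base R sketch of a Gower-style distance on complete toy data. It is not VIM's internal implementation: gower_sketch() and the toy data frame are hypothetical names introduced for this illustration, and the sketch ignores weights, ordered and semi-continuous variables, and missing values in the distance variables.

```r
# Simplified Gower-style distance from row i to every row of df.
# Assumption: df has no missing values and only numeric or categorical columns.
gower_sketch <- function(df, i) {
  contrib <- do.call(cbind, lapply(df, function(col) {
    if (is.numeric(col)) {
      rng <- diff(range(col))            # range standardization
      if (rng == 0) rep(0, length(col)) else abs(col - col[i]) / rng
    } else {
      # categorical: 0 if the categories match, 1 otherwise
      as.numeric(as.character(col) != as.character(col[i]))
    }
  }))
  rowMeans(contrib)                      # unweighted mean over variables
}

toy <- data.frame(
  x = c(1, 2, 5),
  g = factor(c("a", "a", "b"))
)

d <- gower_sketch(toy, 1)

# Pick the k nearest donors for row 1 (excluding row 1 itself)
# and aggregate a numeric target with the median, as numFun = median would.
k <- 2
donors <- setdiff(order(d), 1)[seq_len(k)]
target <- c(NA, 10, 20)
imputed_value <- median(target[donors])
```

In this toy case the distances from row 1 are 0, 0.125 and 1, so rows 2 and 3 are the two donors and the median of their target values is 15.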

Author

Alexander Kowarik, Statistik Austria

References

A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.

Examples

library(VIM)     # provides kNN() and the sleep data set
data(sleep)
kNN(sleep)
library(laeken)  # provides weightedMean()
kNN(sleep, numFun = weightedMean, weightDist = TRUE)
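
A further sketch, assuming the VIM package is installed, shows how variable, dist_var, k and imp_var from the argument list above might be combined. The choice of columns is only an illustration: NonD and Dream are imputed, and BodyWgt, BrainWgt and Danger are used for the distance calculation.

```r
library(VIM)
data(sleep, package = "VIM")

# Impute only NonD and Dream, using three other columns
# of the sleep data for the distance calculation.
imp <- kNN(
  sleep,
  variable = c("NonD", "Dream"),
  dist_var = c("BodyWgt", "BrainWgt", "Danger"),
  k = 3,
  imp_var = TRUE
)

# With imp_var = TRUE, logical indicator columns named with the
# imp_suffix (here "NonD_imp" and "Dream_imp") mark the imputed cells.
table(imp$NonD_imp)
```

Restricting dist_var to a few variables keeps the donor search easy to reason about. With useImputedDist = TRUE and overlapping variable and dist_var sets, the result can depend on the order of the variables, as noted in the argument description.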