varselbest: Variable selection for specifying conditional imputation models

Description

varselbest performs variable selection from an incomplete dataset (see Bar-Hen and Audigier (2022) <doi:10.1080/00949655.2022.2070621>) in order to specify the imputation models used by FCS imputation methods.

Usage

varselbest(
  data.na = NULL,
  res.imputedata = NULL,
  listvar = NULL,
  nb.clust = NULL,
  nnodes = 1,
  sizeblock = 5,
  method.select = "knockoff",
  B = 200,
  r = 0.3,
  graph = TRUE,
  printflag = TRUE,
  path.outfile = NULL,
  mar = c(2, 4, 2, 0.5) + 0.1,
  cex.names = 0.7,
  modelNames = NULL
)

Value

A list of four objects:

predictormatrix: a numeric matrix of 0s and 1s specifying, on each row, the set of predictors to be used for the corresponding target column of the incomplete dataset.
res.varsel: a list giving details on the variable selection procedure (only required for checking convergence with the chooseB function)
proportion: a numeric matrix of proportions indicating, on each row, the variable importance of each predictor
call: the matching call
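
As a minimal sketch of how a matrix laid out like predictormatrix can be read, the base R code below lists the predictors retained for each target. The matrix pm and its variable names are made up for illustration; they are not output from varselbest.

```r
# Hypothetical 0/1 matrix: rows are target variables, columns are candidate
# predictors (names and values are invented for this illustration).
pm <- matrix(c(0, 1, 1,
               1, 0, 0,
               0, 1, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# For each target (row), collect the names of the columns flagged with 1.
predictors <- lapply(rownames(pm), function(v) colnames(pm)[pm[v, ] == 1])
names(predictors) <- rownames(pm)
predictors
```

Here the target "a" would be imputed from "b" and "c", while "b" would be imputed from "a" only.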

Arguments

data.na: a data frame with only numeric variables
res.imputedata: an output from imputedata
listvar: a character vector indicating for which subset of incomplete variables variable selection must be performed. By default all column names.
nb.clust: the number of clusters used for imputation
nnodes: number of CPU cores for parallel computing. By default, nnodes = 1
sizeblock: an integer indicating the number of variables sampled at each iteration
method.select: a single string indicating the variable selection method applied on each subset of variables
B: number of iterations, by default B = 200
r: a numerical vector (or a single real number) indicating the threshold used for each variable in listvar. Each value of r should be between 0 and 1. See details.
graph: a boolean. If TRUE, two graphics are plotted per variable in listvar: one reporting the variable importance measure of each explanatory variable, and one reporting the influence of the number of iterations (B) on the importance measures
printflag: a boolean. If TRUE, a message is printed at each iteration. Use printflag = FALSE for silent selection.
path.outfile: a vector of strings indicating the path for redirection of print messages. Default value is NULL, meaning that silent imputation is performed. Otherwise, print messages are saved in the files path.outfile/output.txt. One file per node is generated.
mar: a numerical vector of the form c(bottom, left, top, right). Only used if graph = TRUE
cex.names: expansion factor for axis names (bar labels) (only used if graph = TRUE)
modelNames: a vector of character strings indicating the models to be fitted in the EM phase of clustering
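
To illustrate the role of a vector-valued r, the sketch below applies one threshold per target variable to a made-up matrix of importance proportions. The values, the variable names, and the use of a strict inequality are assumptions made for this illustration, not a description of the package internals.

```r
# Hypothetical importance proportions: one row per target in listvar,
# one column per candidate predictor (values invented for illustration).
proportion <- matrix(c(0.00, 0.80, 0.10, 0.35,
                       0.60, 0.00, 0.25, 0.05),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("alco", "malic"),
                                     c("alco", "malic", "v3", "v4")))

# One threshold per target variable, each between 0 and 1.
r <- c(alco = 0.3, malic = 0.5)

# Comparing a matrix with a vector of length nrow() recycles the vector down
# each column, so row i is compared with r[i]. Assumption: strict inequality.
predictormatrix <- (proportion > r[rownames(proportion)]) * 1
predictormatrix
```

With these numbers, a lower threshold for "alco" keeps two predictors, while the stricter threshold for "malic" keeps one.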

Details

varselbest performs variable selection on random subsets of variables and then combines the results to recover which explanatory variables are related to the response. More precisely, the algorithm proceeds as follows. Consider a random subset of sizeblock variables among the p variables. When sizeblock is small, this subset is low dimensional, so missing values can be handled by a standard imputation method for clustered individuals. Any variable selection scheme can then be applied (lasso, stepwise and knockoff are available through the method.select argument). By drawing B such subsets of size sizeblock among the p variables, we can count how many times a variable is found to be significantly related to the response and how many times it is not. A threshold (r) is then needed to decide whether a given variable is significantly related to the response.
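
The resampling and counting steps described above can be sketched in base R as follows. This is a simplified illustration, not the package implementation: it assumes complete simulated data (so no imputation step is needed), uses backward stepwise selection via step() as the selection scheme, and all object names (X, y, n.sampled, n.selected) are hypothetical.

```r
# Simulated complete data: only x1 and x2 are truly related to y.
set.seed(1)
n <- 200
p <- 10
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", seq_len(p))
y <- 2 * X$x1 - 1.5 * X$x2 + rnorm(n)

sizeblock <- 5   # number of variables sampled at each iteration
B <- 100         # number of iterations
r <- 0.3         # selection threshold

n.sampled  <- setNames(numeric(p), names(X))  # times each variable was drawn
n.selected <- setNames(numeric(p), names(X))  # times it was kept when drawn

for (b in seq_len(B)) {
  block <- sample(names(X), sizeblock)            # random subset of variables
  dat <- cbind(y = y, X[, block, drop = FALSE])
  fit <- step(lm(y ~ ., data = dat), trace = 0)   # backward stepwise (AIC)
  kept <- intersect(block, attr(terms(fit), "term.labels"))
  n.sampled[block] <- n.sampled[block] + 1
  n.selected[kept] <- n.selected[kept] + 1
}

# Importance: share of draws in which the variable was retained.
proportion <- n.selected / n.sampled
# Assumption: a variable is retained when its proportion exceeds r.
selected <- names(proportion)[proportion > r]
```

Because each block contains only sizeblock variables, every model fitted in the loop stays low dimensional even when p is large; the final decision relies on the proportions aggregated over the B draws.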

References

Bar-Hen, A. and Audigier, V., An ensemble learning method for variable selection: application to high dimensional data and missing values, Journal of Statistical Computation and Simulation, <doi:10.1080/00949655.2022.2070621>, 2022.

Examples


library(clusterMI) # provides prodna, varselbest and imputedata
data(wine, package = "clusterMI")

require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)

# \donttest{
nnodes <- 2 # parallel::detectCores()
B <- 150 # Number of iterations
m <- 5 # Number of imputed data sets

# variable selection
res.varsel <- varselbest(data.na = wine.na,
                         nb.clust = nb.clust,
                         listvar = c("alco","malic"),
                         B = B,
                         nnodes = nnodes)
predictmat <- res.varsel$predictormatrix

# imputation
res.imp.select <- imputedata(data.na = wine.na, method = "FCS-homo",
                     nb.clust = nb.clust, predictmat = predictmat, m = m)
# }
