- data
A data.frame (or coercible) object, or an object of class `midas_pre` created by `rMIDAS::convert()`.
- binary_columns
A vector of column names identifying binary variables. NOTE: if `data` is a `midas_pre` object, this argument will be overwritten.
- softmax_columns
A list of lists, each internal list corresponding to a single categorical variable and containing the names of its one-hot encoded columns (the expected structure is illustrated in the usage sketch after this list). NOTE: if `data` is a `midas_pre` object, this argument will be overwritten.
- spikein
A numeric between 0 and 1; the proportion of observed values in the input dataset to be randomly removed.
- training_epochs
An integer, specifying the number of overimputation training epochs.
- report_ival
An integer, specifying the number of overimputation training epochs between calculations of loss. Shorter intervals provide a more granular view of model performance but slow down the overimputation process.
- plot_vars
Boolean, specifies whether to plot the distribution of original versus overimputed values. This takes the form of a density plot for continuous variables and a barplot for categorical variables (showing proportions of each class).
- skip_plot
Boolean, specifies whether to suppress the main graphical output. This may be desirable when users are conducting a series of overimputation exercises and are primarily interested in the console output. Note, when `skip_plot = FALSE`, users must manually close the resulting pyplot window before the code will terminate.
- spike_seed, seed
An integer, to initialize the pseudo-random number generators. Separate seeds can be provided for the spiked-in missingness and the imputation; otherwise `spike_seed` is set to `seed` (default = 123L).
- save_path
String, indicating the path to the directory in which to save overimputation figures. Users should include a trailing "/" at the end of the path, e.g. `save_path = "path/to/figures/"`.
- layer_structure
A vector of integers, the number of nodes in each layer of the network (default = `c(256, 256, 256)`, denoting a three-layer network with 256 nodes per layer). Larger networks can learn more complex data structures but require longer training and are more prone to overfitting.
- learn_rate
A number, the learning rate \(\gamma\) (default = 0.0001), which controls the size of the weight adjustment in each training epoch. In general, higher values reduce training time at the expense of less accurate results.
- input_drop
A number between 0 and 1. The probability of corruption for input columns in training mini-batches (default = 0.8). Higher values increase training time but reduce the risk of overfitting. In our experience, values between 0.7 and 0.95 deliver the best performance.
- train_batch
An integer, the number of observations in training mini-batches (default = 16).
- latent_space_size
An integer, the number of normal dimensions used to parameterize the latent space.
- cont_adj
A number, weighting the importance of continuous variables in the loss function.
- binary_adj
A number, weighting the importance of binary variables in the loss function.
- softmax_adj
A number, weighting the importance of categorical variables in the loss function.
- dropout_level
A number between 0 and 1, determining the proportion of nodes dropped to "thin" the network.
- vae_layer
Boolean, specifies whether to include a variational autoencoder layer in the network.
- vae_alpha
A number, the strength of the prior imposed on the Kullback-Leibler divergence term in the variational autoencoder loss function.
- vae_sample_var
A number, the sampling variance of the normal distributions used to parameterize the latent space.
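Taken together, a typical workflow pre-processes the data with `convert()` and then passes the resulting `midas_pre` object to `overimpute()`. The sketch below is a minimal, non-authoritative example assuming the arguments above document `rMIDAS::overimpute()`; the data frame `adult`, its column names, and the save directory are hypothetical illustrations.

```r
library(rMIDAS)

# Hypothetical toy data with continuous, binary, and categorical columns
adult <- data.frame(
  age       = c(39, 50, 38, 53, NA, 37),
  income    = c(2174, 0, NA, 0, 0, 0),
  sex       = c("Male", "Female", "Male", NA, "Female", "Male"),
  workclass = c("Private", "Self_emp", "Private", "Private", NA, "Gov")
)

# convert() returns a midas_pre object that records the binary and one-hot
# encoded categorical columns, so binary_columns and softmax_columns are
# filled in automatically (and would be overwritten if supplied).
adult_conv <- convert(adult,
                      bin_cols     = c("sex"),
                      cat_cols     = c("workclass"),
                      minmax_scale = TRUE)

# Spike in 10% additional missingness, train for 100 epochs, and report the
# loss every 25 epochs; figures are written to the (pre-existing) directory.
overimpute(adult_conv,
           spikein         = 0.1,
           training_epochs = 100,
           report_ival     = 25,
           plot_vars       = TRUE,
           spike_seed      = 123L,
           seed            = 123L,
           save_path       = "figures/",  # note the trailing "/"
           layer_structure = c(256, 256, 256),
           input_drop      = 0.8)

# If `data` were instead a plain data.frame with pre-encoded columns, the
# categorical structure would be passed explicitly as a list of lists, e.g.:
# softmax_columns = list(list("workclass_Private", "workclass_Self_emp",
#                             "workclass_Gov"))
```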