- data
A data.frame (or coercible) object, or an object of class `midas_pre` created by `rMIDAS::convert()`.
- binary_columns
A vector of column names identifying binary variables. NOTE: if `data` is a `midas_pre` object, this argument will be overwritten.
- softmax_columns
A list of lists, each internal list corresponding to a single categorical variable and containing the names of its one-hot encoded columns (the expected structure is illustrated in the usage sketch after this list). NOTE: if `data` is a `midas_pre` object, this argument will be overwritten.
- spikein
A numeric between 0 and 1; the proportion of observed values in the input dataset to be randomly removed.
- training_epochs
An integer, specifying the number of overimputation training epochs.
- report_ival
An integer, specifying the number of overimputation training epochs between calculations of loss. Shorter intervals provide a more granular view of model performance but slow down the overimputation process.
- plot_vars
Boolean, specifies whether to plot the distribution of original versus overimputed values. This takes the form of a density plot for continuous variables and a barplot for categorical variables (showing proportions of each class).
- skip_plot
Boolean, specifies whether to suppress the main graphical output. This may be desirable when users are conducting a series of overimputation exercises and are primarily interested in the console output. Note, when `skip_plot = FALSE`, users must manually close the resulting pyplot window before the code will terminate.
- spike_seed, seed
An integer, to initialize the pseudo-random number generators. Separate seeds can be provided for the spiked-in missingness and the imputation; otherwise `spike_seed` is set to `seed` (default = 123L).
- save_path
String, indicating the path to the directory in which to save overimputation figures. Users should include a trailing "/" at the end of the path, e.g. `save_path = "path/to/figures/"`.
- layer_structure
A vector of integers, the number of nodes in each layer of the network (default = `c(256, 256, 256)`, denoting a three-layer network with 256 nodes per layer). Larger networks can learn more complex data structures but require longer training and are more prone to overfitting.
- learn_rate
A number, the learning rate \(\gamma\) (default = 0.0001), which controls the size of the weight adjustment in each training epoch. In general, higher values reduce training time at the expense of less accurate results.
- input_drop
A number between 0 and 1. The probability of corruption for input columns in training mini-batches (default = 0.8). Higher values increase training time but reduce the risk of overfitting. In our experience, values between 0.7 and 0.95 deliver the best performance.
- train_batch
An integer, the number of observations in training mini-batches (default = 16).
- latent_space_size
An integer, the number of normal dimensions used to parameterize the latent space.
- cont_adj
A number, weighting the importance of continuous variables in the loss function.
- binary_adj
A number, weighting the importance of binary variables in the loss function.
- softmax_adj
A number, weighting the importance of categorical variables in the loss function.
- dropout_level
A number between 0 and 1, determining the proportion of nodes dropped to "thin" the network.
- vae_layer
Boolean, specifies whether to include a variational autoencoder layer in the network.
- vae_alpha
A number, the strength of the prior imposed on the Kullback-Leibler divergence term in the variational autoencoder loss function.
- vae_sample_var
A number, the sampling variance of the normal distributions used to parameterize the latent space.
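Taken together, a typical workflow pre-processes the data with `convert()` and then passes the resulting `midas_pre` object to `overimpute()`. The sketch below is a minimal, non-authoritative example assuming the arguments above document `rMIDAS::overimpute()`; the data frame `adult`, its column names, and the save directory are hypothetical illustrations.

```r
library(rMIDAS)

# Hypothetical toy data with continuous, binary, and categorical columns
adult <- data.frame(
  age       = c(39, 50, 38, 53, NA, 37),
  income    = c(2174, 0, NA, 0, 0, 0),
  sex       = c("Male", "Female", "Male", NA, "Female", "Male"),
  workclass = c("Private", "Self_emp", "Private", "Private", NA, "Gov")
)

# convert() returns a midas_pre object that records the binary and one-hot
# encoded categorical columns, so binary_columns and softmax_columns are
# filled in automatically (and would be overwritten if supplied).
adult_conv <- convert(adult,
                      bin_cols     = c("sex"),
                      cat_cols     = c("workclass"),
                      minmax_scale = TRUE)

# Spike in 10% additional missingness, train for 100 epochs, and report the
# loss every 25 epochs; figures are written to the (pre-existing) directory.
overimpute(adult_conv,
           spikein         = 0.1,
           training_epochs = 100,
           report_ival     = 25,
           plot_vars       = TRUE,
           spike_seed      = 123L,
           seed            = 123L,
           save_path       = "figures/",  # note the trailing "/"
           layer_structure = c(256, 256, 256),
           input_drop      = 0.8)

# If `data` were instead a plain data.frame with pre-encoded columns, the
# categorical structure would be passed explicitly as a list of lists, e.g.:
# softmax_columns = list(list("workclass_Private", "workclass_Self_emp",
#                             "workclass_Gov"))
```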