LEGIT (version 1.4.1)

bootstrap_var_select: Bootstrap variable selection (for IMLEGIT)

Description

[Very slow, not recommended] Creates bootstrap samples, runs a stepwise search on each of them, and reports the percentage of times that each variable was selected. This is very computationally demanding. With small sample sizes, variable selection can be unstable, and the bootstrap gives an idea of how certain we can be that a variable should or should not be included.
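The general idea can be sketched in base R. The block below is an illustration only: it uses lm() and step() on simulated data rather than IMLEGIT models, and the names dat, candidates and selection_pct are made up for this sketch, not part of LEGIT.

```r
# Illustration only: bootstrap selection frequencies with lm() and step(),
# not the IMLEGIT stepwise search that bootstrap_var_select() performs.
set.seed(1)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)   # only x1 truly affects y

candidates <- c("x1", "x2", "x3")
boot_iter <- 20                      # far too few for real use

selected <- matrix(FALSE, nrow = boot_iter, ncol = length(candidates),
                   dimnames = list(NULL, candidates))
for (b in seq_len(boot_iter)) {
  # Resample observations with replacement
  boot_dat <- dat[sample(n, n, replace = TRUE), ]
  # Backward stepwise search based on AIC on the bootstrap sample
  full <- lm(y ~ x1 + x2 + x3, data = boot_dat)
  fit <- step(full, direction = "backward", trace = 0)
  # Record which candidate variables survived the search
  selected[b, ] <- candidates %in% attr(terms(fit), "term.labels")
}

# Percentage of bootstrap samples in which each variable was selected
selection_pct <- 100 * colMeans(selected)
print(selection_pct)
```

A variable with a real effect should be selected in nearly every bootstrap sample, while noise variables should be selected much less often.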

Usage

bootstrap_var_select(
  data,
  formula,
  boot_iter = 1000,
  boot_size = NULL,
  boot_group = NULL,
  latent_var_original = NULL,
  latent_var_extra = NULL,
  search_type = "bidirectional-forward",
  search = 0,
  search_criterion = "AIC",
  forward_exclude_p_bigger = 0.2,
  backward_exclude_p_smaller = 0.01,
  exclude_worse_AIC = TRUE,
  max_steps = 100,
  start_latent_var = NULL,
  eps = 0.01,
  maxiter = 100,
  family = gaussian,
  ylim = NULL,
  seed = NULL,
  progress = TRUE,
  n_cluster = 1,
  best_subsets = 5,
  test_only = FALSE
)

Value

Returns a list of vectors containing the percentage of times that each variable was selected within each latent variable.

Arguments

data

data.frame of the dataset to be used.

formula

Model formula. The names of the latent variables (the names of the list elements of latent_var_original) can be used in the formula to represent the latent variables. If these names are NULL, then L1, L2, ... can be used in the formula to represent the latent variables. Do not manually code interactions; write them in the formula instead (e.g., G*E1*E2 or G:E1:E2).

boot_iter

Number of bootstrap samples (Default = 1000).

boot_size

Optional size of the bootstrapped samples (Default = number of observations).

boot_group

Optional vector giving the group associated with each observation. Sampling will be done by group instead of by observation (very important if you have longitudinal data). The sample sizes of the bootstrap samples might differ by up to "boot_size - maximum group size" observations.
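As a rough sketch of what sampling by group means, the base R block below resamples whole groups with replacement and keeps every row of each sampled group. This is an assumption about the general technique, not a copy of how LEGIT implements it; the names sampled_groups and rows are made up for the illustration.

```r
# Illustration only: group (cluster) bootstrap in base R.
set.seed(42)
id <- factor(rep(1:50, each = 5))    # 50 groups with 5 observations each
groups <- levels(id)

# Draw groups with replacement, then keep all rows of each drawn group
sampled_groups <- sample(groups, length(groups), replace = TRUE)
rows <- unlist(lapply(sampled_groups, function(g) which(id == g)))

length(rows)   # 250 here, because every group has the same size
```

With groups of unequal size, the number of rows obtained this way varies from one bootstrap sample to the next, which is why the bootstrap sample sizes can differ.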

latent_var_original

List of data.frame. The elements of the list are the datasets used to construct each latent variable. For interpretability and proper convergence, not using the same variable in more than one latent variable is highly recommended. Naming the list elements is recommended to prevent confusion; otherwise, the latent variables will be named L1, L2, ...

latent_var_extra

List of data.frame (with the same structure as latent_var_original) containing the additional elements to try including inside the latent variables. Set to NULL if using a backward search.

search_type

If search_type="forward", uses a forward search. If search_type="backward", uses backward search. If search_type="bidirectional-forward", uses bidirectional search (that starts as a forward search). If search_type="bidirectional-backward", uses bidirectional search (that starts as a backward search).

search

If search=0, uses a stepwise search for all latent variables. Otherwise, if search = i, uses a stepwise search on the i-th latent variable (Default = 0).

search_criterion

Criterion used to determine which variable is the best to add or worst to drop. If search_criterion="AIC", uses the AIC, if search_criterion="AICc", uses the AICc, if search_criterion="BIC", uses the BIC (Default = "AIC").
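For reference, the three criteria can be computed by hand in base R. The sketch below uses an ordinary lm() fit on mtcars and the common small-sample correction for AICc; whether LEGIT counts parameters in exactly the same way is an assumption.

```r
# Illustration only: AIC, AICc and BIC for an ordinary linear model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
ll <- logLik(fit)
k <- attr(ll, "df")    # number of estimated parameters (including sigma)
n <- nobs(fit)         # number of observations

aic  <- -2 * as.numeric(ll) + 2 * k
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction
bic  <- -2 * as.numeric(ll) + log(n) * k

c(AIC = aic, AICc = aicc, BIC = bic)
```

Lower values are better for all three. AICc and BIC penalize additional variables more heavily than AIC for this sample size, so they tend to select smaller models.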

forward_exclude_p_bigger

If p-value > forward_exclude_p_bigger, we do not consider the variable for inclusion in the forward steps (Default = .20). This is an exclusion option whose purpose is to skip variables that are unlikely to be worth examining, making the algorithm faster, especially with cross-validation. Set to 1 to prevent any exclusion here.

backward_exclude_p_smaller

If p-value < backward_exclude_p_smaller, we do not consider the variable for removal in the backward steps (Default = .01). This is an exclusion option whose purpose is to skip variables that are unlikely to be worth examining, making the algorithm faster, especially with cross-validation. Set to 0 to prevent any exclusion here.

exclude_worse_AIC

If AIC with variable > AIC without variable, we ignore the variable (Default = TRUE). This is an exclusion option whose purpose is to skip variables that are unlikely to be worth examining, making the algorithm faster, especially with cross-validation. Set to FALSE to prevent any exclusion here.

max_steps

Maximum number of steps taken (Default = 100).

start_latent_var

Optional list of starting points for each latent variable (The list must have the same length as the number of latent variables and each element of the list must have the same length as the number of variables of the corresponding latent variable).

eps

Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results; Default = .01).

maxiter

Maximum number of iterations (Default = 100).

family

Outcome distribution and link function (Default = gaussian).

ylim

Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (e.g., [0,1] with a binomial distribution).
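The effect of ylim can be pictured as clamping predictions to the known range. The base R sketch below shows the idea with made-up prediction values; the exact way LEGIT applies the bounds internally is not shown here and is an assumption.

```r
# Illustration only: restrict predictions to a known outcome range.
ylim <- c(0, 10)                 # known min and max of the outcome
pred <- c(-1.2, 3.4, 10.7)       # hypothetical raw Gaussian predictions

# Values below ylim[1] are raised to it, values above ylim[2] are lowered to it
pred_clamped <- pmin(pmax(pred, ylim[1]), ylim[2])
pred_clamped
```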

seed

Optional seed for bootstrap.

progress

If TRUE, shows the progress done (Default=TRUE).

n_cluster

Number of parallel clusters; using the number of CPU cores minus 1 is recommended (Default = 1).
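One way to obtain the recommended value is with parallel::detectCores() from base R. The sketch below guards against the cases where the number of cores cannot be determined or the machine has a single core.

```r
# Pick "number of CPU cores - 1", but never fewer than 1
cores <- parallel::detectCores()
n_cluster <- if (is.na(cores)) 1L else max(1L, cores - 1L)
n_cluster
```

The resulting value can then be passed as the n_cluster argument.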

best_subsets

If best_subsets = k, the output will show the k most frequently chosen subsets of variables (Default = 5).
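To make "most frequently chosen subsets" concrete, the base R sketch below tabulates subsets from a small hand-made selection matrix (one row per bootstrap sample). The matrix and the names selected and top_subsets are hypothetical and do not reflect the actual structure of the function's output.

```r
# Illustration only: count how often each subset of variables was chosen.
selected <- rbind(
  c(TRUE, TRUE,  FALSE),
  c(TRUE, FALSE, FALSE),
  c(TRUE, TRUE,  FALSE),
  c(TRUE, TRUE,  TRUE),
  c(TRUE, TRUE,  FALSE)
)
colnames(selected) <- c("x1", "x2", "x3")

# Turn each row into a label such as "x1 + x2"
subset_labels <- apply(selected, 1,
                       function(r) paste(names(r)[r], collapse = " + "))

# Keep the k most frequent subsets
k <- 2
top_subsets <- head(sort(table(subset_labels), decreasing = TRUE), k)
top_subsets
```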

test_only

If TRUE, only the first fold is used for training and the other folds are predicted; the model is not trained on the other folds. So instead of cross-validation, this gives you a train/test split and you get the test R-squared as output.

References

Peter C Austin and Jack V Tu. Bootstrap Methods for Developing Predictive Models (2012). dx.doi.org/10.1198/0003130043277.

Mark Reiser, Lanlan Yao, Xiao Wang, Jeanne Wilcox and Shelley Gray. A Comparison of Bootstrap Confidence Intervals for Multi-level Longitudinal Data Using Monte-Carlo Simulation (2017). 10.1007/978-981-10-3307-0_17.

Examples

if (FALSE) {
  ## Example
  train = example_3way_3latent(250, 2, seed=777)
  # Bootstrap with bidirectional-backward search for everything based on AIC
  # Normally you should use a lot more than 10 iterations and extra CPUs (n_cluster)
  boot = bootstrap_var_select(train$data, latent_var_extra=NULL,
    latent_var_original=train$latent_var,
    formula=y ~ E*G*Z, search_type="bidirectional-backward", search=0,
    search_criterion="AIC", boot_iter=10, n_cluster=1)
  # Assuming it's longitudinal with 5 timepoints, even though it's not
  id = factor(rep(1:50, each=5))
  boot_longitudinal = bootstrap_var_select(train$data, latent_var_extra=NULL,
    latent_var_original=train$latent_var,
    formula=y ~ E*G*Z, search_type="bidirectional-backward", search=0,
    search_criterion="AIC", boot_iter=10, n_cluster=1, boot_group=id)
}