stepwise_search: Stepwise search for the best subset of genetic variants or environments with the LEGIT model

Description

[Fast, recommended for small number of variables] Adds the best variable or drops the worst variable one at a time in the genetic (if search="genes") or environmental score (if search="env"). You can select the desired search criterion (AIC, BIC, cross-validation error, cross-validation AUC) to determine which variable is the best/worst and should be added/dropped. Note that when the number of variables in G and E is large, this does not generally converge to the optimal subset, this function is only recommended when you have a small number of variables (e.g. 2 environments, 6 genetic variants). If using cross-validation (search_criterion="cv" or search_criterion="cv_AUC"), to prevent cross-validating with each variable (extremely slow), we recommend setting a p-value threshold (p_threshold) and forcing the algorithm not to look at models with bigger AIC (exclude_worse_AIC=TRUE).

Usage

stepwise_search(
  data,
  formula,
  interactive_mode = FALSE,
  genes_original = NULL,
  env_original = NULL,
  genes_extra = NULL,
  env_extra = NULL,
  search_type = "bidirectional-forward",
  search = "both",
  search_criterion = "AIC",
  forward_exclude_p_bigger = 0.2,
  backward_exclude_p_smaller = 0.01,
  exclude_worse_AIC = TRUE,
  max_steps = 100,
  cv_iter = 5,
  cv_folds = 10,
  folds = NULL,
  Huber_p = 1.345,
  classification = FALSE,
  start_genes = NULL,
  start_env = NULL,
  eps = 0.01,
  maxiter = 100,
  family = gaussian,
  ylim = NULL,
  seed = NULL,
  print = TRUE,
  remove_miss = FALSE,
  test_only = FALSE
)

Value

Returns an object of the class "LEGIT" which is list containing, in the following order: a glm fit of the main model, a glm fit of the genetic score, a glm fit of the environmental score, a list of the true model parameters (AIC, BIC, rank, df.residual, null.deviance) for which the individual model parts (main, genetic, environmental) don't estimate properly.

Arguments

data: data.frame of the dataset to be used.
formula: Model formula. Use E for the environmental score and G for the genetic score. Do not manually code interactions, write them in the formula instead (ex: G*E*z or G:E:z).
interactive_mode: If TRUE, uses interactive mode. In interactive mode, at each iteration, the user is shown the AIC, BIC, p-value and also the cross-validation \(R^2\) if search_criterion="cv" and the cross-validation AUC if search_criterion="cv_AUC" for the best 5 variables. The user must then enter a number between 1 and 5 to select the variable to be added, entering anything else will stop the search.
genes_original: data.frame of the variables inside the genetic score G (can be any sort of variable, doesn't even have to be genetic).
env_original: data.frame of the variables inside the environmental score E (can be any sort of variable, doesn't even have to be environmental).
genes_extra: data.frame of the additionnal variables to try including inside the genetic score G (can be any sort of variable, doesn't even have to be genetic). Set to NULL if using a backward search.
env_extra: data.frame of the variables to try including inside the environmental score E (can be any sort of variable, doesn't even have to be environmental). Set to NULL if using a backward search.
search_type: If search_type="forward", uses a forward search. If search_type="backward", uses backward search. If search_type="bidirectional-forward", uses bidirectional search (that starts as a forward search). If search_type="bidirectional-backward", uses bidirectional search (that starts as a backward search).
search: If search="genes", uses a stepwise search for the genetic score variables. If search="env", uses a stepwise search for the environmental score variables. If search="both", uses a stepwise search for both the gene and environmental score variables (Default = "both").
search_criterion: Criterion used to determine which variable is the best to add or worst to drop. If search_criterion="AIC", uses the AIC, if search_criterion="AICc", uses the AICc, if search_criterion="BIC", uses the BIC, if search_criterion="cv", uses the cross-validation error, if
search_criterion="cv_AUC", uses the cross-validated AUC, if search_criterion="cv_Huber", uses the Huber cross-validation error, if search_criterion="cv_L1", uses the L1-norm cross-validation error (Default = "AIC"). The Huber and L1-norm cross-validation errors are alternatives to the usual cross-validation L2-norm error (which the \(R^2\) is based on) that are more resistant to outliers, the lower the values the better.
forward_exclude_p_bigger: If p-value > forward_exclude_p_bigger, we do not consider the variable for inclusion in the forward steps (Default = .20). This is an exclusion option which purpose is skipping variables that are likely not worth looking to make the algorithm faster, especially with cross-validation. Set to 1 to prevent any exclusion here.
backward_exclude_p_smaller: If p-value < backward_exclude_p_smaller, we do not consider the variable for removal in the backward steps (Default = .01). This is an exclusion option which purpose is skipping variables that are likely not worth looking to make the algorithm faster, especially with cross-validation. Set to 0 to prevent any exclusion here.
exclude_worse_AIC: If AIC with variable > AIC without variable, we ignore the variable (Default = TRUE). This is an exclusion option which purpose is skipping variables that are likely not worth looking to make the algorithm faster, especially with cross-validation. Set to FALSE to prevent any exclusion here.
max_steps: Maximum number of steps taken (Default = 50).
cv_iter: Number of cross-validation iterations (Default = 5).
cv_folds: Number of cross-validation folds (Default = 10). Using cv_folds=NROW(data) will lead to leave-one-out cross-validation.
folds: Optional list of vectors containing the fold number for each observation. Bypass cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data.
Huber_p: Parameter controlling the Huber cross-validation error (Default = 1.345).
classification: Set to TRUE if you are doing classification (binary outcome).
start_genes: Optional starting points for genetic score (must be the same length as the number of columns of genes).
start_env: Optional starting points for environmental score (must be the same length as the number of columns of env).
eps: Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results).
maxiter: Maximum number of iterations.
family: Outcome distribution and link function (Default = gaussian).
ylim: Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution).
seed: Seed for cross-validation folds.
print: If TRUE, print all the steps and notes/warnings. Highly recommended unless you are batch running multiple stepwise searchs. (Default=TRUE).
remove_miss: If TRUE, remove missing data completely, otherwise missing data is only removed when adding or dropping a variable (Default = FALSE).
test_only: If TRUE, only uses the first fold for training and predict the others folds; do not train on the other folds. So instead of cross-validation, this gives you train/test and you get the test R-squared as output.

Examples

Run this code

if (FALSE) {
## Continuous example
train = example_3way(250, 2.5, seed=777)
# Forward search for genes based on BIC (in interactive mode)
forward_genes_BIC = stepwise_search(train$data, genes_extra=train$G, env_original=train$E,
formula=y ~ E*G*z,search_type="forward", search="genes", search_criterion="BIC",
interactive_mode=TRUE)
# Bidirectional-backward search for environments based on cross-validation error
bidir_backward_env_cv = stepwise_search(train$data, genes_original=train$G, env_original=train$E,
formula=y ~ E*G*z,search_type="bidirectional-backward", search="env", search_criterion="cv")
## Binary example
train_bin = example_2way(500, 2.5, logit=TRUE, seed=777)
# Forward search for genes based on cross-validated AUC (in interactive mode)
forward_genes_AUC = stepwise_search(train_bin$data, genes_extra=train_bin$G, 
env_original=train_bin$E, formula=y ~ E*G,search_type="forward", search="genes", 
search_criterion="cv_AUC", classification=TRUE, family=binomial, interactive_mode=TRUE)
# Forward search for genes based on AIC
bidir_forward_genes_AIC = stepwise_search(train_bin$data, genes_extra=train_bin$G, 
env_original=train_bin$E, formula=y ~ E*G,search_type="bidirectional-forward", search="genes", 
search_criterion="AIC", classification=TRUE, family=binomial)
}

Run the code above in your browser using DataLab