- data
data.frame of the dataset to be used.
- formula
Model formula. Use E for the environmental score and G for the genetic score. Do not manually code interactions, write them in the formula instead (ex: G*E*z or G:E:z).
- interactive_mode
If TRUE, uses interactive mode. In interactive mode, at each iteration, the user is shown the AIC, BIC, p-value and also the cross-validation \(R^2\) if search_criterion="cv"
and the cross-validation AUC if search_criterion="cv_AUC"
for the best 5 variables. The user must then enter a number between 1 and 5 to select the variable to be added, entering anything else will stop the search.
- genes_original
data.frame of the variables inside the genetic score G (can be any sort of variable, doesn't even have to be genetic).
- env_original
data.frame of the variables inside the environmental score E (can be any sort of variable, doesn't even have to be environmental).
- genes_extra
data.frame of the additionnal variables to try including inside the genetic score G (can be any sort of variable, doesn't even have to be genetic). Set to NULL if using a backward search.
- env_extra
data.frame of the variables to try including inside the environmental score E (can be any sort of variable, doesn't even have to be environmental). Set to NULL if using a backward search.
- search_type
If search_type="forward"
, uses a forward search. If search_type="backward"
, uses backward search. If search_type="bidirectional-forward"
, uses bidirectional search (that starts as a forward search). If search_type="bidirectional-backward"
, uses bidirectional search (that starts as a backward search).
- search
If search="genes"
, uses a stepwise search for the genetic score variables. If search="env"
, uses a stepwise search for the environmental score variables. If search="both"
, uses a stepwise search for both the gene and environmental score variables (Default = "both").
- search_criterion
Criterion used to determine which variable is the best to add or worst to drop. If search_criterion="AIC"
, uses the AIC, if search_criterion="AICc"
, uses the AICc, if search_criterion="BIC"
, uses the BIC, if search_criterion="cv"
, uses the cross-validation error, if
search_criterion="cv_AUC"
, uses the cross-validated AUC, if search_criterion="cv_Huber"
, uses the Huber cross-validation error, if search_criterion="cv_L1"
, uses the L1-norm cross-validation error (Default = "AIC"). The Huber and L1-norm cross-validation errors are alternatives to the usual cross-validation L2-norm error (which the \(R^2\) is based on) that are more resistant to outliers, the lower the values the better.
- forward_exclude_p_bigger
If p-value > forward_exclude_p_bigger
, we do not consider the variable for inclusion in the forward steps (Default = .20). This is an exclusion option which purpose is skipping variables that are likely not worth looking to make the algorithm faster, especially with cross-validation. Set to 1 to prevent any exclusion here.
- backward_exclude_p_smaller
If p-value < backward_exclude_p_smaller
, we do not consider the variable for removal in the backward steps (Default = .01). This is an exclusion option which purpose is skipping variables that are likely not worth looking to make the algorithm faster, especially with cross-validation. Set to 0 to prevent any exclusion here.
- exclude_worse_AIC
If AIC with variable > AIC without variable, we ignore the variable (Default = TRUE). This is an exclusion option which purpose is skipping variables that are likely not worth looking to make the algorithm faster, especially with cross-validation. Set to FALSE to prevent any exclusion here.
- max_steps
Maximum number of steps taken (Default = 50).
- cv_iter
Number of cross-validation iterations (Default = 5).
- cv_folds
Number of cross-validation folds (Default = 10). Using cv_folds=NROW(data)
will lead to leave-one-out cross-validation.
- folds
Optional list of vectors containing the fold number for each observation. Bypass cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data.
- Huber_p
Parameter controlling the Huber cross-validation error (Default = 1.345).
- classification
Set to TRUE if you are doing classification (binary outcome).
- start_genes
Optional starting points for genetic score (must be the same length as the number of columns of genes
).
- start_env
Optional starting points for environmental score (must be the same length as the number of columns of env
).
- eps
Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results).
- maxiter
Maximum number of iterations.
- family
Outcome distribution and link function (Default = gaussian).
- ylim
Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution).
- seed
Seed for cross-validation folds.
- print
If TRUE, print all the steps and notes/warnings. Highly recommended unless you are batch running multiple stepwise searchs. (Default=TRUE).
- remove_miss
If TRUE, remove missing data completely, otherwise missing data is only removed when adding or dropping a variable (Default = FALSE).
- test_only
If TRUE, only uses the first fold for training and predict the others folds; do not train on the other folds. So instead of cross-validation, this gives you train/test and you get the test R-squared as output.