Learn R Programming

LEGIT (version 1.4.1)

elastic_net_var_select: Elastic net for variable selection in IMLEGIT model

Description

[Fast and accurate, highly recommended] Apply Elastic Net (from the glmnet package) with IMLEGIT to obtain the order of variable removal that makes the most sense. The output shows the information criterion at every step, so you can decide which variable to retain. It is significantly faster (seconds/minutes instead of hours) than all other variable selection approaches (except for stepwise) and it is very accurate. Note that, as opposed to LEGIT/IMLEGIT, the parameters of variables inside the latent variables are not L1-normalized; instead, its the main model parameters which are L1-normalized. This is needed to make elastic net works. It doesn't matter in the end, because we only care about which variables were removed and we only output the IMLEGIT models without elastic net penalization.

Usage

elastic_net_var_select(
  data,
  latent_var,
  formula,
  latent_var_searched = NULL,
  cross_validation = FALSE,
  alpha = 0.75,
  standardize = TRUE,
  lambda_path = NULL,
  lambda_mult = 1,
  lambda_min = 1e-04,
  n_lambda = 100,
  start_latent_var = NULL,
  eps = 0.001,
  maxiter = 100,
  family = gaussian,
  ylim = NULL,
  cv_iter = 5,
  cv_folds = 10,
  folds = NULL,
  Huber_p = 1.345,
  classification = FALSE,
  print = TRUE,
  test_only = FALSE
)

Value

Returns an object of the class "elastic_net_var_select" which is list containing, in the following order: the criterion at each lambda, the coefficients of the latent variables at each lambda, the fits of each IMLEGIT models for each variable retained at each lambda, and the vector of lambda used.

Arguments

data

data.frame of the dataset to be used.

latent_var

list of data.frame. The elements of the list are the datasets used to construct each latent variable. For interpretability and proper convergence, not using the same variable in more than one latent variable is highly recommended. It is recommended to set names to the list elements to prevent confusion because otherwise, the latent variables will be named L1, L2, ... (See examples below for more details)

formula

Model formula. The names of latent_var can be used in the formula to represent the latent variables. If names(latent_var) is NULL, then L1, L2, ... can be used in the formula to represent the latent variables. Do not manually code interactions, write them in the formula instead (ex: G*E1*E2 or G:E1:E2).

latent_var_searched

Optional If not null, you must specify a vector containing all indexes of the latent variables you want to use elastic net on. Ex: If latent_var=list(G=genes, E=env), specifying latent_var_search=c(1,2) will use both, latent_var_search=1 will only do it for G, and latent_var_search=2 will only do it for E.

cross_validation

(Optional) If TRUE, will return cross-validation criterion (slower, but very good criterion).

alpha

The elasticnet mixing parameter (between 0 and 1). 1 leads to lasso, 0 leads to ridge. See glmnet package manual for more information. We recommend somewhere betwen .50 and 1.

standardize

If TRUE, standardize all variables inside every latent_var component. Note that if FALSE, glmnet will still standardize and unstandardize, but it will do so for each model (i.e., when at the step of estimating the parameters of latent variable G it standardize them, apply glmnet, then unstandarize them). This means that fixed parameters in the alternating steps are not standardized when standardize=FALSE. In practice, we found that standardize=FALSE leads to weird paths that do not always make sense. In the end, we only care about the order of the variable removal from the glmnet. We highly recommend standardize=TRUE for best results.

lambda_path

Optional vector of all lambda (penalty term for elastic net, see glmnet package manual). By default, we automatically determine it.

lambda_mult

scalar which multiplies the maximum lambda (penalty term for elastic net, see glmnet package manual) from the lambda path determined automatically. Sometimes, the maximum lambda found automatically is too big or too small and you may not want to spend the effort to manually set your own lambda path. This is where this comes in, you can simply scale lambda max up or down. (Default = 1)

lambda_min

minimum lambda (penalty term for elastic net, see glmnet package manual) from the lambda path. (Default = .0001)

n_lambda

Number of lambda (penalty term for elastic net, see glmnet package manual) in lambda path. Make lower for faster training, or higher for more precision. If you have many variables, make it bigger than 100 (Default = 100).

start_latent_var

Optional list of starting points for each latent variable (The list must have the same length as the number of latent variables and each element of the list must have the same length as the number of variables of the corresponding latent variable).

eps

Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results).

maxiter

Maximum number of iterations.

family

Outcome distribution and link function (Default = gaussian).

ylim

Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution).

cv_iter

Number of cross-validation iterations (Default = 5).

cv_folds

Number of cross-validation folds (Default = 10). Using cv_folds=NROW(data) will lead to leave-one-out cross-validation.

folds

Optional list of vectors containing the fold number for each observation. Bypass cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data.

Huber_p

Parameter controlling the Huber cross-validation error (Default = 1.345).

classification

Set to TRUE if you are doing classification (binary outcome).

print

If FALSE, nothing except warnings will be printed. (Default = TRUE).

test_only

If TRUE, only uses the first fold for training and predict the others folds; do not train on the other folds. So instead of cross-validation, this gives you train/test and you get the test R-squared as output.

References

Alexia Jolicoeur-Martineau, Ashley Wazana, Eszter Szekely, Meir Steiner, Alison S. Fleming, James L. Kennedy, Michael J. Meaney, Celia M.T. Greenwood and the MAVAN team. Alternating optimization for GxE modelling with weighted genetic and environmental scores: examples from the MAVAN study (2017). arXiv:1703.08111.

Examples

Run this code
if (FALSE) {
N = 1000
train = example_3way(N, sigma=1, logit=FALSE, seed=7)
g1_bad = rbinom(N,1,.30)
g2_bad = rbinom(N,1,.30)
g3_bad = rbinom(N,1,.30)
g4_bad = rbinom(N,1,.30)
g5_bad = rbinom(N,1,.30)
train$G = cbind(train$G, g1_bad, g2_bad, g3_bad, g4_bad, g5_bad)
lv = list(G=train$G, E=train$E)
fit = elastic_net_var_select(train$data, lv, y ~ G*E)
summary(fit)
best_model(fit, criterion="BIC")
 # Instead of taking the best, if you want the model with "Model index"=17 from summary, do
plot(fit)
# With Cross-validation
fit = elastic_net_var_select(train$data, lv, y ~ G*E, cross_validation=TRUE, cv_iter=1, cv_folds=5)
best_model(fit, criterion="cv_R2")
# Elastic net only applied on G
fit = elastic_net_var_select(train$data, lv, y ~ G*E, c(1))
# Elastic net only applied on E
fit = elastic_net_var_select(train$data, lv, y ~ G*E, c(2))
# Most E variables not removed, use lambda_mult > 1 to remove more
fit = elastic_net_var_select(train$data, lv, y ~ G*E, c(2), lambda_mult=5)
# Lasso (only L1 regularization)
fit = elastic_net_var_select(train$data, lv, y ~ G*E, alpha=1)
# Want more lambdas (useful if # of variables is large)
fit = elastic_net_var_select(train$data, lv, y ~ G*E, n_lambda = 200)
}

Run the code above in your browser using DataLab