[Slow, highly recommended when the number of variables is large] Use natural evolution strategy (nes) gradient descent ran in parallel to find the best subset of variables. It is often as good as genetic algorithms but much faster so it is the recommended variable selection function to use as default. Note that this approach assumes that the inclusion of a variable does not depends on whether other variables are included (i.e. it assumes independent bernouilli distributions); this is generally not true but this approach still converge well and running it in parallel increases the probability of reaching the global optimum.
nes_var_select(
data,
formula,
parallel_iter = 3,
alpha = c(1, 5, 10),
entropy_threshold = 0.05,
popsize = 25,
lr = 0.2,
prop_ignored = 0.5,
latent_var = NULL,
search_criterion = "AICc",
n_cluster = 3,
eps = 0.01,
maxiter = 100,
family = gaussian,
ylim = NULL,
seed = NULL,
progress = TRUE,
cv_iter = 5,
cv_folds = 5,
folds = NULL,
Huber_p = 1.345,
classification = FALSE,
print = FALSE,
test_only = FALSE
)
Returns a list containing the best subset's fit, cross-validation output, latent variables and starting points.
data.frame of the dataset to be used.
Model formula. The names of latent_var
can be used in the formula to represent the latent variables. If names(latent_var
) is NULL, then L1, L2, ... can be used in the formula to represent the latent variables. Do not manually code interactions, write them in the formula instead (ex: G*E1*E2 or G:E1:E2).
number of parallel tries (Default = 3). For speed, I recommend using the number of CPU cores.
vector of the parameter for the Dirichlet distribution of the starting points (Assuming a symmetric Dirichlet distribution with only one parameter). If the vector has size N and parralel_iter=K, we use alpha[1], ..., alpha[N], alpha[1], ... , alpha[N], ... for parallel_iter 1 to K respectively. We assume a dirichlet distribution for the starting points to get a bit more variability and make sure we are not missing on a great subset of variable that doesn't converge to the global optimum with the default starting points. Use bigger values for less variability and lower values for more variability (Default = c(1,5,10)).
Entropy threshold for convergence of the population (Default = .10). The smaller the entropy is, the less diversity there is in the population, which means convergence.
Size of the population, the number of subsets of variables sampled at each iteration (Default = 25). Between 25 and 100 is generally adequate.
learning rate of the gradient descent, higher will converge faster but more likely to get stuck in local optium (Default = .2).
The proportion of the population that are given a fixed fitness value, thus their importance is greatly reduce. The higher it is, the longer it takes to converge. Highers values makes the algorithm focus more on favorizing the good subsets of variables than penalizing the bad subsets (Default = .50).
list of data.frame. The elements of the list are the datasets used to construct each latent variable. For interpretability and proper convergence, not using the same variable in more than one latent variable is highly recommended. It is recommended to set names to the list elements to prevent confusion because otherwise, the latent variables will be named L1, L2, ...
Criterion used to determine which variable subset is the best. If search_criterion="AIC"
, uses the AIC, if search_criterion="AICc"
, uses the AICc, if search_criterion="BIC"
, uses the BIC, if search_criterion="cv"
, uses the cross-validation error, if
search_criterion="cv_AUC"
, uses the cross-validated AUC, if search_criterion="cv_Huber"
, uses the Huber cross-validation error, if search_criterion="cv_L1"
, uses the L1-norm cross-validation error (Default = "AIC"). The Huber and L1-norm cross-validation errors are alternatives to the usual cross-validation L2-norm error (which the \(R^2\) is based on) that are more resistant to outliers, the lower the values the better.
Number of parallel clusters, I recommend using the number of CPU cores (Default = 1).
Threshold for convergence (.01 for quick batch simulations, .0001 for accurate results). Note that using .001 rather than .01 (default) can more than double or triple the computing time of genetic_var_select.
Maximum number of iterations.
Outcome distribution and link function (Default = gaussian).
Optional vector containing the known min and max of the outcome variable. Even if your outcome is known to be in [a,b], if you assume a Gaussian distribution, predict() could return values outside this range. This parameter ensures that this never happens. This is not necessary with a distribution that already assumes the proper range (ex: [0,1] with binomial distribution).
Optional seed.
If TRUE, shows the progress done (Default=TRUE).
Number of cross-validation iterations (Default = 5).
Number of cross-validation folds (Default = 10). Using cv_folds=NROW(data)
will lead to leave-one-out cross-validation.
Optional list of vectors containing the fold number for each observation. Bypass cv_iter and cv_folds. Setting your own folds could be important for certain data types like time series or longitudinal data.
Parameter controlling the Huber cross-validation error (Default = 1.345).
Set to TRUE if you are doing classification and cross-validation (binary outcome).
If TRUE, print the parameters of the search distribution and the entropy at each iteration. Note: Only works using Rterm.exe in Windows due to parallel clusters. (Default = FALSE).
If TRUE, only uses the first fold for training and predict the others folds; do not train on the other folds. So instead of cross-validation, this gives you train/test and you get the test R-squared as output.
if (FALSE) {
## Example
train = example_3way_3latent(250, 2, seed=777)
nes = nes_var_select(train$data, latent_var=train$latent_var,
formula=y ~ E*G*Z)
}
Run the code above in your browser using DataLab