gafs(x, ...)
## Default S3 method:
gafs(x, y, iters = 10, popSize = 50, pcrossover = 0.8, pmutation = 0.1, elite = 0, suggestions = NULL, differences = TRUE, gafsControl = gafsControl(), ...)
x
gafsControl
and URL.gafsControl$functions$fit
gafs
gafs conducts a supervised binary search of the predictor space using a genetic algorithm. See Mitchell (1996) and Scrucca (2013) for more details on genetic algorithms. This function conducts the search of the feature space repeatedly within resampling iterations. First, the training data are split by whatever resampling method was specified in the control function. For example, if 10-fold cross-validation is selected, the entire genetic algorithm is conducted 10 separate times. For the first fold, nine tenths of the data are used in the search while the remaining tenth is used to estimate the external performance, since these data points were not used in the search.
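The partitioning described above can be sketched in base R. This is only an illustration of how rows are divided between the search and the external estimate, not the caret implementation; the row count and the variable names (fold_id, search_rows, external_rows) are assumptions made for the example.

```r
set.seed(1)
n <- 100   # number of training rows (made up for illustration)
k <- 10    # 10-fold cross-validation

# Assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  search_rows   <- which(fold_id != fold)  # nine tenths: available to the GA search
  external_rows <- which(fold_id == fold)  # one tenth: external performance only

  # A complete GA search would be run here on search_rows, and the best
  # subset of each generation would then be scored on external_rows.
  stopifnot(length(intersect(search_rows, external_rows)) == 0)
}

table(fold_id)
```

The key property is that the external rows never take part in the search for their fold.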
During the genetic algorithm, a measure of fitness is needed to guide the search. This is the internal measure of performance. During the search, the data that are available are the instances selected by the top-level resampling (e.g. the nine tenths mentioned above). A common approach is to conduct another resampling procedure. Another option is to use a holdout set of samples to determine the internal estimate of performance (see the holdout argument of the control function). While this is faster, it is more likely to cause overfitting of the features and should only be used when a large amount of training data is available. Yet another idea is to use a penalized metric (such as the AIC statistic), but this may not exist for some metrics (e.g. the area under the ROC curve).
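As a rough illustration of the holdout idea, the sketch below scores one candidate subset using base R only. It is not how caret computes fitness internally: the mtcars data, the linear model, the holdout size, and the helper name holdout_rmse are all assumptions chosen to keep the example small.

```r
set.seed(1)
dat <- mtcars                            # built-in data, used only for illustration
predictors <- setdiff(names(dat), "mpg")

# Hold back some rows for the internal estimate of performance
holdout_rows <- sample(nrow(dat), size = 8)
fit_rows <- setdiff(seq_len(nrow(dat)), holdout_rows)

# `individual` is a 0/1 vector with one element per predictor.
# Returns the holdout RMSE (lower is better).
holdout_rmse <- function(individual) {
  vars <- predictors[individual == 1]
  if (length(vars) == 0) return(Inf)     # an empty subset cannot be fit
  fit <- lm(reformulate(vars, response = "mpg"), data = dat[fit_rows, ])
  pred <- predict(fit, newdata = dat[holdout_rows, ])
  sqrt(mean((dat$mpg[holdout_rows] - pred)^2))
}

# Score one randomly generated individual
individual <- rbinom(length(predictors), size = 1, prob = 0.5)
holdout_rmse(individual)
```

A penalized alternative for this particular model would be to return AIC(fit) from the rows used for fitting, which avoids the holdout split but is only available for models that define a likelihood.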
The internal estimates of performance will eventually overfit the subsets to the data. However, since the external estimate is not used by the search, it is able to make better assessments of overfitting. After resampling, this function determines the optimal number of generations for the GA.
Finally, the entire data set is used in the last execution of the genetic algorithm search and the final model is built on the predictor subset that is associated with the optimal number of generations determined by resampling (although the update function can be used to manually set the number of generations).
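One way to picture the choice of the optimal number of generations is to average the external estimates over resamples and pick the best generation. The numbers below are made up for illustration, the metric is assumed to be one that is maximized, and the aggregation caret actually performs may differ in detail.

```r
# Made-up external estimates: rows = resamples, columns = generations
external <- rbind(
  Fold1 = c(0.70, 0.74, 0.76, 0.75),
  Fold2 = c(0.72, 0.73, 0.77, 0.74),
  Fold3 = c(0.69, 0.75, 0.75, 0.76)
)
colnames(external) <- paste0("gen", 1:4)

# Average over resamples, then choose the generation with the best mean
avg_by_generation <- colMeans(external)
best_generation <- which.max(avg_by_generation)

avg_by_generation
best_generation
```

In this toy matrix the third generation has the highest average, so the final search on the full data set would use the subset found at generation 3.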
This is an example of the output produced when gafsControl(verbose = TRUE) is used:
Fold2 1 0.715 (13)
Fold2 2 0.715->0.737 (13->17, 30.4%) *
Fold2 3 0.737->0.732 (17->14, 24.0%)
Fold2 4 0.737->0.769 (17->23, 25.0%) *
For the second resample (e.g. fold 2), the best subset across all individuals tested in the first generation contained 13 predictors and was associated with a fitness value of 0.715. The second generation produced a better subset containing 17 predictors with an associated fitness value of 0.737 (an improvement is symbolized by the *). The percentage listed is the Jaccard similarity between the previous best individual (with 13 predictors) and the new best. The third generation did not produce a better fitness value but the fourth generation did.
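The Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below uses hypothetical predictor names (V1, V2, ...) and assumes the two subsets share 7 predictors, which is one overlap that is consistent with the 30.4% reported for the 13 to 17 step; the actual predictors in that output are not known.

```r
# Jaccard similarity between two sets of predictor names
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

prev_best <- paste0("V", 1:13)   # 13 predictors (hypothetical names)
new_best  <- paste0("V", 7:23)   # 17 predictors, 7 of them shared with prev_best

# 7 shared / 23 in the union
round(100 * jaccard(prev_best, new_best), 1)   # 30.4
```

A low percentage therefore means that the new best subset differs substantially from the previous one, even when the subset sizes are similar.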
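The search algorithm can be parallelized in several places:
1. across the externally resampled GAs (controlled by the allowParallel option of gafsControl)
2. across the individuals within a generation of a GA (see the genParallel option in gafsControl)
3. within any inner resampling (the controls depend on the function used; see, for example, trainControl)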
It is probably best to pick one of these areas for parallelization, and the first is likely to produce the largest decrease in run-time since it is the least likely to incur multiple re-starting of the worker processes. Keep in mind that if multiple levels of parallelization occur, this can affect the number of workers and the amount of memory required exponentially.
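A minimal sketch of parallelizing only the external resamples is shown below. It assumes the caret and doParallel packages are installed and that a foreach backend is how workers are supplied; the choice of two workers is arbitrary and should be adapted to the machine.

```r
library(caret)
library(doParallel)

# Register a small cluster of workers (2 is an arbitrary choice)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# Parallelize across the external resamples only and keep the
# within-generation fitness calculations sequential
ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    number = 3,
                    allowParallel = TRUE,
                    genParallel = FALSE)

# A call such as gafs(x, y, iters = 3, gafsControl = ctrl) would go here.

stopCluster(cl)
```

Leaving genParallel at FALSE keeps parallelism at a single level, which avoids multiplying the number of workers and the memory they need.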
Scrucca L (2013). GA: A Package for Genetic Algorithms in R. Journal of Statistical Software, 53(4), 1-37. www.jstatsoft.org/v53/i04
Mitchell M (1996). An Introduction to Genetic Algorithms. MIT Press.
gafsControl, predict.gafs, caretGA, rfGA, treebagGA
## Not run:
library(caret)

set.seed(1)
train_data <- twoClassSim(100, noiseVars = 10)
test_data  <- twoClassSim(10, noiseVars = 10)

## A short example
ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    number = 3)

rf_search <- gafs(x = train_data[, -ncol(train_data)],
                  y = train_data$Class,
                  iters = 3,
                  gafsControl = ctrl)

rf_search
## End(Not run)