
hybridEnsemble (version 1.7.9)

CVhybridEnsemble: Five times twofold cross-validation for the Hybrid Ensemble function

Description

CVhybridEnsemble cross-validates hybridEnsemble using five times twofold cross-validation and computes performance statistics that can be plotted (plot.CVhybridEnsemble) and summarized (summary.CVhybridEnsemble).
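As a rough illustration of what five times twofold cross-validation means, the sketch below builds the ten train/test splits in base R. The helper name make_5x2_folds is hypothetical, and plain random halving is an assumption; the package may split the data differently (for example, stratified by class).

```r
# Hypothetical helper: index pairs for five times twofold cross-validation
make_5x2_folds <- function(n, repeats = 5) {
  folds <- list()
  for (r in seq_len(repeats)) {
    idx <- sample.int(n)                  # fresh random permutation per repetition
    half <- idx[seq_len(floor(n / 2))]    # first half of the permutation
    other <- setdiff(idx, half)           # remaining observations
    # Each half serves once as training set and once as test set
    folds[[length(folds) + 1]] <- list(train = half, test = other)
    folds[[length(folds) + 1]] <- list(train = other, test = half)
  }
  folds
}

set.seed(1)
folds <- make_5x2_folds(200)
length(folds)  # 5 repetitions x 2 folds
```

Each of the ten splits yields one performance evaluation, which is why the returned statistics are summarized over folds.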

Usage

CVhybridEnsemble(
  x = NULL,
  y = NULL,
  algorithms = c("LR", "RF", "AB", "KF", "NN", "SV", "RoF", "KN", "NB"),
  combine = NULL,
  eval.measure = "auc",
  diversity = FALSE,
  parallel = FALSE,
  verbose = FALSE,
  oversample = TRUE,
  calibrate = FALSE,
  filter = 0.03,
  LR.size = 10,
  RF.ntree = 500,
  AB.iter = 500,
  AB.maxdepth = 3,
  KF.cp = 1,
  KF.rp = round(log(nrow(x), 10)),
  KF.ntree = 500,
  NN.rang = 0.1,
  NN.maxit = 10000,
  NN.size = c(5, 10, 20),
  NN.decay = c(0, 0.001, 0.01, 0.1),
  NN.skip = c(TRUE, FALSE),
  NN.ens.size = 10,
  SV.gamma = 2^(-15:3),
  SV.cost = 2^(-5:13),
  SV.degree = c(2, 3),
  SV.kernel = c("radial", "sigmoid", "linear", "polynomial"),
  SV.size = 10,
  RoF.L = 10,
  KN.K = c(1:150),
  KN.size = 10,
  NB.size = 10,
  rbga.popSize = length(algorithms) * 14,
  rbga.iters = 500,
  rbga.mutationChance = 1/rbga.popSize,
  rbga.elitism = max(1, round(rbga.popSize * 0.05)),
  DEopt.nP = 20,
  DEopt.nG = 500,
  DEopt.F = 0.9314,
  DEopt.CR = 0.6938,
  GenSA.maxit = 500,
  GenSA.temperature = 0.5,
  GenSA.visiting.param = 2.7,
  GenSA.acceptance.param = -5,
  GenSA.max.call = 1e+07,
  malschains.popsize = 60,
  malschains.ls = "cmaes",
  malschains.istep = 300,
  malschains.effort = 0.5,
  malschains.alpha = 0.5,
  malschains.threshold = 1e-08,
  malschains.maxEvals = 500,
  psoptim.maxit = 500,
  psoptim.maxf = Inf,
  psoptim.abstol = -Inf,
  psoptim.reltol = 0,
  psoptim.s = 40,
  psoptim.k = 3,
  psoptim.p = 1 - (1 - 1/psoptim.s)^psoptim.k,
  psoptim.w = 1/(2 * log(2)),
  psoptim.c.p = 0.5 + log(2),
  psoptim.c.g = 0.5 + log(2),
  soma.pathLength = 3,
  soma.stepLength = 0.11,
  soma.perturbationChance = 0.1,
  soma.minAbsoluteSep = 0,
  soma.minRelativeSep = 0.001,
  soma.nMigrations = 500,
  soma.populationSize = 10,
  tabu.iters = 500,
  tabu.listSize = c(5:12)
)

Value

A list of class CVhybridEnsemble containing the following elements:

MEAN

For the simple mean combination method: A list containing the median and interquartile range of the performance evaluations, the performance evaluations on each fold, and the prediction and response vectors for each fold.

AUTHORITY

For the authority combination method: A list containing the median and interquartile range of the performance evaluations, the performance evaluations on each fold, and the prediction and response vectors for each fold.

SB

For the single best: A list containing the median and interquartile range of the performance evaluations, the performance evaluations on each fold, and the prediction and response vectors for each fold.

...and a corresponding element for each additional combination method that is requested.

eval.measure

The performance measure that was used.

diversity

Data frame containing the diversity (1 minus the absolute value of the mean of the pairwise correlations), and the mean AUC and accuracy (threshold = 0.5) of the hybrid ensemble and the sub-ensembles.
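The diversity definition above can be sketched in a few lines of base R. This is an illustration only: the helper name diversity_score is hypothetical, and Pearson correlation between prediction vectors is assumed because the type of correlation is not stated here.

```r
# Hypothetical helper: diversity = 1 - |mean of pairwise correlations|
diversity_score <- function(pred_matrix) {
  cors <- cor(pred_matrix)             # correlation between every pair of columns
  pairwise <- cors[upper.tri(cors)]    # each pair once, diagonal excluded
  1 - abs(mean(pairwise))
}

set.seed(42)
base <- runif(100)
# Simulated predictions: two similar members and one unrelated member
preds <- cbind(m1 = base + rnorm(100, sd = 0.05),
               m2 = base + rnorm(100, sd = 0.05),
               m3 = runif(100))
diversity_score(preds)
```

Members with identical predictions give a diversity of 0, while uncorrelated members push the value toward 1.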

Arguments

x

A data frame of predictors. Categorical variables need to be transformed to binary (dummy) factors.
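One possible way to prepare such a data frame is sketched below using base R's model.matrix. The column names are made up for illustration, and the choice of reference coding (one level dropped) is an assumption, not a requirement stated on this page.

```r
# Toy data frame with one categorical predictor (hypothetical column names)
df <- data.frame(age = c(25, 40, 31, 58),
                 region = factor(c("north", "south", "east", "south")))

# model.matrix() expands 'region' into 0/1 indicator columns;
# the intercept column it adds is dropped
dummies <- model.matrix(~ region, data = df)[, -1, drop = FALSE]

# Store each indicator as a two-level factor
dummy_factors <- as.data.frame(lapply(as.data.frame(dummies),
                                      function(col) factor(col, levels = c(0, 1))))

x <- cbind(df["age"], dummy_factors)
str(x)
```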

y

A factor of observed class labels (responses) with the only allowed values {0,1}.

algorithms

Which algorithms to use: {"LR","RF","AB","KF","NN","SV","RoF","KN","NB"}. LR = Bagged Logistic Regression, RF = Random Forest, AB = AdaBoost, KF = Kernel Factory, NN = Bagged Neural Network, SV = Bagged Support Vector Machines, RoF = Rotation Forest, KN = Bagged K-Nearest Neighbors, NB = Bagged Naive Bayes.

combine

Additional methods for combining the sub-ensembles. The simple mean, authority-based weighting and the single best are automatically provided since they are very efficient. Possible additional methods: Genetic Algorithm: "rbga", Differential Evolutionary Algorithm: "DEopt", Generalized Simulated Annealing: "GenSA", Memetic Algorithm with Local Search Chains: "malschains", Particle Swarm Optimization: "psoptim", Self-Organising Migrating Algorithm: "soma", Tabu Search Algorithm: "tabu", Non-negative binomial likelihood: "NNloglik", Goldfarb-Idnani Non-negative least squares: "GINNLS", Lawson-Hanson Non-negative least squares: "LHNNLS".

eval.measure

Evaluation measure for the following combination methods: authority-based method, single best, "rbga", "DEopt", "GenSA", "malschains", "psoptim", "soma", "tabu". Default is the area under the receiver operating characteristic curve ('auc'). The area under the sensitivity curve ('sens') and the area under the specificity curve ('spec') are also supported.

diversity

TRUE or FALSE. If TRUE, sets predict.all=TRUE in hybridEnsemble and computes diversity at the sub-ensemble level and at the hybrid (i.e., meta-ensemble) level. Diversity is defined as 1 minus the absolute value of the mean of the pairwise correlations. The AUC will also be provided. For the AUC of the meta-ensemble, the simple mean is used.

parallel

TRUE or FALSE. Should the cross-validation be executed in parallel? If TRUE, all available cores will be used.

verbose

TRUE or FALSE. Should information be printed to the screen while estimating the Hybrid Ensemble?

oversample

TRUE or FALSE. Should oversampling be used? Setting oversample to TRUE helps avoid computational problems related to the subsetting process.

calibrate

TRUE or FALSE. If FALSE, percentile ranks of the prediction vectors will be used.
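To make the notion of percentile ranks concrete, here is a minimal base R sketch. The helper name percentile_rank and the tie-handling rule are assumptions; the exact computation used by the package is not described on this page.

```r
# Hypothetical helper: convert raw scores to percentile ranks in (0, 1]
percentile_rank <- function(p) rank(p, ties.method = "average") / length(p)

raw <- c(0.91, 0.15, 0.40, 0.40, 0.77)
percentile_rank(raw)
# → 1.0 0.2 0.5 0.5 0.8
```

Only the ordering of the scores survives this transformation, so prediction vectors from different models end up on the same scale.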

filter

Either NULL (deactivate) or a percentage, expressed as a fraction, denoting the minimum class size of dummy predictors. This parameter is used to remove near constants. For example, if nrow(xTRAIN)=100 and filter=0.01, then all dummy predictors with any class size equal to 1 will be removed. Set this higher (e.g., 0.05 or 0.10) in case of errors.
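The filtering rule can be approximated as follows. This is a sketch under stated assumptions rather than the package's own code: the helper name filter_near_constants is hypothetical, and the boundary rule (a class size at or below filter * nrow is removed) is inferred from the example with 100 rows and filter=0.01.

```r
# Hypothetical helper: drop dummy (0/1) columns whose smallest class is too rare
filter_near_constants <- function(x, filter = 0.03) {
  if (is.null(filter)) return(x)                # NULL deactivates the filter
  min_size <- filter * nrow(x)                  # threshold expressed in rows
  keep <- vapply(x, function(col) {
    vals <- unique(col)
    is_dummy <- length(vals) <= 2 && all(vals %in% c(0, 1))
    if (!is_dummy) return(TRUE)                 # only dummy predictors are filtered
    counts <- table(factor(col, levels = c(0, 1)))
    min(counts) > min_size                      # assumption: sizes at the threshold are removed
  }, logical(1))
  x[, keep, drop = FALSE]
}

x <- data.frame(rare = c(1, rep(0, 99)),         # class size 1 out of 100 rows
                common = rep(c(0, 1), 50),       # balanced dummy
                numeric_var = seq(0.5, 50, by = 0.5))
names(filter_near_constants(x, filter = 0.01))
# → "common" "numeric_var"
```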

LR.size

Logistic Regression parameter. Ensemble size of the bagged logistic regression sub-ensemble.

RF.ntree

Random Forest parameter. Number of trees to grow.

AB.iter

Stochastic AdaBoost parameter. Number of boosting iterations to perform.

AB.maxdepth

Stochastic AdaBoost parameter. The maximum depth of any node of the final tree, with the root node counted as depth 0.

KF.cp

Kernel Factory parameter. The number of column partitions.

KF.rp

Kernel Factory parameter. The number of row partitions.

KF.ntree

Kernel Factory parameter. Number of trees to grow.

NN.rang

Neural Network parameter. Initial random weights on [-rang, rang].

NN.maxit

Neural Network parameter. Maximum number of iterations.

NN.size

Neural Network parameter. Number of units in the single hidden layer. Can be multiple values that need to be optimized.

NN.decay

Neural Network parameter. Weight decay. Can be multiple values that need to be optimized.

NN.skip

Neural Network parameter. Switch to add skip-layer connections from input to output. Can be a logical vector (TRUE and FALSE) for optimization.

NN.ens.size

Neural Network parameter. Ensemble size of the neural network sub-ensemble.

SV.gamma

Support Vector Machines parameter. Width of the Gaussian for radial basis and sigmoid kernel. Can be multiple values that need to be optimized.

SV.cost

Support Vector Machines parameter. Penalty (soft margin constant). Can be multiple values that need to be optimized.

SV.degree

Support Vector Machines parameter. Degree of the polynomial kernel. Can be multiple values that need to be optimized.

SV.kernel

Support Vector Machines parameter. Kernels to try. Can be one or more of: 'radial', 'sigmoid', 'linear', 'polynomial'. Can be multiple values that need to be optimized.

SV.size

Support Vector Machines parameter. Ensemble size of the SVM sub-ensemble.

RoF.L

Rotation Forest parameter. Number of trees to grow.

KN.K

K-Nearest Neighbors parameter. Number of nearest neighbors to try, for example c(10,20,30). The optimal K will be selected. If larger than nrow(xTRAIN), the maximum K will be reset to 50% of nrow(xTRAIN). Can be multiple values that need to be optimized.

KN.size

K-Nearest Neighbors parameter. Ensemble size of the K-nearest neighbor sub-ensemble.

NB.size

Naive Bayes parameter. Ensemble size of the bagged naive bayes sub-ensemble.

rbga.popSize

Genetic Algorithm parameter. Population size. Default is 14 times the number of variables.

rbga.iters

Genetic Algorithm parameter. Number of iterations.

rbga.mutationChance

Genetic Algorithm parameter. The chance that a gene in the chromosome mutates.

rbga.elitism

Genetic Algorithm parameter. Number of chromosomes that are kept into the next generation.
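The three derived Genetic Algorithm defaults in the Usage section depend on one another. Evaluating them for the default set of nine algorithms gives the following values (plain arithmetic on the default expressions, nothing package-specific):

```r
algorithms <- c("LR", "RF", "AB", "KF", "NN", "SV", "RoF", "KN", "NB")

rbga.popSize <- length(algorithms) * 14               # 9 * 14 = 126
rbga.mutationChance <- 1 / rbga.popSize               # 1 / 126, roughly 0.0079
rbga.elitism <- max(1, round(rbga.popSize * 0.05))    # round(6.3) = 6

c(popSize = rbga.popSize,
  mutationChance = rbga.mutationChance,
  elitism = rbga.elitism)
```

Passing a shorter algorithms vector shrinks the population size and, through it, the other two defaults.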

DEopt.nP

Differential Evolutionary Algorithm parameter. Population size.

DEopt.nG

Differential Evolutionary Algorithm parameter. Number of generations.

DEopt.F

Differential Evolutionary Algorithm parameter. Step size.

DEopt.CR

Differential Evolutionary Algorithm parameter. Probability of crossover.

GenSA.maxit

Generalized Simulated Annealing parameter. Maximum number of iterations.

GenSA.temperature

Generalized Simulated Annealing parameter. Initial value for temperature.

GenSA.visiting.param

Generalized Simulated Annealing parameter. Parameter for visiting distribution.

GenSA.acceptance.param

Generalized Simulated Annealing parameter. Parameter for acceptance distribution.

GenSA.max.call

Generalized Simulated Annealing parameter. Maximum number of calls of the objective function.

malschains.popsize

Memetic Algorithm with Local Search Chains parameter. Population size.

malschains.ls

Memetic Algorithm with Local Search Chains parameter. Local search method.

malschains.istep

Memetic Algorithm with Local Search Chains parameter. Number of iterations of the local search.

malschains.effort

Memetic Algorithm with Local Search Chains parameter. Value between 0 and 1. The ratio between the number of evaluations for the local search and for the evolutionary algorithm. A higher effort means more evaluations for the evolutionary algorithm.

malschains.alpha

Memetic Algorithm with Local Search Chains parameter. Crossover BLX-alpha. Lower values (<0.3) reduce diversity and a higher value increases diversity.

malschains.threshold

Memetic Algorithm with Local Search Chains parameter. Threshold that defines how much improvement in the local search is considered to be no improvement.

malschains.maxEvals

Memetic Algorithm with Local Search Chains parameter. Maximum number of evaluations.

psoptim.maxit

Particle Swarm Optimization parameter. Maximum number of iterations.

psoptim.maxf

Particle Swarm Optimization parameter. Maximum number of function evaluations.

psoptim.abstol

Particle Swarm Optimization parameter. Absolute convergence tolerance.

psoptim.reltol

Particle Swarm Optimization parameter. Tolerance for restarting.

psoptim.s

Particle Swarm Optimization parameter. Swarm size.

psoptim.k

Particle Swarm Optimization parameter. Exponent for calculating number of informants.

psoptim.p

Particle Swarm Optimization parameter. Average percentage of informants for each particle.

psoptim.w

Particle Swarm Optimization parameter. Exploitation constant.

psoptim.c.p

Particle Swarm Optimization parameter. Local exploration constant.

psoptim.c.g

Particle Swarm Optimization parameter. Global exploration constant.
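Several Particle Swarm Optimization defaults in the Usage section are given as expressions. Evaluating them in base R shows the numeric values they resolve to when psoptim.s and psoptim.k are left at their defaults:

```r
psoptim.s <- 40
psoptim.k <- 3

psoptim.p <- 1 - (1 - 1 / psoptim.s)^psoptim.k   # 1 - 0.975^3, about 0.0731
psoptim.w <- 1 / (2 * log(2))                    # about 0.7213
psoptim.c.p <- 0.5 + log(2)                      # about 1.1931
psoptim.c.g <- 0.5 + log(2)                      # same value as psoptim.c.p

round(c(p = psoptim.p, w = psoptim.w, c.p = psoptim.c.p, c.g = psoptim.c.g), 4)
```

Because psoptim.p is computed from psoptim.s and psoptim.k, changing the swarm size or the exponent changes the default average percentage of informants as well.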

soma.pathLength

Self-Organising Migrating Algorithm parameter. Distance (towards the leader) that individuals may migrate.

soma.stepLength

Self-Organising Migrating Algorithm parameter. Granularity at which potential steps are evaluated.

soma.perturbationChance

Self-Organising Migrating Algorithm parameter. Probability that individual parameters are changed on any given step.

soma.minAbsoluteSep

Self-Organising Migrating Algorithm parameter. Smallest absolute difference between maximum and minimum cost function values. Below this minimum the algorithm will terminate.

soma.minRelativeSep

Self-Organising Migrating Algorithm parameter. Smallest relative difference between maximum and minimum cost function values. Below this minimum the algorithm will terminate.

soma.nMigrations

Self-Organising Migrating Algorithm parameter. Maximum number of migrations to complete.

soma.populationSize

Self-Organising Migrating Algorithm parameter. Population size.

tabu.iters

Tabu Search Algorithm parameter. Number of iterations in the preliminary search of the algorithm.

tabu.listSize

Tabu Search Algorithm parameter. Tabu list size.

Author

Michel Ballings, Dauwe Vercamer, Matthias Bogaert, and Dirk Van den Poel, Maintainer: Michel.Ballings@GMail.com

References

Ballings, M., Vercamer, D., Bogaert, M., Van den Poel, D.

See Also

hybridEnsemble, predict.hybridEnsemble, importance.hybridEnsemble, plot.CVhybridEnsemble, summary.CVhybridEnsemble

Examples

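library(hybridEnsemble)
data(Credit)

# Wrapped in if (FALSE) so that it is not run by default
if (FALSE) {
  # Use the first 200 rows and keep only the numeric predictors
  x <- Credit[1:200, names(Credit) != "Response"]
  x <- x[, sapply(x, is.numeric)]

  # Reduced settings and single-value tuning grids for a quicker run
  CVhE <- CVhybridEnsemble(x = x,
                           y = Credit$Response[1:200],
                           verbose = TRUE,
                           KF.rp = 1,
                           RF.ntree = 50,
                           AB.iter = 50,
                           NN.size = 5,
                           NN.decay = 0,
                           SV.gamma = 2^-15,
                           SV.cost = 2^-5,
                           SV.degree = 2,
                           SV.kernel = "radial")
}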
