hyperSMURF (version 2.0)

hyperSMURF.cv: hyperSMURF cross-validation

Description

Automated cross-validation of hyperSMURF (hyper-ensemble of SMOTE Undersampled Random Forests)

Usage

hyperSMURF.cv(data, y, kk = 5, n.part = 10, fp = 1, ratio = 1,
              k = 5, ntree = 10, mtry = 5, cutoff = c(0.5, 0.5), thresh = FALSE,
              seed = 0, fold.partition = NULL, file = "")

Arguments

data

a data frame or matrix with the data

y

a factor with the labels: 0 for the majority class, 1 for the minority class.

kk

number of folds (def. 5)

n.part

number of partitions (def. 10)

fp

multiplicative factor for the SMOTE oversampling of the minority class. If fp < 1 no oversampling is performed.

ratio

ratio of the number of majority class examples to minority class examples (#majority/#minority) used to undersample the majority class (def. 1)

k

number of nearest neighbours used for the SMOTE oversampling (def. 5)

ntree

number of trees of each base random forest learner (def. 10)

mtry

number of features randomly selected at each split of the decision trees of each base random forest (def. 5)

cutoff

a numeric vector of length 2 with the cutoffs for, respectively, the majority and the minority class. This parameter is meaningful only when the thresholded version of hyperSMURF is used (thresh = TRUE)

thresh

logical. If TRUE the thresholded version of hyperSMURF is executed (def. FALSE)

seed

initialization seed for the random number generator. If set to 0 (def.) no initialization is performed

fold.partition

vector of size nrow(data) with values in the interval [0, kk). The values indicate the cross-validation fold of each example. If NULL (default) the folds are randomly generated (a sketch of how such a vector can be built by hand follows the argument list).

file

name of the file where the cross-validated hyperSMURF models will be saved. If file=="" (def.) no model is saved.
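
For illustration, the following sketch (not part of the hyperSMURF package) shows one way to build a class-stratified fold.partition vector with values in [0, kk); the helper name make.folds is hypothetical:

# Hypothetical helper (not provided by hyperSMURF): builds a class-stratified
# fold assignment with one value in 0, ..., kk-1 per example.
make.folds <- function(y, kk = 5) {
  folds <- integer(length(y))
  for (cl in levels(factor(y))) {
    idx <- which(y == cl)
    folds[idx] <- rep(0:(kk - 1), length.out = length(idx))[sample(length(idx))]
  }
  folds
}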

Value

a vector with the cross-validated hyperSMURF probabilities (hyperSMURF scores).

Details

The cross-validation is performed either by randomly constructing the folds (fold.partition = NULL) or by using the set of predefined folds given in the parameter fold.partition. The base random forests are trained and tested in sequence: for each training set constructed at each step of the cross-validation, a separate random forest is trained on each of the n.part partitions of the data, by oversampling the minority class (parameter fp) and undersampling the majority class (parameter ratio). The random forest parameters ntree and mtry are the same for all the random forests of the hyper-ensemble.
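
As a minimal sketch of a complete cross-validation run with predefined folds (all argument values are illustrative; make.folds is the hypothetical helper sketched after the argument list above):

# Illustrative only: 5-fold CV with 10 partitions per fold and a
# majority/minority ratio of 2 in each undersampled partition.
d <- imbalanced.data.generator(n.pos = 20, n.neg = 600, sd = 0.3)
folds <- make.folds(d$labels, kk = 5)
scores <- hyperSMURF.cv(d$data, d$labels, kk = 5, n.part = 10, fp = 1, ratio = 2,
                        k = 5, ntree = 10, mtry = 5, seed = 1,
                        fold.partition = folds)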

References

M. Schubach, M. Re, P.N. Robinson and G. Valentini. Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Scientific Reports, 7:2959, 2017.

See Also

hyperSMURF.train, hyperSMURF.test

Examples

# generate a small synthetic imbalanced data set (10 positives, 300 negatives)
d <- imbalanced.data.generator(n.pos=10, n.neg=300, sd=0.3);
# 2-fold cross-validation of hyperSMURF with 3 partitions per fold
res <- hyperSMURF.cv(d$data, d$labels, kk=2, n.part=3, fp=1, ratio=1, k=3, ntree=7,
                     mtry=2, seed=1, fold.partition=NULL);
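
As a quick follow-up (not part of the original example), the cross-validated scores can be summarised by class; the minority class (label 1) is expected to receive higher average scores:

# mean cross-validated hyperSMURF score per class
tapply(res, d$labels, mean)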
