hyperSMURF.train: hyperSMURF training

Description

Trains a hyperSMURF model on a given data set. The training data are partitioned, and a separate random forest (RF) is trained on each partition after SMOTE oversampling of the positives (minority class examples) and undersampling of the negatives (majority class examples). The RFs are trained sequentially.

Usage

hyperSMURF.train(data, y, n.part = 10, fp = 1, ratio = 1, k = 5, ntree = 10, 
                 mtry = 5, cutoff = c(0.5, 0.5), seed = 0, file = "")

Arguments

data

a data frame or matrix with the training data. Rows: examples; columns: features

y

a factor with the labels. 0: majority class, 1: minority class.

n.part

number of partitions (def. 10)

fp

multiplicative factor for the SMOTE oversampling of the minority class (def. 1). If fp<1 no oversampling is performed.

ratio

ratio of the number of majority class examples to the number of minority class examples used for training (def. 1)

k

number of nearest neighbours used for SMOTE oversampling (def. 5)

ntree

number of trees of the base learner random forest (def. 10)

mtry

number of features randomly selected by each decision tree of each base random forest (def. 5)

cutoff

a numeric vector of length 2 with the cutoffs for the majority and the minority class, respectively (def. c(0.5, 0.5)). This parameter is meaningful when used with the thresholded version of hyperSMURF (parameter thresh=TRUE)

seed

initialization seed for the random generator. If set to 0 (def.) no initialization is performed

file

name of the file where the cross-validated hyperSMURF models will be saved. If file=="" (def.) no model is saved.

Value

A list of trained RF models. Each element of the list is a randomForest object from the package of the same name.

Details

A different random forest is trained on each partition of the training set. If npos and nneg are the numbers of positive and negative examples, respectively, then for each partition of the training data fp*npos new synthetic positives constructed by the SMOTE algorithm are added to the training set. The number of negatives is set to ratio*(fp*npos + npos). If not enough negatives are available in the partition, all the negatives in the partition are used to train the base RF associated with that partition.
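
The counts described above can be worked through with a small sketch in base R. The helper partition.counts and its argument nneg.part (the number of negatives available in one partition) are hypothetical names introduced only for illustration; they are not part of the hyperSMURF package. The sketch also assumes that when fp<1 no synthetic positives are added and the number of negatives is computed from the original positives alone.

```r
# Hypothetical helper, not part of the hyperSMURF package: computes the
# per-partition sample counts described in Details.
partition.counts <- function(npos, nneg.part, fp = 1, ratio = 1) {
  # Synthetic positives generated by SMOTE (assumed to be 0 when fp < 1)
  n.synth <- if (fp < 1) 0 else fp * npos
  # Positives used for training: original plus synthetic
  n.pos.total <- npos + n.synth
  # Negatives requested by the undersampling step
  n.neg.wanted <- ratio * n.pos.total
  # If the partition holds fewer negatives than requested, use all of them
  n.neg.used <- min(n.neg.wanted, nneg.part)
  c(synthetic = n.synth, positives = n.pos.total, negatives = n.neg.used)
}

# Enough negatives in the partition: 2 * (1 * 20 + 20) = 80 negatives are used
enough <- partition.counts(npos = 20, nneg.part = 200, fp = 1, ratio = 2)

# Only 50 negatives in the partition: all 50 are used
scarce <- partition.counts(npos = 20, nneg.part = 50, fp = 1, ratio = 2)
```

With these inputs, enough holds 20 synthetic positives, 40 positives in total and 80 negatives, while scarce is capped at the 50 negatives available in the partition.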

References

M. Schubach, M. Re, P.N. Robinson and G. Valentini. Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Scientific Reports, Nature Publishing, 7:2959, 2017.

Examples

# NOT RUN {
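library(hyperSMURF)
# Generate an imbalanced synthetic data set: 20 positives and 1000 negatives
train <- imbalanced.data.generator(n.pos = 20, n.neg = 1000,
          n.features = 10, n.inf.features = 2, sd = 1, seed = 1)
# Train one random forest per partition (5 partitions)
HSmodel <- hyperSMURF.train(train$data, train$label, n.part = 5, fp = 1, ratio = 2)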
# }
