cv.ses: Cross-Validation for SES

Description

The function performs k-fold cross-validation to identify the best values of the SES 'max_k' and 'threshold' hyper-parameters.

Usage

cv.ses (target, dataset, kfolds = 10, folds = NULL, alphas = NULL, max_ks = NULL,
task = NULL, metric = NULL, modeler = NULL, ses_test = NULL)

Arguments

target

The target or class variable as in SES.

dataset

The dataset object as in SES.

kfolds

The number of folds in the k-fold cross-validation (integer).

folds

The folds of the data to use (a list generated by the function generateCVRuns from the TunePareto package). If NULL, the folds are created internally with the same function.

alphas

A vector of SES threshold hyper-parameter values to be used in the CV. Default is c(0.1, 0.05, 0.01).

max_ks

A vector of SES max_k hyper-parameter values to be used in the CV. Default is c(3, 2).

task

A character ("C", "R" or "S"). It can be "C" for classification (logistic, multinomial or ordinal regression), "R" for regression (robust and non robust linear regression, median regression, Poisson and negative binomial regression, beta regression), or "S" for survival analysis (Cox proportional hazards or Weibull regression).

metric

A metric function provided by the user. If NULL the following functions will be used: auc.mxm, mse.mxm, ci.mxm for classification, regression and survival analysis tasks, respectively. See details for more.

modeler

A modeling function provided by the user. If NULL the following functions will be used: glm.mxm, lm.mxm, coxph.mxm for classification, regression and survival analysis tasks, respectively. See details for more.

ses_test

A function object that defines the conditional independence test used in the SES function (see also the SES help page). If NULL, testIndLogistic, testIndFisher and censIndLR are used for classification, regression and survival analysis tasks, respectively.

Value

A list including:
cv_results_all: A list with predictions, performances and signatures for each fold and each SES configuration (e.g. cv_results_all[[3]]$performances[1] is the performance of the 1st fold with the 3rd configuration of SES).
best_performance: A numeric value that represents the best average performance.
BC_best_perf: A numeric value that represents the bias-corrected best average performance.
best_configuration: A list that corresponds to the best configuration of SES, including id, threshold (named 'a') and max_k.
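
To make the relation between best_performance and BC_best_perf concrete, here is a minimal sketch of the Tibshirani and Tibshirani (2009) bias estimate, written for a metric where higher is better. The performance matrix perf and all variable names are made up for illustration, and the code is not taken from the package, so the internal implementation may differ in its details.

```r
# Hypothetical performance matrix: rows are folds, columns are SES configurations.
# Values are illustrative only (e.g. -1 * MSE, so higher is better).
perf <- matrix(c(-1.0, -2.0,
                 -3.0, -2.0,
                 -2.0, -1.5),
               nrow = 3, byrow = TRUE)

# Average performance of each configuration across folds.
avg_perf <- colMeans(perf)
best_id <- which.max(avg_perf)
best_performance <- avg_perf[best_id]

# Bias estimate: for each fold, the gap between the best performance
# achievable within that fold and the performance of the overall best
# configuration in that fold, averaged over folds.
fold_best <- apply(perf, 1, max)
bias <- mean(fold_best - perf[, best_id])

# Bias-corrected best performance (more pessimistic than best_performance).
bc_best_perf <- best_performance - bias
```

The correction is never positive for the reported performance: the per-fold maximum is at least as large as the performance of the overall winner in that fold, so bias is non-negative.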

Details

Note that the Tibshirani and Tibshirani (2009) bias correction method is applied.

Input for metric functions:

predictions: A vector of predictions to be tested.
test_target: The actual values of the target variable, to be compared with the predictions.

The output of a metric function is a single numeric value. Higher values indicate better performance. Metrics based on error measures should be modified accordingly (e.g., by multiplying the error by -1). The metric functions that are currently supported are:

auc.mxm: "area under the receiver operating characteristic curve" metric, as provided in the package ROCR.
acc.mxm: accuracy metric.
mse.mxm: -1 * (mean squared error), for robust and non robust linear regression and median (quantile) regression.
ci.mxm: 1 - concordance index as provided in the rcorr.cens function from the Hmisc package. This is to be used with the Cox proportional hazards model only.
ciwr.mxm: concordance index as provided in the rcorr.cens function from the Hmisc package. This is to be used with the Weibull regression model only.
poisdev.mxm: Poisson regression deviance.
nbdev.mxm: Negative binomial regression deviance.
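
A user-supplied metric only needs to accept the two inputs described above and return a single number for which higher is better. The sketch below uses a hypothetical function, neg_mae (not part of the package), which returns the negative mean absolute error; the toy vectors are invented for illustration.

```r
# Hypothetical user-defined metric: negative mean absolute error.
# The sign is flipped so that higher values indicate better performance.
neg_mae <- function(predictions, test_target) {
  -mean(abs(predictions - test_target))
}

# Toy data for illustration only.
predictions <- c(1.0, 2.5, 3.0)
test_target <- c(1.5, 2.0, 3.0)

neg_mae(predictions, test_target)
```

Such a function would be passed through the metric argument, e.g. metric = neg_mae.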

Usage: metric(predictions, test_target)

Input of modelling functions:

train_target: The target variable used in the training procedure.
sign_data: The training set.
sign_test: The test set.

Modelling functions return a single vector of predictions, obtained by applying the model fitted on sign_data and train_target to sign_test. The modelling functions that are currently supported are:

glm.mxm: fits a glm for a binomial family (Classification task).
lm.mxm: fits a linear model (stats) for the regression task.
coxph.mxm: fits a cox proportional hazards regression model for the survival task.
weibreg.mxm: fits a Weibull regression model for the survival task.
rq.mxm: fits a quantile (median) regression model for the regression task.
rlm.mxm: fits a robust linear model for the regression task.
pois.mxm: fits a Poisson regression model for the regression task.
nb.mxm: fits a negative binomial regression model for the regression task.
multinom.mxm: fits a multinomial regression model for the classification task.
ordinal.mxm: fits an ordinal regression model for the classification task.
beta.mxm: fits a beta regression model for the regression task. The predicted values are mapped onto the real line using the logit transformation, so that the "mse.mxm" metric function can be used. In addition, this way the performance can be compared with the regression scenario, where the logit is applied and then a regression model is employed.

Usage: modeler(train_target, sign_data, sign_test)
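
As an illustration of this interface, the following is a minimal sketch of a user-defined modelling function based on stats::lm. The name ols_modeler and the toy data are hypothetical; the function is not part of the package, it only follows the three-argument contract described above and returns one prediction per row of sign_test.

```r
# Hypothetical user-defined modeler: ordinary least squares via stats::lm.
ols_modeler <- function(train_target, sign_data, sign_test) {
  # Coerce both sets to data frames with matching column names,
  # so that predict() can align training and test variables.
  train_df <- as.data.frame(sign_data)
  test_df <- as.data.frame(sign_test)
  names(train_df) <- names(test_df) <- paste0("x", seq_len(ncol(train_df)))
  train_df$y <- train_target
  fit <- lm(y ~ ., data = train_df)
  as.numeric(predict(fit, newdata = test_df))
}

# Toy, noise-free data for illustration only.
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- 1 + 2 * x[, 1] - x[, 2]

preds <- ols_modeler(train_target = y[1:15],
                     sign_data = x[1:15, ],
                     sign_test = x[16:20, ])
```

Such a function would be passed through the modeler argument, e.g. modeler = ols_modeler, together with a metric that suits its predictions.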

References

Tibshirani R.J., and Tibshirani R. (2009). A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics 3(2): 822-829.

Examples

library(MXM)

set.seed(1234)

#simulate a dataset with continuous data
dataset <- matrix( rnorm(100 * 100), ncol = 100 )
#the target feature is the last column of the dataset as a vector
target <- dataset[, 100]
dataset <- dataset[, -100]

#get 50 percent of the dataset as a train set
train_set <- dataset[1:50, ]
train_target <- target[1:50]

#run a 10-fold CV for the regression task
best_model <- cv.ses(target = train_target, dataset = train_set, kfolds = 10, task = "R")

#get the results
best_model$best_configuration
best_model$best_performance

#summary elements of the process. Press tab after each $ to view all the elements and
#choose the one you are interested in.
#best_model$cv_results_all[[...]]$...
#e.g.
#MSE value of the 5th fold for the 1st configuration of SES
#(mse.mxm returns -1 * MSE, hence the abs)
abs(best_model$cv_results_all[[1]]$performances[5])

best_a <- best_model$best_configuration$a
best_max_k <- best_model$best_configuration$max_k
