# two-dimensional data
# load the hacide data (provides hacide.train and hacide.test)
data(hacide)
# in the following examples only a small subset of the observations is used
# (see the 'subset' argument)
dat <- hacide.train
table(dat$cls)
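# a quick look at the degree of imbalance (base R, optional check)
prop.table(table(dat$cls))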
##Example 1
# classification with a logit model
# arguments to glm are passed through control.learner
# leave-K-out cross-validation (LKOCV, here with K=5) estimate of the AUC
# of a classifier trained on balanced data
ROSE.eval(cls~., data=dat, glm, subset=c(1:50, 981:1000),
method.assess="LKOCV", K=5,
control.learner=list(family=binomial), seed=1)
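# for comparison, a minimal sketch of a holdout estimate of the AUC with the
# same logit learner; all arguments mirror calls used elsewhere in these examples
ROSE.eval(cls~., data=dat, glm, subset=c(1:50, 981:1000),
          method.assess="holdout",
          control.learner=list(family=binomial), seed=1)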
##Example 2
# classification with decision tree
# requires package rpart
library(rpart)
# a function is needed to extract the predicted probability of class 1
f.pred.rpart <- function(x) x[,2]
# holdout estimate of the AUC of two classifiers
# the first classifier is trained on a ROSE sample that keeps
# the original (unbalanced) class proportion
# proportion of rare events in the original data
p <- (table(dat$cls)/sum(table(dat$cls)))[2]
ROSE.eval(cls~., data=dat, rpart, subset=c(1:50, 981:1000),
control.rose=list(p = p), extr.pred=f.pred.rpart, seed=1)
# second classifier trained on ROSE balanced sample
# optional arguments to plot the roc.curve are passed through
# control.accuracy
ROSE.eval(cls~., data=dat, rpart, subset=c(1:50, 981:1000),
control.rose=list(p = 0.5), control.accuracy = list(add.roc = TRUE,
col = 2), extr.pred=f.pred.rpart, seed=1)
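# optional: label the two ROC curves drawn above with a base R legend
# (this assumes the first curve uses the default colour and line type,
# while the second was drawn with col = 2)
legend("bottomright", legend=c("ROSE sample, original proportion",
       "ROSE sample, balanced (p = 0.5)"), col=c(1, 2), lty=1)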
##Example 3
# classification with linear discriminant analysis
library(MASS)
# a function is needed to extract the posterior probability of class 1 from predict.lda
f.pred.lda <- function(z) z$posterior[,2]
# bootstrap estimate of precision of learner trained on balanced data
prec.distr <- ROSE.eval(cls~., data=dat, lda, subset=c(1:50, 981:1000),
extr.pred=f.pred.lda, acc.measure="precision",
method.assess="BOOT", B=100, trace=TRUE)
summary(prec.distr)
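# the bootstrap replicates can also be inspected graphically; a minimal sketch,
# assuming the B estimates are stored in the 'acc' component of the returned object
hist(prec.distr$acc, main="Bootstrap distribution of precision", xlab="precision")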
##Example 4
# compare the AUC of a neural network classifier
# with the AUC of a classification tree
# requires packages nnet and tree
library(nnet)
library(tree)
# optional arguments to nnet are passed through control.learner
ROSE.eval(cls~., data=dat, nnet, subset=c(1:50, 981:1000),
method.assess="holdout", control.learn=list(size=1), seed=1)
# optional arguments to plot the roc.curve are passed through
# control.accuracy
# a function is needed to extract the predicted probability of class 1
f.pred.tree <- function(x) x[,2]
ROSE.eval(cls~., data=dat, tree, subset=c(1:50, 981:1000),
method.assess="holdout", extr.pred=f.pred.tree,
control.accuracy=list(add.roc=TRUE, col=2), seed=1)
##Example 5
# a user-defined learner with standard behaviour
# a dummy example, for illustrative purposes only
# note that the function name and the name of the returned class match
DummyStump <- function(formula, ...)
{
  # build the model frame from the call, in the usual way for
  # formula-based learners (handles data, subset, na.action)
  mc <- match.call()
  m <- match(c("formula", "data", "na.action", "subset"), names(mc), 0L)
  mf <- mc[c(1L, m)]
  mf[[1L]] <- as.name("model.frame")
  mf <- eval(mf, parent.frame())
  data.st <- data.frame(mf)
  # the 'model' is just the name of the first predictor and a fixed threshold
  out <- list(colname=colnames(data.st)[2], threshold=1)
  class(out) <- "DummyStump"
  out
}
# associate a predict method with DummyStump
# usual S3 naming convention: predict.classname
predict.DummyStump <- function(object, newdata)
{
  # classify by comparing the selected predictor with the stored threshold
  out <- newdata[,object$colname]>object$threshold
  out
}
ROSE.eval(formula=cls~., data=dat, learner=DummyStump,
subset=c(1:50, 981:1000), method.assess="holdout", seed=3)
##Example 6
# use of a wrapper for a function with non-standard behaviour
# consider knn in package class
# requires package class
library(class)
# the wrapper requires two mandatory arguments: data and newdata;
# optional arguments can be passed on through '...'
# note that we specify data=dat in ROSE.eval, so inside the wrapper
# 'data' receives a data set structured like dat, while 'newdata' receives
# the same structure but with the class label variable dropped;
# the wrapper then passes the appropriate pieces to knn
knn.wrap <- function(data, newdata, ...)
{
  # the class label is in column 1: pass it as 'cl' and drop it from the predictors
  knn(train=data[,-1], test=newdata, cl=data[,1], ...)
}
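# a direct call to the wrapper, outside ROSE.eval, to illustrate the expected
# calling convention: 'data' includes the class label, 'newdata' does not
# (the rows selected below are arbitrary and only for illustration)
knn.wrap(data=dat[c(1:50, 981:1000), ], newdata=dat[1:5, -1], k=2)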
# optional arguments to knn.wrap may be specified in control.learner
ROSE.eval(formula=cls~., data=dat, learner=knn.wrap,
subset=c(1:50, 981:1000), method.assess="holdout",
control.learner=list(k=2, prob=TRUE), seed=1)
# if we swap the columns of dat we have to change the wrapper accordingly
dat <- dat[,c("x1","x2","cls")]
# the class label variable is now the last column
knn.wrap <- function(data, newdata, ...)
{
  # the class label is now in column 3
  knn(train=data[,-3], test=newdata, cl=data[,3], ...)
}
ROSE.eval(formula=cls~., data=dat, learner=knn.wrap,
subset=c(1:50, 981:1000), method.assess="holdout",
control.learner=list(k=2, prob=TRUE), seed=1)