ipred (version 0.4-6)

errorest: Estimators for the Prediction Error

Description

Resampling based estimates of prediction error (misclassification or mean squared error).

Usage

errorest(formula, data, subset, na.action, model=NULL, predict=NULL,
         iclass=NULL, estimator=c("cv", "boot", "632plus"), 
         est.para=list(k = 10, nboot = 25), ...)

Arguments

formula
formula. Either describing the model of explanatory and response variables in the usual way (see lm) or the model between explanatory and intermediate variables in the framework of indirect classification (see inclass).
data
data frame containing the variables in the model formula and additionally the class membership variable if model = inclass. data is required for indirect classification, otherwise it is optional.
subset
optional vector, specifying a subset of observations to be used.
na.action
function. Indicates what should happen when the data contain NAs.
model
function. Modelling technique whose error rate is to be estimated. The parameter na.action is evaluated in the modelling process.
predict
function. Prediction method to be used. The vector of predicted values must have the same length as the number of to-be-predicted observations. Predictions corresponding to missing data must be replaced by NA.
iclass
character. Specifying the class membership variable (response) in data in the framework of indirect classification.
estimator
estimator of the misclassification error: "cv" (cross-validation), "boot" (bootstrap) or "632plus" (bias-corrected .632+ bootstrap; classification only).
est.para
a list of additional parameters for the estimator: k for k-fold cross-validation or nboot for the number of bootstrap replications.
...
additional parameters to model.

Value

An object of class errorest, i.e. a list with the following elements:
  • err: estimated misclassification error for a nominal response, or the square root of the estimated mean squared error for a continuous response.
  • estimator: the kind of estimator used.
  • para: additional parameters for the estimator.
  • data.name: names of the variables used.
  • class: logical. TRUE for classification problems.
  • sd: jackknife estimate of the standard deviation of err (if estimator = "boot").
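The components can be accessed from the returned list in the usual way; a minimal sketch, assuming est holds the result of an errorest() call with estimator = "boot":

est$err        # estimated prediction error
est$estimator  # "cv", "boot" or "632plus"
est$sd         # jackknife standard deviation of err (bootstrap only)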

Details

The prediction error for classification and regression models using cross-validation or the bootstrap can be computed by errorest. Any model can be specified as long as it is a function with arguments model(formula, data, subset, na.action, ...). If a predict method predict(object, newdata, ...) is available for the fitted model, predict does not need to be specified. However, predict has to return predicted values directly comparable to the responses. See the examples below.
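For illustration, a wrapper with the required signature might look as follows; the names mymodel and mypredict.class are placeholders for this sketch, not part of the package:

# A minimal sketch of the required interface (names are illustrative):
mymodel <- function(formula, data, subset, na.action, ...)
  MASS::lda(formula, data = data, ...)        # subset/na.action accepted only to match the interface

mypredict.class <- function(object, newdata)
  predict(object, newdata = newdata)$class    # returns values comparable to the response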

k-fold cross-validation and the usual bootstrap estimator with est.para$nboot bootstrap replications can be computed for both classification and regression problems. The bias-corrected .632+ bootstrap estimator by Efron and Tibshirani (1997) is available for classification problems only.
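For a continuous response the same interface applies; since predict.lm() returns fitted values directly comparable to the response, no predict function should be needed. A minimal sketch with simulated data (the data frame dreg is illustrative only):

# 10-fold cross-validated root mean squared error of a linear model
dreg <- data.frame(matrix(rnorm(1100), ncol = 11))
names(dreg) <- c("z", paste("x", 1:10, sep = ""))
errorest(z ~ ., data = dreg, model = lm, estimator = "cv",
         est.para = list(k = 10))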

print.errorest is available for inspection of the results.

References

Bradley Efron and Robert Tibshirani (1997), Improvements on Cross-Validation: The .632+ Bootstrap Estimator. Journal of the American Statistical Association 92(438), 548--560.

Brian D. Ripley (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

David J. Hand, Hua Gui Li, Niall M. Adams (2001), Supervised classification with structured class definitions. Computational Statistics & Data Analysis 36, 209--225.

Examples

library("ipred")
library("MASS")     # lda()
library("rpart")    # rpart(), prune()
library("mlbench")  # Glass data

X <- as.data.frame(matrix(rnorm(1000), ncol = 10))
y <- factor(ifelse(apply(X, 1, mean) > 0, 1, 0))
learn <- cbind(y, X)

mypredict.lda <- function(object, newdata)
  predict(object, newdata = newdata)$class

errorest(y ~ ., data= learn, model=lda, 
         estimator = "cv", predict= mypredict.lda)

# n-fold cv = leave-one-out.

errorest(y ~ ., data= learn, model=lda, 
         estimator = "cv", est.para=list(k = nrow(learn)), 
         predict= mypredict.lda)

errorest(y ~ ., data= learn, model=lda, 
         estimator = "boot", predict= mypredict.lda)

errorest(y ~ ., data= learn, model=lda, 
         estimator = "632plus", predict= mypredict.lda)

attach(learn)
errorest(y ~ V1 + V2 + V3, model=lda, estimator = "cv",
         predict= mypredict.lda)
detach(learn)


mypredict.rpart <- function(object, newdata)
  predict(object, newdata = newdata, type = "class")

errorest(y ~ ., data= learn, model=rpart, estimator = "cv",
         predict=mypredict.rpart)

errorest(y ~ ., data= learn, model=rpart, estimator = "boot",
predict=mypredict.rpart)

errorest(y ~ ., data= learn, model=rpart, estimator = "632plus",
predict=mypredict.rpart)

errorest(y ~ ., data= learn, model=bagging, estimator = "cv",
nbagg=10)

data(Glass)

# LDA has a cross-validated misclassification error of 
# 38% (Ripley, 1996, page 98)
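# A corresponding call (sketch; uses mypredict.lda defined above):
errorest(Type ~ ., data = Glass, model = lda,
         estimator = "cv", predict = mypredict.lda)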


# Pruned trees: about 32% (Ripley, 1996, page 230)

pruneit <- function(formula, ...)
  prune(rpart(formula, ...), cp = 0.01)

errorest(Type ~ ., data = Glass, model = pruneit, estimator = "cv",
         predict = mypredict.rpart)

data(smoking)
# Set three groups of variables:
# 1) explanatory variables are: TarY, NicY, COY, Sex, Age
# 2) intermediate variables are: TVPS, BPNL, COHB
# 3) response (resp) is defined by:

resp <- function(data){
  res <- t(t(data) > c(4438, 232.5, 58))
  res <- as.factor(ifelse(apply(res, 1, sum) > 2, 1, 0))
  res
}

response <- resp(smoking[ ,c("TVPS", "BPNL", "COHB")])
smoking <- cbind(smoking, response)

formula <- TVPS + BPNL + COHB ~ TarY + NicY + COY + Sex + Age

mypredict.inclass <- function(object, newdata){
  res <- predict.inclass(object = object, cFUN = resp, newdata = newdata)
  return(res)
}

# The leave-one-out estimate of the misclassification error is
# 36.36% (Hand et al., 2001), using indirect classification with
# linear models

errorest(formula, data = smoking, model = inclass,
         predict = mypredict.inclass, estimator = "cv",
         iclass = "response", pFUN = lm,
         est.para = list(k = nrow(smoking)))
