cvTool: Low-level function for cross-validation

Description

Basic function to estimate the prediction error of a model via (repeated) \(K\)-fold cross-validation. The model is thereby specified by an unevaluated function call to a model fitting function.

Usage

cvTool(
  call,
  data = NULL,
  x = NULL,
  y,
  cost = rmspe,
  folds,
  names = NULL,
  predictArgs = list(),
  costArgs = list(),
  envir = parent.frame()
)

Value

If only one replication is requested and the prediction loss function cost also returns the standard error, a list is returned, with the first component containing the estimated prediction errors and the second component the corresponding estimated standard errors.

Otherwise the return value is a numeric matrix in which each column contains the respective estimated prediction errors from all replications.

Arguments

call: an unevaluated function call for fitting a model (see call).
data: a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula.
x: a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments.
y: a numeric vector or matrix containing the response.
cost: a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error (see cost).
folds: an object of class "cvFolds" giving the folds of the data for cross-validation (as returned by cvFolds).
names: an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”).
predictArgs: a list of additional arguments to be passed to the predict method of the fitted models.
costArgs: a list of additional arguments to be passed to the prediction loss function cost.
envir: the environment in which to evaluate the function call for fitting the models (see eval).

Author

Andreas Alfons

Details

(Repeated) \(K\)-fold cross-validation is performed in the following way. The data are first split into \(K\) previously obtained blocks of approximately equal size (given by folds). Each of the \(K\) data blocks is left out once to fit the model, and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation.

The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation (as indicated by folds), this process is replicated and the estimated prediction errors from all replications are returned.

Furthermore, if the response is a vector but the predict method of the fitted models returns a matrix, the prediction error is computed for each column. A typical use case for this behavior would be if the predict method returns predictions from an initial model fit and stepwise improvements thereof.

If data is supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, a character string specifying the argument name can be passed via names (the default is to use "data").

If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that data takes precedence over x if both are supplied.

Examples

Run this code

library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

# set up function call for an MM regression model
call <- call("lmrob", formula = Y ~ .)
# set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

# perform cross-validation
cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe, 
    folds = folds, costArgs = list(trim = 0.1))

Run the code above in your browser using DataLab