gcv: Estimate EPE Using Delete-d Cross-Validation

Description

This is a general-purpose function for estimating the expected prediction error (EPE) under a specified cost function in regression and classification problems. For regression, the default cost function is the mean-square error; for classification, it is the misclassification rate. For regression, the package includes direct support for elastic penalty regression, LASSO, PCR, PLSR, nearest neighbour and Random Forest regression. For classification, built-in support functions are provided for LDA, QDA, Naive Bayes, kNN, CART, C5.0, Random Forest and SVM. The vignette section includes examples for SCAD, MCP and best subset regression. Illustrative example datasets and data generation models are also provided.

Usage

gcv(X, y, MaxIter = 1000, d = ceiling(length(y)/10), NCores = 1, cost = mse,  yhat = yhat_lm, libs = character(0), seed = "default", ...)

Arguments

X

Inputs, a matrix or data frame

y

Output vector

MaxIter

Number of iterations of the CV procedure

d

Number of observations in the hold-out sample

NCores

Number of cores to use. The default, 1, does not use the parallel package. Otherwise, set it to the number of cores available. If unsure, just experiment!

cost

Function that computes the average cost. See the examples mse, mae, mape.

yhat

In general it must be a function with arguments dfTrain and dfTest. See examples below.

libs

Libraries required by the predictor.

seed

The default uses R's own default seeding, which is based on the current time. Otherwise, set it to an integer value. See Details.

...

Additional arguments that are passed to yhat.
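To make the roles of cost, yhat, MaxIter and d concrete, here is a minimal base-R sketch of a delete-d cross-validation loop. It is not the package's implementation. The names my_mse, my_yhat_lm and delete_d_cv are hypothetical, and the sketch assumes that the response sits in the last column of dfTrain and dfTest, that cost takes the observed and predicted values in that order, and that a naive standard error across iterations is an acceptable measure of uncertainty.

```r
# Hypothetical cost function: mean squared error of the predictions.
my_mse <- function(y, yh) mean((y - yh)^2)

# Hypothetical predictor with the dfTrain/dfTest signature.
# Assumption: the response is the last column of both data frames.
my_yhat_lm <- function(dfTrain, dfTest) {
  names(dfTrain)[ncol(dfTrain)] <- "y"
  names(dfTest)[ncol(dfTest)] <- "y"
  fit <- lm(y ~ ., data = dfTrain)
  predict(fit, newdata = dfTest)
}

# Delete-d cross-validation: hold out d random rows in each iteration,
# fit on the remaining rows, and average the hold-out cost.
delete_d_cv <- function(X, y, MaxIter = 100, d = ceiling(length(y) / 10),
                        cost = my_mse, yhat = my_yhat_lm) {
  df <- data.frame(X, y = y)
  n <- nrow(df)
  costs <- numeric(MaxIter)
  for (i in seq_len(MaxIter)) {
    test <- sample(n, d)  # indices of the hold-out sample
    yh <- yhat(df[-test, , drop = FALSE], df[test, , drop = FALSE])
    costs[i] <- cost(df$y[test], yh)
  }
  # Naive standard error: treats the iterations as independent.
  c(epe = mean(costs), sd = sd(costs) / sqrt(MaxIter))
}

# Simulated data, for illustration only.
set.seed(1)
X <- matrix(rnorm(200), ncol = 4)
y <- drop(X %*% c(1, 0.5, 0, 0)) + rnorm(50)
res <- delete_d_cv(X, y, MaxIter = 50, d = 5)
print(res)
```

Swapping in another predictor only requires a function with the same two arguments that returns one prediction per row of dfTest.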

Value

Four quantities are returned. They are, respectively, the estimated EPE, the standard deviation of this estimate, an out-of-sample estimate of the snr (signal-to-noise ratio) and an out-of-sample estimate of the correlation between the prediction and the true value.

Details

If only serial evaluation had been implemented, I would have used set.seed to control the random number generation. But I have included seed as an argument since it can also be used to set the seed of the parallel random number generator. This is sometimes useful for replicating simulations. If the argument seed is used, it also sets the seed when only serial computation is done.
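The difference between serial and parallel seeding can be illustrated with the parallel package that ships with R. This is a sketch under assumptions, not a description of how gcv handles seed internally: draw_holdout and holdouts_parallel are hypothetical helpers, and the sketch assumes that a two-worker local cluster can be started on the machine.

```r
library(parallel)

# Hypothetical worker task: draw d hold-out indices out of n observations.
draw_holdout <- function(i, n, d) sample(n, d)

# Hypothetical helper: run the task on a small cluster with a fixed seed.
holdouts_parallel <- function(seed, n = 20, d = 5, workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = seed)  # one reproducible RNG stream per worker
  parLapply(cl, seq_len(workers), draw_holdout, n = n, d = d)
}

# Serial case: set.seed alone makes the draws repeatable.
set.seed(123); s1 <- sample(20, 5)
set.seed(123); s2 <- sample(20, 5)
serial_same <- identical(s1, s2)

# Parallel case: calling set.seed in the master session does not seed the
# workers, so the cluster streams are seeded explicitly instead.
p1 <- holdouts_parallel(123)
p2 <- holdouts_parallel(123)
parallel_same <- identical(p1, p2)
```

Parallel results are only repeatable if the number of workers and the way tasks are split across them also stay the same between runs.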

References

ESL

Examples

#Simple example but in general, MaxIter >= 1000 is recommended.
Xy <- ShaoReg()
gcv(Xy[,1:8], Xy[,9], MaxIter=25, d=5)