boot (version 1.2-10)

cv.glm: Cross-validation for Generalized Linear Models

Description

This function calculates the estimated K-fold cross-validation prediction error for generalized linear models.

Usage

cv.glm(data, glmfit, cost, K)

Arguments

data
A matrix or dataframe containing the data. The rows should be cases and the columns correspond to variables, one of which is the response.
glmfit
An object of class "glm" containing the results of a generalized linear model fitted to data.
cost
A function of two vector arguments specifying the cost function for the cross-validation. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses
K
The number of groups into which the data should be split to estimate the cross-validation prediction error. The value of K must be such that all groups are of approximately equal size. If the supplied value of K does not satisf
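
As a rough sketch of how the four arguments fit together, the call below supplies a user-written cost function and a value of K. The `mtcars` data, the model formula, and the `mae` function are illustrative assumptions and are not part of this help page.

```r
library(boot)

# Hypothetical cost function: mean absolute error.
# First argument: observed responses; second: predicted responses.
mae <- function(y, yhat) mean(abs(y - yhat))

# mtcars ships with R and is used here purely for illustration.
fit <- glm(mpg ~ wt + hp, data = mtcars)

set.seed(1)  # the split into groups is random
# 32 cases split into K = 4 groups of 8 cases each.
cv4 <- cv.glm(mtcars, fit, cost = mae, K = 4)
cv4$delta
```

Choosing K so that it divides the number of cases evenly, as here, keeps the groups the same size.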

Value

The returned value is a list with the following components.

  • call: The original call to cv.glm.
  • K: The value of K used for the K-fold cross validation.
  • delta: A vector of length two. The first component is the raw cross-validation estimate of prediction error. The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.
  • seed: The value of .Random.seed when cv.glm was called.
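
The components listed above can be inspected as in the following minimal sketch. The `cars` data and the model are assumptions chosen only because they ship with R.

```r
library(boot)

# Illustrative model on a dataset bundled with R (not from this help page).
fit <- glm(dist ~ speed, data = cars)
res <- cv.glm(cars, fit, K = 10)

names(res)      # component names of the returned list
res$K           # value of K actually used
res$delta[1]    # raw cross-validation estimate
res$delta[2]    # adjusted cross-validation estimate
```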

Side Effects

The value of .Random.seed is updated.
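
Because the generator state is both recorded in the result and advanced by the call, a random partition can in principle be replayed. The sketch below assumes that restoring the stored seed component before a second call reproduces the same split; the `cars` model is illustrative only.

```r
library(boot)

fit <- glm(dist ~ speed, data = cars)   # illustrative model (assumption)

first <- cv.glm(cars, fit, K = 5)

# Restore the generator state recorded when the first call started,
# so that the second call draws the same random partition.
assign(".Random.seed", first$seed, envir = .GlobalEnv)
second <- cv.glm(cars, fit, K = 5)

identical(first$delta, second$delta)
```

Calling `set.seed()` before each call is a simpler way to get the same effect when the seed can be chosen in advance.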

Details

The data are divided randomly into K groups. For each group, the generalized linear model is refitted to the data with that group omitted. The function cost is then applied to the observed responses in the omitted group and to the predictions that the refitted model makes for those observations.
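
The procedure described above can be sketched by hand in base R. This is not the package's implementation: the simulated data, the variable names, and the choice to weight each group's cost by its share of the cases are assumptions made for illustration.

```r
# Toy data (assumption): a linear relationship with Gaussian noise.
set.seed(42)
n <- 60
d <- data.frame(x = runif(n))
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.3)

K <- 6
cost <- function(y, yhat) mean((y - yhat)^2)   # average squared error

# Randomly assign each case to one of K equally sized groups.
fold <- sample(rep(seq_len(K), length.out = n))

cv_raw <- 0
for (k in seq_len(K)) {
  held_out <- fold == k
  # Fit the model to the data with group k omitted.
  fit_k <- glm(y ~ x, data = d[!held_out, ])
  # Predict the omitted cases on the response scale.
  pred_k <- predict(fit_k, newdata = d[held_out, ], type = "response")
  # Accumulate the group's cost, weighted by its share of the cases.
  cv_raw <- cv_raw + mean(held_out) * cost(d$y[held_out], pred_k)
}
cv_raw
```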

When K equals the number of observations, leave-one-out cross-validation is used and all the possible splits of the data are used. When K is less than the number of observations, the K splits are found by randomly partitioning the data into K groups of approximately equal size. In this latter case a certain amount of bias is introduced, which can be reduced by a simple adjustment (see equation 6.48 in Davison and Hinkley, 1997). The second value returned in delta is the estimate adjusted by this method.
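
One way to picture the adjustment is sketched below. It assumes the adjusted estimate is the raw estimate, plus the cost of the full-data fit, minus the group-weighted average cost of each reduced fit evaluated on all the data. The helper `cv_by_hand`, the simulated data, and the restriction to the default Gaussian family are hypothetical choices for illustration and should be checked against Davison and Hinkley (1997) before being relied on.

```r
# Hypothetical helper: raw and adjusted estimates for a given group assignment.
cv_by_hand <- function(data, formula, fold,
                       cost = function(y, yhat) mean((y - yhat)^2)) {
  y <- model.response(model.frame(formula, data))
  full_fit <- glm(formula, data = data)
  raw <- 0
  # Start the correction from the cost of the model fitted to all the data.
  correction <- cost(y, fitted(full_fit))
  for (k in unique(fold)) {
    out <- fold == k
    p_k <- mean(out)                       # share of cases in group k
    fit_k <- glm(formula, data = data[!out, , drop = FALSE])
    pred_all <- predict(fit_k, newdata = data, type = "response")
    raw <- raw + p_k * cost(y[out], pred_all[out])
    correction <- correction - p_k * cost(y, pred_all)
  }
  c(raw = raw, adjusted = raw + correction)
}

# Toy data (assumption).
set.seed(7)
n <- 40
toy <- data.frame(x = rnorm(n))
toy$y <- 0.5 + 1.5 * toy$x + rnorm(n)

fold <- sample(rep(1:5, length.out = n))   # random partition into 5 groups
cv_by_hand(toy, y ~ x, fold)
```

Passing `seq_len(n)` as the group assignment puts every case in its own group, which gives the leave-one-out case.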

References

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees. Wadsworth.

Burman, P. (1989) A comparative study of ordinary cross-validation, v-fold cross-validation and repeated learning-testing methods. Biometrika, 76, 503--514.

Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and Their Application. Cambridge University Press.

Efron, B. (1986) How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461--470.

Stone, M. (1974) Cross-validation choice and assessment of statistical predictions (with Discussion). Journal of the Royal Statistical Society, B, 36, 111--147.

See Also

glm, glm.diag, predict

Examples

# leave-one-out and 6-fold cross-validation prediction error for 
# the mammals data set.
library(boot)
data(mammals, package="MASS")
mammals.glm <- glm(log(brain)~log(body),data=mammals)
cv.err <- cv.glm(mammals,mammals.glm)
cv.err.6 <- cv.glm(mammals, mammals.glm, K=6)


# As this is a linear model we could calculate the leave-one-out 
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
cv.err <- mean((mammals.glm$y-muhat)^2/(1-mammals.diag$h)^2)


# leave-one-out and 11-fold cross-validation prediction error for 
# the nodal data set.  Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi=0) mean(abs(r-pi)>0.5)


data(nodal)
nodal.glm <- glm(r~stage+xray+acid,binomial,data=nodal)
cv.err <- cv.glm(nodal, nodal.glm, cost, K=nrow(nodal))$delta 
cv.11.err <- cv.glm(nodal, nodal.glm, cost, K=11)$delta
