h.cv: Cross-validation methods for bandwidth selection

Description

Selects the bandwidth of a local polynomial kernel (regression, density or variogram) estimator using (standard or modified) CV, GCV or MASE criteria.

Usage

h.cv(bin, ...)
# S3 method for bin.data
h.cv(
  bin,
  objective = c("CV", "GCV", "MASE"),
  h.start = NULL,
  h.lower = NULL,
  h.upper = NULL,
  degree = 1,
  ncv = ifelse(objective == "CV", 2, 0),
  cov.bin = NULL,
  DEalgorithm = FALSE,
  warn = TRUE,
  tol.mask = npsp.tolerance(2),
  ...
)
# S3 method for bin.den
h.cv(
  bin,
  h.start = NULL,
  h.lower = NULL,
  h.upper = NULL,
  degree = 1,
  ncv = 2,
  DEalgorithm = FALSE,
  ...
)
# S3 method for svar.bin
h.cv(
  bin,
  loss = c("MRSE", "MRAE", "MSE", "MAE"),
  h.start = NULL,
  h.lower = NULL,
  h.upper = NULL,
  degree = 1,
  ncv = 1,
  DEalgorithm = FALSE,
  warn = FALSE,
  ...
)
hcv.data(
  bin,
  objective = c("CV", "GCV", "MASE"),
  h.start = NULL,
  h.lower = NULL,
  h.upper = NULL,
  degree = 1,
  ncv = ifelse(objective == "CV", 1, 0),
  cov.dat = NULL,
  DEalgorithm = FALSE,
  warn = TRUE,
  ...
)

Value

Returns a list containing the following 3 components:

h: the best (diagonal) bandwidth matrix found.
value: the value of the objective function corresponding to h.
objective: the criterion used.

Arguments

bin: object used to select a method (binned data, binned density or binned semivariogram).
...: further arguments passed to or from other methods (e.g. parameters of the optimization routine).
objective: character; optimal criterion to be used ("CV", "GCV" or "MASE").
h.start: vector; initial values for the parameters (diagonal elements) to be optimized over. If DEalgorithm == FALSE (otherwise not used), defaults to (3 + ncv) * lag, where lag = bin$grid$lag.
h.lower: vector; lower bounds on each parameter (diagonal elements) to be optimized. Defaults to (1.5 + ncv) * bin$grid$lag.
h.upper: vector; upper bounds on each parameter (diagonal elements) to be optimized. Defaults to 1.5 * dim(bin) * bin$grid$lag.
degree: degree of the local polynomial used. Defaults to 1 (local linear estimation).
ncv: integer; determines the number of cells leaved out in each dimension. (0 to GCV considering all the data, $>0$ to traditional or modified cross-validation). See "Details" bellow.
cov.bin: (optional) covariance matrix of the binned data or semivariogram model (svarmod-class) of the (unbinned) data. Defaults to the identity matrix.
DEalgorithm: logical; if TRUE, the differential evolution optimization algorithm in package DEoptim is used.
warn: logical; sets the handling of warning messages (normally due to the lack of data in some neighborhoods). If FALSE all warnings are ignored.
tol.mask: tolerance used in the aproximations. Defaults to npsp.tolerance(2).
loss: character; CV error. See "Details" bellow.
cov.dat: covariance matrix of the data or semivariogram model (of class extending svarmod). Defaults to the identity matrix (uncorrelated data).

Details

Currently, only diagonal bandwidths are supported.

h.cv methods use binning approximations to the objective function values (in almost all cases, an averaged squared error). If ncv > 0, estimates are computed by leaving out binning cells with indexes within the intervals $[x_i - ncv + 1, x_i + ncv - 1]$, at each dimension i, where $x$ denotes the index of the estimation location. $ncv = 1$ corresponds with traditional cross-validation and $ncv > 1$ with modified CV (it may be appropriate for dependent data; see e.g. Chu and Marron, 1991, for the one dimensional case). Setting ncv >= 2 would be recommended for sparse data (as linear binning is used). For standard GCV, set ncv = 0 (the whole data would be used). For theoretical MASE, set bin = binning(x, y = trend.teor), cov = cov.teor and ncv = 0.

If DEalgorithm == FALSE, the "L-BFGS-B" method in optim is used.

The different options for the argument loss in h.cv.svar.bin() define the CV error considered in semivariogram estimation:

"MSE": Mean squared error
"MRSE": Mean relative squared error
"MAE": Mean absolute error
"MRAE": Mean relative absolute error

hcv.data evaluates the objective function at the original data (combining a binning approximation to the nonparametric estimates with a linear interpolation), this can be very slow (and memory demanding; consider using h.cv instead). If ncv > 1 (modified CV), a similar algorithm to that in h.cv is used, estimates are computed by leaving out binning cells with indexes within the intervals $[x_i - ncv + 1, x_i + ncv - 1]$.

References

Chu, C.K. and Marron, J.S. (1991) Comparison of Two Bandwidth Selectors with Dependent Errors. The Annals of Statistics, 19, 1906-1918.

Francisco-Fernandez M. and Opsomer J.D. (2005) Smoothing parameter selection methods for nonparametric regression with spatially correlated errors. Canadian Journal of Statistics, 33, 539-558.

Examples

Run this code

# Trend estimation
bin <- binning(earthquakes[, c("lon", "lat")], earthquakes$mag)
hcv <- h.cv(bin, ncv = 2)
lp <- locpol(bin, h = hcv$h)
# Alternatively, `locpolhcv()` could be called instead of the previous code. 

simage(lp, main = 'Smoothed magnitude')
contour(lp, add = TRUE)
with(earthquakes, points(lon, lat, pch = 20))

# Density estimation
hden <- h.cv(as.bin.den(bin))
den <- np.den(bin, h = hden$h)

plot(den, main = 'Estimated log(density)')

Run the code above in your browser using DataLab