NCV estimates smoothing parameters by optimizing a model's average ability to predict subsets of the data when those subsets are omitted from fitting. Usually the predicted subset is a subset of the omitted subset. If both subsets are the same single datapoint, and the average is over all datapoints, then NCV is leave-one-out cross validation. QNCV is a quadratic approximation to NCV, guaranteed to be finite for any family/link combination.
In detail, suppose that a model is estimated by minimizing a penalized loss $$\sum_i D(y_i,\theta_i) + \sum_j \lambda_j \beta^{\sf T} {S}_j \beta $$ where \(D\) is a loss (such as a negative log likelihood), dependent on response \(y_i\) and parameter vector \(\theta_i\), which in turn depends on covariates via one or more smooth linear predictors with coefficients \(\beta\). The quadratic penalty terms penalize model complexity: \(S_j\) is a known matrix and \(\lambda_j\) an unknown smoothing parameter. Given smoothing parameters the penalized loss is readily minimized to estimate \(\beta\).
The smoothing parameters also have to be estimated. To this end, choose \(k = 1,\ldots,m\) subsets \(\alpha(k)\subset \{1,\ldots,n\}\) and \(\delta(k)\subset \{1,\ldots,n\}\). Usually \(\delta(k)\) is a subset of (or equal to) \(\alpha(k)\). Let \(\theta_i^{\alpha(k)}\) denote the estimate of \(\theta_i\) when the points indexed by \(\alpha(k)\) are omitted from fitting. Then the NCV criterion $$V = \sum_{k=1}^m \sum_{i \in \delta(k)} D(y_i,\theta_i^{\alpha(k)})$$ is minimized w.r.t. the smoothing parameters, \(\lambda_j\). If \(m=n\) and \(\alpha(k)=\delta(k)=k\) then ordinary leave-one-out cross validation is recovered. This formulation covers many of the variants of cross validation reviewed in Arlot and Celisse (2010), for example.
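For illustration only, the NCV criterion for a Gaussian additive model can be evaluated by brute force, refitting for each dropped subset. This is not how mgcv computes it, and alpha, delta and dat are placeholder names: alpha and delta are assumed to be lists of index vectors and dat a data frame with columns y and x.

require(mgcv)
## brute force NCV for fixed smoothing parameter(s) sp: drop alpha[[k]],
## refit, then accumulate the squared prediction error over delta[[k]]
ncv.brute <- function(sp, dat, alpha, delta) {
  V <- 0
  for (k in seq_along(alpha)) {
    fit <- gam(y ~ s(x, k = 20), data = dat[-alpha[[k]], ], sp = sp)
    mu <- predict(fit, newdata = dat[delta[[k]], , drop = FALSE])
    V <- V + sum((dat$y[delta[[k]]] - mu)^2) ## squared error loss D
  }
  V
}
## alpha[[k]] = delta[[k]] = k for k = 1,...,n gives leave-one-out CV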
Except for a quadratic loss, \(V\) cannot be computed exactly, but it can be computed to \(O(n^{-2})\) accuracy (for fixed basis size) by taking single Newton optimization steps from the full data \(\beta\) estimate to its equivalent when each \(\alpha(k)\) is dropped. This is what mgcv does. The Newton steps require updating the full model Hessian to its equivalent when each datum is dropped. This can be achieved at \(O(p^2)\) cost, where \(p\) is the dimension of \(\beta\). Hence, for example, the ordinary cross validation criterion is computable at the \(O(np^2)\) cost of estimating the model given smoothing parameters.
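In outline, using the notation above (a sketch consistent with this description, not the exact implementation): because the full data penalized loss has zero gradient at its minimizer \(\hat\beta\), the single Newton step for dropped subset \(\alpha(k)\) is $$\hat\beta^{\alpha(k)} \approx \hat\beta + H_{\alpha(k)}^{-1} \sum_{i \in \alpha(k)} \left. \frac{\partial D(y_i,\theta_i)}{\partial \beta} \right|_{\hat\beta}, \qquad H_{\alpha(k)} = H - \sum_{i \in \alpha(k)} \left. \frac{\partial^2 D(y_i,\theta_i)}{\partial \beta \partial \beta^{\sf T}} \right|_{\hat\beta},$$ where \(H\) is the penalized Hessian of the full data loss at \(\hat\beta\). The \(\theta_i^{\alpha(k)}\) are then obtained from \(\hat\beta^{\alpha(k)}\) via the linear predictors.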
The NCV score computed in this way is optimized using a BFGS quasi-Newton method, adapted to the case in which smoothing parameters tending to infinity may cause indefiniteness.
For bam(...,discrete=TRUE) NCV can be used to estimate the smoothing parameters of the working penalized weighted linear model. This is typically very expensive relative to REML estimation, but the cost can be mitigated by basing the NCV criterion on a (usually random) sample of the neighbourhoods. Note that when leave-out-neighbourhood CV is used to reduce the effects of autocorrelation, it is important to supply a neighbourhood for each observation, and to supply a sample element in nei. This ensures that the parameter covariance matrix is estimated correctly.
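For example, a minimal sketch of sampled-neighbourhood NCV, assuming a version of mgcv in which bam accepts method="NCV" together with nei when discrete=TRUE (y, x and the nei list are placeholders):

nei$sample <- 200 ## base the NCV criterion on 200 randomly sampled neighbourhoods
bs <- bam(y ~ s(x, k = 40), method = "NCV", nei = nei, discrete = TRUE)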
A routine applied problem is that smoothing parameters tend to be underestimated in the presence of un-modelled short range autocorrelation, as the smooths try to fit the local excursions in the data caused by the autocorrelation. Cross validation will tend to 'fit the noise' when there is autocorrelation, since a model that fits the noise in the data correlated with an omitted datum will also tend to closely fit the noise in the omitted datum, because of the correlation. That is, autocorrelation works against the avoidance of overfit that cross validation seeks to achieve.
For short range autocorrelation the problems can be avoided, or at least mitigated, by predicting each datum when all the data in its 'local' neighbourhood are omitted, the neighbourhoods being constructed so that un-modelled correlation between the point of interest and points outside its neighbourhood is minimized. That is, we set \(m=n\), \(\delta(k)=k\) and \(\alpha(k) = {\tt nei}(k)\), where nei(k) contains the indices of the neighbours of point \(k\). This approach has been known for a long time (e.g. Chu and Marron, 1991; Roberts et al., 2017), but was previously rather too expensive for regular use in smoothing parameter estimation.
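As an illustration, neighbourhoods of this sort might be built for an irregularly spaced covariate t by taking \(\alpha(k)\) to be all points within distance h of point k. This is a sketch only: make.nei is not part of mgcv, and the list format it returns is described below.

make.nei <- function(t, h) {
  n <- length(t)
  alpha <- lapply(1:n, function(k) which(abs(t - t[k]) <= h)) ## alpha(k): neighbours of k (includes k)
  list(a = unlist(alpha),            ## dropped indices, neighbourhood by neighbourhood
       ma = cumsum(lengths(alpha)),  ## end of each neighbourhood within a
       d = 1:n, md = 1:n)            ## delta(k) = k: predict each point in turn
}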
The neighbourhood subsets \(\alpha(k)\) and \(\delta(k)\) have to be supplied to gam, and the nei argument does this. It is a list with the following elements. a is the vector of indices to be dropped for each neighbourhood, and ma gives the end of each neighbourhood, so that nei$a[(nei$ma[j-1]+1):nei$ma[j]] gives the points dropped for neighbourhood j: that is \(\alpha(j)\). d is the vector of indices of points to predict, and md gives the corresponding endpoints, so that nei$d[(nei$md[j-1]+1):nei$md[j]] indexes the points to predict for neighbourhood j: that is \(\delta(j)\).
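For instance, for n = 5 points with each neighbourhood consisting of a point and its immediate left and right neighbours, the list could be written out by hand as

nei <- list(a  = c(1,2, 1,2,3, 2,3,4, 3,4,5, 4,5), ## alpha(1),...,alpha(5)
            ma = c(2, 5, 8, 11, 13),               ## end of each neighbourhood within a
            d  = 1:5, md = 1:5)                    ## delta(k) = k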
sample is an optional element used by bam. If it is a single number it gives the number of neighbourhoods to sample randomly in order to construct the NCV criterion. If it is a vector then it should contain the indices of the neighbourhoods to use.

jackknife is an optional element used by gam. If supplied and TRUE then variance estimates are based on the raw jackknife estimate; if FALSE then on the standard Bayesian results. If not supplied (the usual case) then an estimator accounting for the neighbourhood structure is used, which largely accounts for any correlation present within neighbourhoods. jackknife is ignored if NCV is being calculated for a model for which another method is used for smoothing parameter selection.
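For example, raw jackknife based variance estimates could be requested as follows (a sketch reusing a nei list of the form above; y and x are placeholders):

nei$jackknife <- TRUE ## use the raw jackknife variance estimator
bj <- gam(y ~ s(x, k = 40), method = "NCV", nei = nei)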
If nei==NULL (or a or ma are missing) then leave-one-out cross validation is used. If nei is supplied but NCV is not selected as the smoothing parameter estimation method, then the NCV criterion is simply computed (but not optimized).
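For example, a call such as (placeholder data)

b <- gam(y ~ s(x, k = 40), method = "NCV") ## no nei supplied: leave-one-out CV

performs ordinary leave-one-out cross validation.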
If a model is specified in which some coefficient values, \(\beta\), have non-finite likelihood, then the NCV criterion computed with single Newton steps could also be non-finite. A simple fix replaces the NCV criterion with a quadratic approximation to the criterion around the full data fit. The quadratic approximation is always finite. This 'QNCV' is essential for some families, such as gevlss. QNCV is not needed for bam estimation.
Although the leading order cost of NCV is the same as that of REML or GCV, the actual cost is higher because the dominant operation costs are in matrix-vector, rather than matrix-matrix, operations, so BLAS speed-ups are small. However, multi-core computing is worthwhile for NCV. See the option ncv.threads in gam.control.
Simon N. Wood simon.wood@r-project.org
Chu, C.K. and Marron, J.S. (1991) Comparison of two bandwidth selectors with dependent errors. The Annals of Statistics 19, 1906-1918.

Arlot, S. and Celisse, A. (2010) A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40-79.

Roberts et al. (2017) Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40(8), 913-929.

Wood, S.N. (2023) On Neighbourhood Cross Validation. https://arxiv.org/abs/2404.16490
require(mgcv)
nei.cor <- function(h,n) { ## construct nei structure
## each point's neighbourhood is the 2h+1 points centred on it, truncated
## at the ends of the series; each point is predicted in turn (delta(k)=k)
  nei <- list(md=1:n,d=1:n) ## predict each point in turn
  nei$ma <- cumsum(c((h+1):(2*h+1),rep(2*h+1,n-2*h-2),(2*h+1):(h+1))) ## neighbourhood ends
  a0 <- rep(0,0); if (h>0) for (i in 1:h) a0 <- c(a0,1:(h+i)) ## truncated neighbourhoods at start
  a1 <- n-a0[length(a0):1]+1 ## ... and at the end
  nei$a <- c(a0,1:(2*h+1)+rep(0:(n-2*h-1),each=2*h+1),a1) ## dropped indices, alpha(k)
  nei
}
set.seed(1)
n <- 500;sig <- .6
x <- 0:(n-1)/(n-1)
f <- sin(4*pi*x)*exp(-x*2)*5/2
e <- rnorm(n,0,sig)
for (i in 2:n) e[i] <- 0.6*e[i-1] + e[i] ## AR(1)-type autocorrelated noise
y <- f + e ## autocorrelated data
nei <- nei.cor(4,n) ## construct neighbourhoods to mitigate the autocorrelation
b0 <- gam(y~s(x,k=40)) ## GCV based fit
gc <- gam.control(ncv.threads=2)
b1 <- gam(y~s(x,k=40),method="NCV",nei=nei,control=gc)
## use "QNCV", which is identical here...
b2 <- gam(y~s(x,k=40),method="QNCV",nei=nei,control=gc)
## plot GCV and NCV based fits...
f <- f - mean(f) ## centre the truth, since the smooth estimates are centred
par(mfrow=c(1,2))
plot(b0,rug=FALSE,scheme=1);lines(x,f,col=2)
plot(b1,rug=FALSE,scheme=1);lines(x,f,col=2)