gam: Generalized Additive Models using penalized regression splines and GCV

Description

Fits the specified generalized additive model (GAM) to data. gam() is not a clone of what Splus provides. Smooth terms are represented using penalized regression splines with smoothing parameters selected by GCV or by regression splines with fixed degrees of freedom (mixtures of the two are permitted). Multi-dimensional smooths are available using penalized thin plate regression splines, but the user must make sure that covariates are sensibly scaled relative to each other when using such terms. For a general overview see Wood (2001).

Usage

gam(formula,family=gaussian(),data=list(),weights=NULL,control=gam.control,scale=0)

Arguments

formula

A GAM formula. This is exactly like the formula for a glm except that smooth terms can be added to the right hand side of the formula (and a formula of the form y ~ . is not allowed). Smooth terms are specified by expressions of the form:

family

This is a family object specifying the distribution and link to use in fitting etc. See glm and family for more details. Where the family is

data

A data frame containing the model response variable and covariates required by the formula. If this is missing then the frame from which gam was called is searched for the variables specified in the formula.

weights

prior weights on the data.

control

A list as returned by gam.control, with five user controllable elements: maxit controls maximum iterations in gam.fit, convergence tolerance in gam.fit is controlled by epsilon

scale

If this is zero then GCV is used for all distributions except Poisson, binomial and negative binomial where UBRE is used with scale parameter assumed to be 1. If this is greater than 1 it is assumed to be the scale parameter/variance and UBRE is used. If

Value

The function returns an object of class "gam" which has the following elements:
coefficientsthe coefficients of the fitted model. Parametric coefficients are first, followed by coefficients for each spline term in turn.
residualsthe deviance residuals for the fitted model.
fitted.valuesfitted model predictions of expected value for each datum.
familyfamily object specifying distribution and link used.
linear.predictorfitted model prediction of link function of expected value for each datum.
deviance(unpenalized)
null.deviancedeviance for single parameter model.
df.nullnull degrees of freedom
iternumber of iterations of IRLS taken to get convergence.
weightsfinal weights used in IRLS iteration.
prior.weightsprior weights on observations.
df.nullnumber of data
yresponse data.
convergedindicates whether or not the iterative fitting method converged.
sig2estimated or supplied variance/scale parameter.
edfestimated degrees of freedom for each smooth.
boundarydid parameters end up at boundary of parameter space?
spsmoothing parameter for each smooth.
dfnumber of knots for each smooth (one more than maximum degrees of freedom).
nsdfnumber of parametric, non-smooth, model terms including the intercept.
Vpestimated covariance matrix for parameter estimators.
covariate.shiftcovariates get shifted so that they are centred around zero - this is by how much.
xpknot locations for each cubic regression spline based smooth. xp[i,] are the locations for the ith smooth.
UZarray storing the matrices for transforming from t.p.r.s. basis to equivalent t.p.s. basis - see GAMsetup for details of how the matrices are packed in this array.
XuThe set of unique covariate locations used to define t.p.s. from which t.p.r.s. basis was derived. Again see GAMsetup for details of the packing algorithm.
xu.lengthThe number of unique covariate combinations in the data.
formulathe model formula.
full.formulathe model formula with each smooth term fully expanded and with option arguments given explicitly (i.e. not with reference to other variables) - useful for later prediction from the model.
xparametric design matrix columns (including intercept) followed by the data that form arguments of the smooths.
s.typetype of spline basis used: 0 for conventional cubic regression spline, 1 for t.p.r.s.
p.orderthe order of the penalty used for each term. 0 signals auto-selection.
dimnumber of covariates of which term is a function
calla mode call object containing the call to gam() that produced this gam object (useful for constructing model frames).
mgcv.convA list of smoothing parameter convergence diagnostics, with the following elements (irrelevant for models with only one smoothing parameter to estimate): g{the gradient of the GCV/UBRE score w.r.t. the smoothing parameters at termination.} h{the second derivatives corresponding to g above - i.e. the leading diagonal of the Hessian.} e{the eigen-values of the Hessian. These should all be non-negative!} iter{the number of iterations taken.} in.ok{TRUE if the second smoothing parameter guess improved the GCV/UBRE score. (Please report examples where this is FALSE)} step.fail{TRUE if the algorithm terminated by failing to improve the GCV/UBRE score rather than by "converging". Not necessarily a problem, but check the above derivative information quite carefully.}

WARNINGS

The code does not check for rank defficiency of the model matrix -it will likely just fail instead!

You must have more unique combinations of covariates than the model has total parameters. (Total parameters is sum of basis dimensions plus sum of non-spline terms less the number of spline terms).

Automatic smoothing parameter selection is not likely to work well when fitting models to very few response data.

Relative scaling of covariates to a multi-dimensional smooth term has an affect on the results: make sure that relative scalings are sensible. For example, measuring one spatial co-ordinate in millimetres and the other in lightyears will usually produce poor results.

With large datasets (more than a few thousand data) the "tp" basis gets very slow to use. In this case use "cr" for 1-d smooths. If you need to use multi-dimensional terms with large datasets and find gam too slow, please let me know - and I'll up the priority for fixing this!

Details

Two alternative bases are available for representing model terms. Univariate smooth terms can be represented using conventional cubic regression splines - which are very efficient computationally - or thin plate regression splines. Multivariate terms must be represented using thin plate regression splines. For either basis the user specifies the dimension of the basis for each smooth term. The dimension of the basis is one more than the maximum degrees of freedom that the term can have, but usually the term will be fitted by penalized maximum likelihood estimation and the actual degrees of freedom will be chosen by GCV. However, the user can choose to fix the degrees of freedom of a term, in which case the actual degrees of freedom will be one less than the basis dimension.

Thin plate regression splines are constructed by starting with the basis for a full thin plate spline and then truncating this basis in an optimal manner, to obtain a low rank smoother. Details are given in Wood (MS submitted). One key advantage of the approach is that it avoids the knot placement problems of conventional regression spline modelling, but it also has the advantage that smooths of lower rank are nested within smooths of higher rank, so that it is legitimate to use conventional hypothesis testing methods to compare models based on pure regression splines. In the case of the cubic regression spline basis, knots of the spline are placed evenly throughout the covariate values to which the term refers: For example, if fitting 101 data with an 11 knot spline of x then there would be a knot at every 10th (ordered) x value. The parameterization used represents the spline in terms of its values at the knots. Connection of these values at neighbouring knots by sections of cubic polynomial constrainted to join at the knots so as to be continuous up to and including second derivative yields a natural cubic spline through the values at the knots (given two extra conditions specifying that the second derivative of the curve should be zero at the two end knots). This parameterization gives the parameters a nice interpretability.

Given a basis for each smooth term, it easy to obtain a wiggliness penalty for each, and to construct a penalized likelihood, which balances the fit of the overall model against it's complexity. This consists of the log likelihood for the model minus a sum of wiggliness penalties (one for each smooth) each multiplied by a smoothing parameter. The smoothing parameters control the trade-off between fit and smoothness. So, the gam fitting problem has become a penalized glm fitting problem, which can be fitted using a slight modification of glm.fit : gam.fit. The penalized glm approach also allows smoothing parameters for all smooth terms to be selected simultaneously by GCV or UBRE. This is achieved as part of fitting by calling mgcv within gam.fit. Details of the GCV/UBRE minimization method are given in Wood (2000).

References

Hastie and Tibshirani (1990) Generalized Additive Models. Chapman and Hall.

Green and Silverman (1994) Nonparametric Regression and Generalized Linear Models. Chapman and Hall. Gu and Wahba (1991) Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM J. Sci. Statist. Comput. 12:383-398

Wood (2000) Modelling and Smoothing Parameter Estimation with Multiple Quadratic Penalties. JRSSB 62(2):413-428

Wood (2001) mgcv:GAMs and Generalized Ridge Regression for R. R News 1(2):20-25 Wood (MS submitted) Thin Plate Regression Splines

Wahba (1990) Spline Models of Observational Data. SIAM

http://www.ruwpa.st-and.ac.uk/simon.html

Examples

Run this code

library(mgcv)
set.seed(1)
n<-400
sig2<-4
x0 <- runif(n, 0, 1)
x1 <- runif(n, 0, 1)
x2 <- runif(n, 0, 1)
x3 <- runif(n, 0, 1)
pi <- asin(1) * 2
f <- 2 * sin(pi * x0)
f <- f + exp(2 * x1) - 3.75887
f <- f + 0.2 * x2^11 * (10 * (1 - x2))^6 + 10 * (10 * x2)^3 * (1 - x2)^10 - 1.396
e <- rnorm(n, 0, sqrt(abs(sig2)))
y <- f + e
b<-gam(y~s(x0)+s(x1)+s(x2)+s(x3))
plot(b,pages=1)
# now a GAM with 3df regression spline term & 2 penalized terms
b1<-gam(y~s(x0,k=4,fx=TRUE,bs="tp")+s(x1,k=12)+s(x2,15))
plot(b1,pages=1)
# now fit a 2-d term to x0,x1
b3<-gam(y~s(x0,x1)+s(x2)+s(x3))
par(mfrow=c(2,2))
plot(b3)
par(mfrow=c(1,1))
# now simulate poisson data
g<-exp(f/5)
y<-rpois(rep(1,n),g)
b2<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=poisson)
plot(b2,pages=1)
# negative binomial data
set.seed(1)
y<-rnbinom(g,size=2,mu=g)
b3<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=neg.binom)
plot(b3,pages=1)
# and a pretty 2-d smoothing example....
test1<-function(x,z,sx=0.3,sz=0.4)  
{ (pi**sx*sz)*(1.2*exp(-(x-0.2)^2/sx^2-(z-0.3)^2/sz^2)+
  0.8*exp(-(x-0.7)^2/sx^2-(z-0.8)^2/sz^2))
}
n<-500
old.par<-par(mfrow=c(2,2))
x<-runif(n);z<-runif(n);
xs<-seq(0,1,length=30);zs<-seq(0,1,length=30)
pr<-data.frame(x=rep(xs,30),z=rep(zs,rep(30,30)))
truth<-matrix(test1(pr$x,pr$z),30,30)
contour(xs,zs,truth)
y<-test1(x,z)+rnorm(n)*0.1
b4<-gam(y~s(x,z))
fit1<-matrix(predict.gam(b4,pr,se=FALSE),30,30)
contour(xs,zs,fit1)
persp(xs,zs,truth)
persp(b4)
par(old.par)

Run the code above in your browser using DataLab