gam() fits a generalized additive model; it is not a clone of what S-PLUS provides. Smooth terms are represented using penalized regression splines, with smoothing parameters selected by GCV/UBRE, or by regression splines with fixed degrees of freedom (mixtures of the two are permitted). Multi-dimensional smooths are available using penalized thin plate regression splines, but the user must make sure that covariates are sensibly scaled relative to each other when using such terms. For a general overview see Wood (2001). For more on specifying models see gam.models. For more on model selection see gam.selection.

Usage

gam(formula, family=gaussian(), data=list(), weights=NULL,
    control=gam.control, scale=0, knots=NULL, sp=NULL)
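For example, a minimal call might look like the following (an illustrative sketch with simulated data; setting scale=0 with a gaussian family treats the scale as unknown so that GCV is used, while supplying sp fixes the smoothing parameter instead of estimating it):

library(mgcv)
set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200) * 0.2
b1 <- gam(y ~ s(x), data = dat, scale = 0)    # smoothing parameter by GCV
b2 <- gam(y ~ s(x), data = dat, sp = c(0.01)) # smoothing parameter fixed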
Arguments

formula: A GAM formula (see also gam.models). This is exactly like the formula for a glm except that smooth terms can be added to the right hand side of the formula (and a formula of the form y ~ . is not allowed).

data: A data frame containing the model response variable and covariates required by the formula. If missing, the environment from which gam was called is searched for the variables specified in the formula.

control: A list of fit control parameters returned by gam.control, with five user controllable elements: maxit controls the maximum number of iterations in gam.fit, and epsilon controls the convergence tolerance in gam.fit (see gam.control for the remaining elements).
knots: An optional list of user supplied knot values to be used for basis construction. For the "cr" basis the user simply supplies the knots to be used, and there must be the same number as the basis dimension, k, for the term.

Value

The function returns an object of class "gam", which has the following elements (among others):

xp: a matrix of covariate locations parameterizing the smooths: xp[i,] + covariate.shift[i] gives the locations for the ith smooth.
by: an array of by variables (i.e. covariates that multiply a smooth term): by[i,j] is the jth value for the ith by variable. There are only as many rows of this array as there are by variables in the model (often 0). The rownames of by give the by variable names.

by.exists: an array of logicals: by.exists[i] is TRUE if the ith smooth has a by variable associated with it, FALSE otherwise.

call: an object containing the call to gam() that produced this gam object (useful for constructing model frames).

Also returned is a list of convergence diagnostics relating to smoothing parameter estimation (useful in conjunction with gam.check), with the following elements (some of which are irrelevant for models with only one smoothing parameter to estimate):
- the leading diagonal of the Hessian of the GCV/UBRE score (the second derivatives corresponding to the score gradient);
- a logical: TRUE if the second smoothing parameter guess improved the GCV/UBRE score;
- a logical: TRUE if the algorithm terminated by failing to improve the GCV/UBRE score rather than by "converging". Not necessarily a problem, but check the above derivative information quite carefully.

WARNINGS

You must have more unique combinations of covariates than the model has total parameters. (Total parameters is the sum of basis dimensions plus the sum of non-spline terms, less the number of spline terms.)
Automatic smoothing parameter selection is not likely to work well when fitting models to very few response data.
Relative scaling of the covariates of a multi-dimensional smooth term has an effect on the results: make sure that relative scalings are sensible. For example, measuring one spatial co-ordinate in millimetres and the other in lightyears will usually produce poor results.
With large datasets (more than a few thousand data) the "tp"
basis gets very slow to use: use the knots
argument as discussed above and
shown in the examples. Alternatively, for 1-d smooths you can use the "cr"
basis.
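As a rough sketch of both workarounds (simulated data; names invented for illustration):

library(mgcv)
set.seed(2)
n <- 5000
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n) * 0.3
bc <- gam(y ~ s(x, bs = "cr", k = 20))  # "cr" basis: cheap for large n, 1-d only
# user supplied knots: for the "cr" basis there must be exactly k of them
bk <- gam(y ~ s(x, bs = "cr", k = 10), knots = list(x = seq(0, 1, length = 10)))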
Details

Thin plate regression splines are constructed by starting with the
basis for a full thin plate spline and then truncating this basis in
an optimal manner, to obtain a low rank smoother. Details are given in
Wood (2003). One key advantage of the approach is that it avoids
the knot placement problems of conventional regression spline
modelling, but it also has the advantage that smooths of lower rank
are nested within smooths of higher rank, so that it is legitimate to
use conventional hypothesis testing methods to compare models based on
pure regression splines. The t.p.r.s. basis can become expensive to
calculate for large datasets. In this case the user can supply a reduced
set of knots to use in basis construction (see knots, in the argument list).
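For instance, the nesting property can be used as follows (a sketch with simulated data; anova is applied in the usual way to the unpenalized, fixed degrees of freedom fits):

library(mgcv)
set.seed(3)
x <- runif(200)
y <- sin(3 * x) + x + rnorm(200) * 0.2
m5  <- gam(y ~ s(x, k = 5,  fx = TRUE))  # pure regression spline, rank 5
m10 <- gam(y ~ s(x, k = 10, fx = TRUE))  # rank 10 basis nests the rank 5 basis
anova(m5, m10, test = "F")               # conventional nested-model test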
In the case of the cubic regression spline basis, knots of the spline are placed evenly
throughout the covariate values to which the term refers: For
example, if fitting 101 data with an 11 knot spline of x
then
there would be a knot at every 10th (ordered) x
value. The
parameterization used represents the spline in terms of its
values at the knots. The values at neighbouring knots
are connected by sections of cubic polynomial constrained to be
continuous up to and including second derivative at the knots. The resulting curve
is a natural cubic spline through the values at the knots (given two extra conditions specifying
that the second derivative of the curve should be zero at the two end
knots). This parameterization gives the parameters a straightforward interpretation: each is simply the value of the spline at one knot.
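This can be checked directly (a sketch; the xp element name is taken from recent versions of mgcv and may differ elsewhere):

library(mgcv)
set.seed(4)
x <- runif(101)
y <- sin(2 * pi * x) + rnorm(101) * 0.2
b <- gam(y ~ s(x, bs = "cr", k = 11))
b$smooth[[1]]$xp                # the 11 knot locations (element name assumed)
sort(x)[seq(1, 101, by = 10)]   # every 10th ordered x value - should match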
Given a basis for each smooth term, it is easy to obtain a wiggliness
penalty for each, and to construct a penalized likelihood, which
balances the fit of the overall model against its
complexity. This consists of the log likelihood for the model minus a
sum of wiggliness penalties (one for each smooth) each multiplied by a
smoothing parameter. The smoothing parameters control the trade-off
between fit and smoothness.
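For the gaussian identity-link case the penalized objective is just ||y - Xb||^2 + lambda * b'Sb, where X is the basis model matrix and S the wiggliness penalty matrix. A sketch of this computation, using smoothCon (a helper in current mgcv, assumed here) to obtain X and S for one term:

library(mgcv)
set.seed(5)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100) * 0.3
sm <- smoothCon(s(x, bs = "cr", k = 10), data = data.frame(x = x))[[1]]
X <- sm$X          # basis model matrix
S <- sm$S[[1]]     # wiggliness penalty matrix
lambda <- 0.1      # smoothing parameter: larger values force smoother fits
beta <- solve(t(X) %*% X + lambda * S, t(X) %*% y)  # penalized least squares
fit <- X %*% beta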
So, the gam fitting problem has become a penalized glm fitting problem, which can be fitted using a slight modification of glm.fit: gam.fit. The penalized glm approach also allows smoothing parameters for all smooth terms to be selected simultaneously by GCV or UBRE. This is achieved as part of fitting by calling mgcv within gam.fit.
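The selected smoothing parameters can be inspected after fitting, e.g. (element names as in recent mgcv, assumed here):

library(mgcv)
set.seed(6)
x0 <- runif(300); x1 <- runif(300)
y <- sin(2 * pi * x0) + 2 * x1 + rnorm(300) * 0.3
b <- gam(y ~ s(x0) + s(x1))
b$sp        # one GCV/UBRE selected smoothing parameter per smooth term
b$gcv.ubre  # the minimized GCV/UBRE score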
Details of the GCV/UBRE minimization method are given in Wood (2000): the basis of the approach
is to alternate efficient global optimization with respect to one overall smoothing
parameter with Newton updates of a set of relative smoothing parameters for each smooth term.
References

Green and Silverman (1994) Nonparametric Regression and Generalized Linear Models. Chapman and Hall.

Gu and Wahba (1991) Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM J. Sci. Statist. Comput. 12:383-398.

Wood (2000) Modelling and smoothing parameter estimation with multiple quadratic penalties. JRSSB 62(2):413-428.

Wood (2001) mgcv: GAMs and generalized ridge regression for R. R News 1(2):20-25.

Wood (2003) Thin plate regression splines. JRSSB 65(1):95-114.

Wahba (1990) Spline Models of Observational Data. SIAM.
See Also

gam.models, s, predict.gam, plot.gam, summary.gam, gam.side.conditions, gam.selection, mgcv
Examples

library(mgcv)
set.seed(0)
n<-400
sig2<-4
# simulate four covariates on [0,1]; x3 has no effect on the response
x0 <- runif(n, 0, 1)
x1 <- runif(n, 0, 1)
x2 <- runif(n, 0, 1)
x3 <- runif(n, 0, 1)
pi <- asin(1) * 2
# build the true additive function f from smooth components of x0, x1, x2
f <- 2 * sin(pi * x0)
f <- f + exp(2 * x1) - 3.75887
f <- f + 0.2 * x2^11 * (10 * (1 - x2))^6 + 10 * (10 * x2)^3 * (1 - x2)^10 - 1.396
e <- rnorm(n, 0, sqrt(sig2))
y <- f + e
b<-gam(y~s(x0)+s(x1)+s(x2)+s(x3))
summary(b)
plot(b,pages=1)
# now a GAM with 3df regression spline term & 2 penalized terms
b0<-gam(y~s(x0,k=4,fx=TRUE,bs="tp")+s(x1,k=12)+s(x2,k=15))
plot(b0,pages=1)
# now fit a 2-d term to x0,x1
b1<-gam(y~s(x0,x1)+s(x2)+s(x3))
par(mfrow=c(2,2))
plot(b1)
par(mfrow=c(1,1))
# now simulate poisson data
g<-exp(f/5)
y<-rpois(n,g)
b2<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=poisson)
plot(b2,pages=1)
# negative binomial data
y<-rnbinom(n,size=2,mu=g)
# negative binomial family with theta=2 (negbin() from current mgcv assumed)
b3<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=negbin(2))
plot(b3,pages=1)
# and a pretty 2-d smoothing example....
test1<-function(x,z,sx=0.3,sz=0.4)
{ (pi^sx*sz)*(1.2*exp(-(x-0.2)^2/sx^2-(z-0.3)^2/sz^2)+
0.8*exp(-(x-0.7)^2/sx^2-(z-0.8)^2/sz^2))
}
n<-500
old.par<-par(mfrow=c(2,2))
x<-runif(n);z<-runif(n);
xs<-seq(0,1,length=30);zs<-seq(0,1,length=30)
pr<-data.frame(x=rep(xs,30),z=rep(zs,rep(30,30)))
truth<-matrix(test1(pr$x,pr$z),30,30)
contour(xs,zs,truth)
y<-test1(x,z)+rnorm(n)*0.1
b4<-gam(y~s(x,z))
fit1<-matrix(predict(b4,newdata=pr),30,30)
contour(xs,zs,fit1)
persp(xs,zs,truth)
persp(b4)
par(old.par)
# very large dataset example using knots
n<-10000
x<-runif(n);z<-runif(n);
y<-test1(x,z)+rnorm(n)
ind<-sample(1:n,1000,replace=FALSE)
b5<-gam(y~s(x,z,k=50),knots=list(x=x[ind],z=z[ind]))
persp.gam(b5)
# and a pure "knot based" spline of the same data
b6<-gam(y~s(x,z,k=100),knots=list(x= rep((1:10-0.5)/10,10),
z=rep((1:10-0.5)/10,rep(10,10))))
persp.gam(b6)