gssanova: Fitting Smoothing Spline ANOVA Models with Non-Gaussian Responses

Description

Fit smoothing spline ANOVA models in non-Gaussian regression. The symbolic model specification via formula follows the same rules as in lm and glm.

Usage

gssanova(formula, family, type=NULL, data=list(), weights, subset,
         offset, na.action=na.omit, partial=NULL, alpha=NULL, nu=NULL,
         id.basis=NULL, nbasis=NULL, seed=NULL, random=NULL,
         skip.iter=FALSE)

Value

gssanova returns a list object of class

c("gssanova","ssanova").

The method summary.gssanova can be used to obtain summaries of the fits. The method predict.ssanova can be used to evaluate the fits at arbitrary points along with standard errors, on the link scale. The method

project.gssanova can be used to calculate the Kullback-Leibler projection for model selection. The methods

residuals.gssanova and fitted.gssanova

extract the respective traits from the fits.

Arguments

formula: Symbolic description of the model to be fit.
family: Description of the error distribution. Supported are exponential families "binomial", "poisson", "Gamma", "inverse.gaussian", and "nbinomial". Also supported are accelerated life model families "weibull", "lognorm", and "loglogis". Further more, proportional odds logistic regression "polr" for ordinal response is also supported.
type: List specifying the type of spline for each variable. See mkterm for details.
data: Optional data frame containing the variables in the model.
weights: Optional vector of weights to be used in the fitting process.
subset: Optional vector specifying a subset of observations to be used in the fitting process.
offset: Optional offset term with known parameter 1.
na.action: Function which indicates what should happen when the data contain NAs.
partial: Optional symbolic description of parametric terms in partial spline models.
alpha: Tuning parameter defining cross-validation; larger values yield smoother fits. Defaults are alpha=1 for family="binomial" and alpha=1.4 otherwise.
nu: Inverse scale parameter in accelerated life model families. Ignored for exponential families.
id.basis: Index designating selected "knots".
nbasis: Number of "knots" to be selected. Ignored when id.basis is supplied.
seed: Seed for reproducible random selection of "knots". Ignored when id.basis is supplied.
random: Input for parametric random effects in nonparametric mixed-effect models. See mkran for details.
skip.iter: Flag indicating whether to use initial values of theta and skip theta iteration. See ssanova for notes on skipping theta iteration.

Responses

For family="binomial", the response can be specified either as two columns of counts or as a column of sample proportions plus a column of total counts entered through the argument weights, as in glm.

For family="nbinomial", the response may be specified as two columns with the second being the known sizes, or simply as a single column with the common unknown size to be estimated through the maximum likelihood.

For family="weibull", "lognorm", or "loglogis", the response consists of three columns, with the first giving the follow-up time, the second the censoring status, and the third the left-truncation time. For data with no truncation, the third column can be omitted.

For family="polr", the response should be an ordered factor.

Details

The model specification via formula is intuitive. For example, y~x1*x2 yields a model of the form $$ y = C + f_{1}(x1) + f_{2}(x2) + f_{12}(x1,x2) + e $$ with the terms denoted by "1", "x1", "x2", and "x1:x2".

The model terms are sums of unpenalized and penalized terms. Attached to every penalized term there is a smoothing parameter, and the model complexity is largely determined by the number of smoothing parameters.

Only one link is implemented for each family. It is the logit link for "binomial", and the log link for "poisson", and "Gamma". For "nbinomial", the working parameter is the logit of the probability $p$; see NegBinomial. For "weibull", "lognorm", and "loglogis", it is the location parameter for the log lifetime.

The selection of smoothing parameters is through direct cross-validation. The cross-validation score used for family="poisson" is taken from density estimation as in Gu and Wang (2003), and those used for other families are derived following the lines of Gu and Xiang (2001).

A subset of the observations are selected as "knots." Unless specified via id.basis or nbasis, the number of "knots" $q$ is determined by $max(30,10n^{2/9})$, which is appropriate for the default cubic splines for numerical vectors.

References

Gu, C. and Xiang, D. (2001), Cross validating non Gaussian data: generalized approximate cross validation revisited. Journal of Computational and Graphical Statistics, 10, 581--591.

Gu, C. and Wang, J. (2003), Penalized likelihood density estimation: Direct cross-validation and scalable approximation. Statistica Sinica, 13, 811--826.

Gu, C. (2013), Smoothing Spline ANOVA Models (2nd Ed). New York: Springer-Verlag.

Gu, C. (2014), Smoothing Spline ANOVA Models: R Package gss. Journal of Statistical Software, 58(5), 1-25. URL http://www.jstatsoft.org/v58/i05/.

Examples

Run this code

## Fit a cubic smoothing spline logistic regression model
test <- function(x)
        {.3*(1e6*(x^11*(1-x)^6)+1e4*(x^3*(1-x)^10))-2}
x <- (0:100)/100
p <- 1-1/(1+exp(test(x)))
y <- rbinom(x,3,p)
logit.fit <- gssanova(cbind(y,3-y)~x,family="binomial")
## The same fit
logit.fit1 <- gssanova(y/3~x,"binomial",weights=rep(3,101),
                       id.basis=logit.fit$id.basis)
## Obtain estimates and standard errors on a grid
est <- predict(logit.fit,data.frame(x=x),se=TRUE)
## Plot the fit and the Bayesian confidence intervals
plot(x,y/3,ylab="p")
lines(x,p,col=1)
lines(x,1-1/(1+exp(est$fit)),col=2)
lines(x,1-1/(1+exp(est$fit+1.96*est$se)),col=3)
lines(x,1-1/(1+exp(est$fit-1.96*est$se)),col=3)

## Fit a mixed-effect logistic model
data(bacteriuria)
bact.fit <- gssanova(infect~trt+time,family="binomial",data=bacteriuria,
                     id.basis=(1:820)[bacteriuria$id%in%c(3,38)],random=~1|id)
## Predict fixed effects
predict(bact.fit,data.frame(time=2:16,trt=as.factor(rep(1,15))),se=TRUE)
## Estimated random effects
bact.fit$b

## Clean up
if (FALSE) rm(test,x,p,y,logit.fit,logit.fit1,est,bacteriuria,bact.fit)
dev.off()

Run the code above in your browser using DataLab