ssden: Estimating Probability Density Using Smoothing Splines

Description

Estimate probability densities using smoothing spline ANOVA models. The symbolic model specification via formula follows the same rules as in lm, but with the response missing.

Usage

ssden(formula, type=NULL, data=list(), alpha=1.4, weights=NULL,
      subset, na.action=na.omit, id.basis=NULL, nbasis=NULL, seed=NULL,
      domain=as.list(NULL), quad=NULL, qdsz.depth=NULL, bias=NULL,
      prec=1e-7, maxiter=30, skip.iter=FALSE)
ssden1(formula, type=NULL, data=list(), alpha=1.4, weights=NULL,
       subset, na.action=na.omit, id.basis=NULL, nbasis=NULL, seed=NULL,
       domain=as.list(NULL), quad=NULL, prec=1e-7, maxiter=30)

Value

ssden returns a list object of class "ssden".

ssden1 returns a list object of class

c("ssden1","ssden").

dssden and cdssden can be used to evaluate the estimated joint density and conditional density;

pssden, qssden, cpssden, and cqssden can be used to evaluate (conditional) cdf and quantiles.

The method project.ssden can be used to calculate the Kullback-Leibler projection of "ssden" objects for model selection; project.ssden1 can be used to calculate the square error projection of "ssden1" objects.

Arguments

formula: Symbolic description of the model to be fit.
type: List specifying the type of spline for each variable. See mkterm for details.
data: Optional data frame containing the variables in the model.
alpha: Parameter defining cross-validation score for smoothing parameter selection.
weights: Optional vector of bin-counts for histogram data.
subset: Optional vector specifying a subset of observations to be used in the fitting process.
na.action: Function which indicates what should happen when the data contain NAs.
id.basis: Index of observations to be used as "knots."
nbasis: Number of "knots" to be used. Ignored when id.basis is specified.
seed: Seed to be used for the random generation of "knots." Ignored when id.basis is specified.
domain: Data frame specifying marginal support of density.
quad: Quadrature for calculating integral. Mandatory if variables other than factors or numerical vectors are involved.
qdsz.depth: Depth to be used in smolyak.quad for the generation of quadrature.
bias: Input for sampling bias.
prec: Precision requirement for internal iterations.
maxiter: Maximum number of iterations allowed for internal iterations.
skip.iter: Flag indicating whether to use initial values of theta and skip theta iteration. See ssanova for notes on skipping theta iteration.

Author

Chong Gu, chong@stat.purdue.edu

Details

The model specification via formula is for the log density. For example, ~x1*x2 prescribes a model of the form $$ log f(x1,x2) = g_{1}(x1) + g_{2}(x2) + g_{12}(x1,x2) + C $$ with the terms denoted by "x1", "x2", and "x1:x2"; the constant is determined by the fact that a density integrates to one.

The selective term elimination may characterize (conditional) independence structures between variables. For example, ~x1*x2+x1*x3 yields the conditional independence of x2 and x3 given x1.

Parallel to those in a ssanova object, the model terms are sums of unpenalized and penalized terms. Attached to every penalized term there is a smoothing parameter, and the model complexity is largely determined by the number of smoothing parameters.

The selection of smoothing parameters is through a cross-validation mechanism described in the references, with a parameter alpha; alpha=1 is "unbiased" for the minimization of Kullback-Leibler loss but may yield severe undersmoothing, whereas larger alpha yields smoother estimates.

A subset of the observations are selected as "knots." Unless specified via id.basis or nbasis, the number of "knots" $q$ is determined by $max(30,10n^{2/9})$, which is appropriate for the default cubic splines for numerical vectors.

References

Gu, C. and Wang, J. (2003), Penalized likelihood density estimation: Direct cross-validation and scalable approximation. Statistica Sinica, 13, 811--826.

Gu, C., Jeon, Y., and Lin, Y. (2013), Nonparametric density estimation in high dimensions. Statistica Sinica, 23, 1131--1153.

Gu, C. (2013), Smoothing Spline ANOVA Models (2nd Ed). New York: Springer-Verlag.

Gu, C. (2014), Smoothing Spline ANOVA Models: R Package gss. Journal of Statistical Software, 58(5), 1-25. URL http://www.jstatsoft.org/v58/i05/.

Examples

Run this code

## 1-D estimate: Buffalo snowfall
data(buffalo)
buff.fit <- ssden(~buffalo,domain=data.frame(buffalo=c(0,150)))
plot(xx<-seq(0,150,len=101),dssden(buff.fit,xx),type="l")
plot(xx,pssden(buff.fit,xx),type="l")
plot(qq<-seq(0,1,len=51),qssden(buff.fit,qq),type="l")
## Clean up
if (FALSE) rm(buffalo,buff.fit,xx,qq)
dev.off()

## 2-D with triangular domain: AIDS incubation
data(aids)
## rectangular quadrature
quad.pt <- expand.grid(incu=((1:40)-.5)/40*100,infe=((1:40)-.5)/40*100)
quad.pt <- quad.pt[quad.pt$incu<=quad.pt$infe,]
quad.wt <- rep(1,nrow(quad.pt))
quad.wt[quad.pt$incu==quad.pt$infe] <- .5
quad.wt <- quad.wt/sum(quad.wt)*5e3
## additive model (pre-truncation independence)
aids.fit <- ssden(~incu+infe,data=aids,subset=age>=60,
                  domain=data.frame(incu=c(0,100),infe=c(0,100)),
                  quad=list(pt=quad.pt,wt=quad.wt))
## conditional (marginal) density of infe
jk <- cdssden(aids.fit,xx<-seq(0,100,len=51),data.frame(incu=50))
plot(xx,jk$pdf,type="l")
## conditional (marginal) quantiles of infe (TIME-CONSUMING)
if (FALSE) {
cqssden(aids.fit,c(.05,.25,.5,.75,.95),data.frame(incu=50))
}
## Clean up
if (FALSE) rm(aids,quad.pt,quad.wt,aids.fit,jk,xx)
dev.off()

## One factor plus one vector
data(gastric)
gastric$trt
fit <- ssden(~futime*trt,data=gastric)
## conditional density
cdssden(fit,c("1","2"),cond=data.frame(futime=150))
## conditional quantiles
cqssden(fit,c(.05,.25,.5,.75,.95),data.frame(trt=as.factor("1")))
## Clean up
if (FALSE) rm(gastric,fit)

## Sampling bias
## (X,T) is truncated to Tt)&(x<1)
  while(m<-sum(!ok)) {
    t[!ok] <- runif(m)
    x[!ok] <- rnorm(m,.5,.15)
    ok <- (x>t)&(x<1)
  }
  cbind(x,t)
}
xt <- rbias(100)
x <- xt[,1]; t <- xt[,2]
## length-biased
bias1 <- list(t=1,wt=1,fun=function(t,x){x[,]})
fit1 <- ssden(~x,domain=list(x=c(0,1)),bias=bias1)
plot(xx<-seq(0,1,len=101),dssden(fit1,xx),type="l")
## truncated
bias2 <- list(t=t,wt=rep(1/100,100),fun=function(t,x){x[,]>t})
fit2 <- ssden(~x,domain=list(x=c(0,1)),bias=bias2)
plot(xx,dssden(fit2,xx),type="l")
## Clean up
if (FALSE) rm(rbias,xt,x,t,bias1,fit1,bias2,fit2)

Run the code above in your browser using DataLab