
bigglm creates a generalized linear model object that uses only p^2 memory for p variables.
bigglm(formula, data, family=gaussian(), ...)

# S3 method for data.frame
bigglm(formula, data, ..., chunksize=5000)

# S3 method for function
bigglm(formula, data, family=gaussian(),
       weights=NULL, sandwich=FALSE, maxit=8, tolerance=1e-7,
       start=NULL, quiet=FALSE, ...)

# S3 method for RODBC
bigglm(formula, data, family=gaussian(),
       tablename, ..., chunksize=5000)

# S4 method for ANY,DBIConnection
bigglm(formula, data, family=gaussian(),
       tablename, ..., chunksize=5000)

# S3 method for bigglm
vcov(object, dispersion=NULL, ...)

# S3 method for bigglm
deviance(object, ...)

# S3 method for bigglm
family(object, ...)

# S3 method for bigglm
AIC(object, ..., k=2)
formula: A model formula
data: See Details below. Method dispatch is on this argument.
family: A glm family object
chunksize: Size of chunks for processing the data frame
weights: A one-sided, single-term formula specifying weights
sandwich: TRUE to compute the Huber/White sandwich covariance matrix (uses p^4 memory rather than p^2)
maxit: Maximum number of Fisher scoring iterations
tolerance: Tolerance for change in coefficients (as a multiple of the standard error)
start: Optional starting values for the coefficients. If NULL, maxit should be at least 2, as some quantities will not be computed on the first iteration.
object: A bigglm object
dispersion: Dispersion parameter, or NULL to estimate it
tablename: For the SQLiteConnection method, the name of a SQL table, or a string specifying a join or nested select
k: Penalty per parameter for AIC
quiet: When FALSE, warn if the fit did not converge
...: Additional arguments

bigglm returns an object of class bigglm.
The data argument may be a function, a data frame, or a SQLiteConnection or RODBC connection object.

When it is a function, the function must take a single argument reset. When this argument is FALSE it returns a data frame with the next chunk of data, or NULL if no more data are available. When reset=TRUE it indicates that the data should be reread from the beginning by subsequent calls. The chunks need not be the same size or in the same order when the data are reread, but the same data must be provided in total. The bigglm.data.frame method gives an example of how such a function might be written; another is in the Examples below.
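For illustration, here is a minimal sketch of such a function that serves an in-memory data frame in chunks; the function name and default chunk size are arbitrary, and this is not part of the package itself:

make.df.chunks <- function(df, chunksize = 5000){
  pos <- 1
  function(reset = FALSE){
    if (reset){
      pos <<- 1                          # subsequent calls start from the first row again
      return(NULL)
    }
    if (pos > nrow(df)) return(NULL)     # no more data
    chunk <- df[pos:min(pos + chunksize - 1, nrow(df)), , drop = FALSE]
    pos <<- pos + chunksize
    chunk
  }
}
# e.g. bigglm(log(Volume) ~ log(Girth), data = make.df.chunks(trees, 10))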
The model formula must not contain any data-dependent terms, as these will not be consistent when updated. Factors are permitted, but the levels of the factor must be the same across all data chunks (empty factor levels are ok). Offsets are allowed (since version 0.8).
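For example, a chunk-reading function can enforce consistent factor coding by fixing the levels explicitly in every chunk; the column name and levels below are purely illustrative:

all_levels <- c("petrol", "diesel", "electric")           # hypothetical factor levels
recode_chunk <- function(chunk){
  chunk$fuel <- factor(chunk$fuel, levels = all_levels)   # same levels in every chunk
  chunk
}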
The SQLiteConnection and RODBC methods load only the variables needed for the model, not the whole table. The code in the SQLiteConnection method should work for other DBI connections, but I do not have any of these to check it with.
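As a sketch of the DBIConnection method (assuming the RSQLite package; the database file, table name and column names here are hypothetical):

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "large.db")
fit <- bigglm(y ~ x1 + x2, data = con, family = gaussian(),
              tablename = "mydata", chunksize = 5000)
summary(fit)
dbDisconnect(con)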
Algorithm AS274, Applied Statistics (1992) Vol. 41, No. 2.
See also biglm and glm.
data(trees)
ff <- log(Volume) ~ log(Girth) + log(Height)

# fit in chunks of 10 rows, also computing the sandwich covariance estimate
a <- bigglm(ff, data = trees, chunksize = 10, sandwich = TRUE)
summary(a)

# the same model with an offset (offsets are supported since version 0.8)
gg <- log(Volume) ~ log(Girth) + log(Height) + offset(2*log(Girth) + log(Height))
b <- bigglm(gg, data = trees, chunksize = 10, sandwich = TRUE)
summary(b)
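## The extractor methods documented above can be applied to the fitted
## objects (illustrative only; output not shown):
vcov(a)         # covariance matrix of the coefficient estimates
deviance(a)
family(a)
AIC(a, k = 2)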
## requires internet access
make.data <- function(urlname, chunksize, ...){
  conn <- NULL
  function(reset = FALSE){
    if (reset){
      # (re)open the connection so the data can be reread from the start
      if (!is.null(conn)) close(conn)
      conn <<- url(urlname, open = "r")
    } else {
      # read the next chunk; return NULL when the data are exhausted
      rval <- read.table(conn, nrows = chunksize, ...)
      if (nrow(rval) == 0){
        close(conn)
        conn <<- NULL
        rval <- NULL
      }
      return(rval)
    }
  }
}
airpoll <- make.data("http://faculty.washington.edu/tlumley/NO2.dat",
                     chunksize = 150,
                     col.names = c("logno2", "logcars", "temp", "windsp",
                                   "tempgrad", "winddir", "hour", "day"))

b <- bigglm(exp(logno2) ~ logcars + temp + windsp,
            data = airpoll, family = Gamma(log),
            start = c(2, 0, 0, 0), maxit = 10)
summary(b)