glmulti: Automated model selection and multimodel inference with (G)LMs

Description

glmulti finds what are the n best models (the confidence set of models) among all possible models (the candidate set, as specified by the user). Models are fitted with the specified fitting function (default is glm) and are ranked with the specified Information Criterion (default is aicc). The best models are found either through exhaustive screening of the candidates, or using a genetic algorithm, which allows very large candidate sets to be adressed. The output can be used for model selection, variable selection, and multimodel inference.

Usage

#  glmulti S4 generic 
glmulti(y, xr, data, exclude = c(), name = "glmulti.analysis", 
		intercept = TRUE, marginality = FALSE,
		bunch=30, chunk = 1, chunks = 1, level = 2,
		minsize = 0, maxsize = -1, minK = 0, maxK = -1,
		method = "h", crit = "aic", confsetsize = 100, popsize = 100, 
		mutrate = 10^-3, sexrate = 0.1, imm = 0.3, plotty = TRUE,
		report = TRUE, deltaM = 0.05, deltaB = 0.05, conseq = 5,
		fitfunction = "glm", resumefile = "id", includeobjects=TRUE, ...)

Arguments

A formula, character string, or fitted model (of class lm or glm) specifying the response variable and the terms (main effects and/or interactions) to be used in the candidate models (e.g. height~age*sex+mass). Alternatively, a character string naming the variable to be used as response (e.g. "height") (in which case the names of the predictors must be passed through the xr argument) Alternatively, a custom list of (fitted) model objects can also be passed (can be convenient for small candidate sets).

An optional character array specifying the variables (categorical or quantitative) to be used as predictors, e.g. c("age", "height" , "mass")

exclude

Optional character vector naming terms (main effects or interactions) to be excluded from the candidate models, e.g. c("mass:height")

intercept

Whether to include an intercept in the candidate models or not.

level

If 1, only main effects (terms of order 1) are used to build the candidate set. If 2, pairwise interactions are also used (higher order interactions are currently ignored)

data

A data.frame containing the data. If not specified, glmulti will try to find the data in the environment of the formula, from the fitted model passed as y argument, or from the global environment.

name

The name of this glmulti analysis. Optional.

marginality

Whether to apply the marginality rule or not. If TRUE, only marginal models will be considered.

minsize

This sets a constraint on candidate models. Minimal number of TERMS (main effects or interactions) to be included in candidate models (negative = no constraint)

maxsize

This sets a constraint on candidate models. Maximal number of TERMS to be included in candidate models (negative = no constraint)

minK

This sets a constraint on candidate models. Minimal complexity of candidate models (negative = no constraint)

maxK

This sets a constraint on candidate models. Maximal complexity of candidate models (negative = no constraint)

method

The method to be used to explore the candidate set of models. If "h" an exhaustive screening is undertaken. If "g" the genetic algorithm is employed (recommended for large candidate sets). If "l", a very fast exhaustive branch-and-bound algorithm is used. Package leaps must then be loaded, and this can only be applied to linear models with covariates and no interactions. If "d", a simple summary of the candidate set is printed, including the number of candidate models.

crit

The Information Criterion to be used. This should be a function that accepts a fitted model as first argument. Default is the original Akaike IC (aic). Other provided functions are the Bayes IC (bic), the small-sample corrected AIC (aicc) and QAIC/QAICc (qaic and qaicc).

fitfunction

The fitting function to be used. Any function similar to glm can be used. See Examples

confsetsize

The number of models to be looked for, i.e. the size of the returned confidence set.

plotty

Whether to plot the progress of the IC profile when running.

report

Whether to report about the progress at run time.

bunch

The number of model formulas to be returned (to be fitted) at each call to the enumerator. Exhaustive screening only.

chunk

When using an exhaustive screening approach, it can be splitted in several parts to take advantage of multiple CPUs. chunk is an integer specifying which part the current call should perform.

chunks

When splitting an exhaustive screening, the total number of parts the task should be divided into. For example, with a quad-core processor, 4 may be useful. Use consensus to bring back the pieces into a single object.

popsize

The population size for the genetic algorithm

mutrate

The per locus (i.e. per term) mutation rate for genetic algorithm, between 0 and 1

sexrate

The rate of sexual reproduction for the genetic algorithm, between 0 and 1

imm

The rate of immigration for the genetic algorithm, between 0 and 1

deltaM

The target change in mean IC (defines the stop rules for the genetic algorithm)

deltaB

The target change in best IC (defines the stop rules for the genetic algorithm)

conseq

The target successive number of times with no improvement (i.e. target changes have been attained) (defines the stop rule for the GA). The greater it is, the more stringent the stop rule.

resumefile

When resuming an analysis (method="r"), the name of the files from which to resume. Default uses the same as name

includeobjects

Whether or not to include fiited models as objects. This makes coef and predict faster and is very convenient, but can be turned off in case fitted models are very large or are not to be used after.

…

Further arguments to be passed to the fitting function. E.g. maxit=50 or family=binomial

Value

An object of class glmulti is returned. It is a S4 object with several slots containing relevant data for model selection and beyond.

Several standard S3 functions are provided to help access the content of this object.

Several glmulti objects can be shrunk to one using the function consensus. This is useful to get the best of several replicates (of the genetic algorithm) or to bring together the different parts of a splitted exhaustive screening. When running a genetic algorithm, two tiny java files (serialized objects) are also written to the disk at regular intervals. They can be used to resume the calculation (method="r") if it was interrupted for any reason. This can also be used to continue a GA with modified parameters (e.g. mutation rate).

Details

glmulti is defined as a S4 function. It acts as a frontend that calls background compiled functions (contained if archive glmulti.jar). Running the function therefore requires a Java Running Environment, and package rJava. A thorough description of this function and package can be found in the article by Calcagno and de Mazancourt (see References). print.glmulti and summary.glmulti are S3 methods which provide a synthesis of glmulti analyses.

NOTE: When calling glmulti with a model object as y, only the formula will be extracted from the object. This means that optional arguments to the fitting function (e.g. family or maxit) will NOT be extracted. These arguments should be passed to glmulti through the ….

References

Buckland (1997) Model Selection: an Integral Part of Inference. Biometrics 10:41 Burnham & Anderson (2002) Model Selection and Multimodel Inference: an Information Theoretic Approach Calcagno \& de Mazancourt 2010 J. Stat. Soft. v34 i12. See http://www.jstatsoft.org/v34/i12

Examples

Run this code

# NOT RUN {
# See the document "glmulti.pdf" included in the package.
# It explains the general approach and shows how to use 
# glmulti with mixed models from the lme4 package.
# Other examples:
# A. This shows how to do the same for zero-inflated poisson models
# we load the required package
library(pscl)
# a random vector of count data
round(runif(100, 0,20)*round(runif(100)))-> vy2
# dummy predictors
va = runif(100)
vb = runif(100)
# 1. The wrapper function
zeroinfl.glmulti=function(formula, data, inflate = "|1",...)  {
    zeroinfl(as.formula(paste(deparse(formula), inflate)),data=data,...)
} 
# The default getfit and aicc method will work for zeroinfl objects,
# so no need to redefine them
# we can proceed directly
glmulti(vy2~va*vb,fitfunc=zeroinfl.glmulti,inflate="|1")->bab


# B. This shows how to include some terms in ALL the models
# As above, we just prepare a wrapper of the fitting function
glm.redefined = function(formula, data, always="", ...) {
glm(as.formula(paste(deparse(formula), always)), data=data, ...)
}
# we then use this fitting function in glmulti
glmulti(vy2~va,level=1,fitfunc=glm.redefined,always="+vb")-> bab
# va will be shuffled but vb is always included in the models

# this procedure allows support of arbitrarily any fitting function, 
# or the use of sophisticated constraints on the model structure

# }

Run the code above in your browser using DataLab