optilevels: Optimal Number of Factor Levels

Description

Finds the optimal number of factor levels given the data and a model using a likelihood-based agglomerative algorithm.

Usage

optilevels(y, x, z = NULL, alpha = 0, dist = "gaussian", ...)
# S3 method for optilevels
bestmodel(object, ...)

Value

An object of class 'optilevels' that is a list with the following elements:

"delta": delta IC values along the selection path considering best models.

"ic"

IC values along the selection path considering best models.

"coef"

matrix of coefficients (linear predictor scale) corresponding to argument x along the selection path considering best models.

"zcoef"

matrix of coefficients (linear predictor scale) corresponding to argument z when not NULL along the selection path considering best models, or NULL.

"rank"

matrix ranks based on the coefficients along the selection path considering best models. Ranking uses the default ties.method = "average" in rank.

"deltalist"

delta IC values along the selection path considering all competing models.

"iclist"

IC values along the selection path considering all competing models.

"coeflist"

matrix of coefficients (linear predictor scale) corresponding to argument x along the selection path considering all competing models.

"zcoeflist"

matrix of coefficients (linear predictor scale) corresponding to argument z when not NULL along the selection path considering all competing models, or NULL.

"ranklist"

matrix ranks based on the coefficients along the selection path considering all competing models.

"levels"

list of (merged) factor levels along the selection path considering best models.

"Y"

vector of observations (argument y).

"X"

design matrix component corresponding to argument x.

"Z"

design matrix component corresponding to argument z.

"alpha"

weighting argument.

"dist"

distribution argument.

"factor"

logical, indicating if argument x is a factor (TRUE) or a matrix (FALSE).

bestmodel returns the best supported model for further manipulation (e.g. prediction).

Arguments

y: vector of observations.
x: a factor or a matrix of proportions (i.e. the values 0 and 1 should have consistent meaning across the columns, often through a unit sum constraint). It is the user's responsibility to ensure that values supplied for x are sensible. x is not expected to include an intercept.
z: a design matrix with predictor variables besides the one(s) defined via the argument x. It is the user's responsibility to ensure that values supplied for z are sensible and it also makes sense to bind x and z together. Variables in z should be centered (mean 0) (and possibly normalized by SD), because the design matrix from x is not expected to include an intercept.
alpha: numeric [0-1], weighting factor for calculating information criteria for model selection (i.e. IC = (1-alpha)*AIC + alpha*BIC, also referred to as CAIC: consistent AIC).
dist: character, distribution argument passed to underlying functions, see listed on the help page of opticut (except for dist = "zip2", dist = "zinb2" dist = "rsf", and dist = "rspf").
object: fitted object.
...: other arguments passed to the underlying functions, see opticut1.

Author

Peter Solymos <psolymos@gmail.com>

Examples

Run this code

## --- Factor levels with Gaussian distribution
## simple example from Legendre 2013
## Indicator Species: Computation, in
## Encyclopedia of Biodiversity, Volume 4
## https://dx.doi.org/10.1016/B978-0-12-384719-5.00430-5
gr <- as.factor(paste0("X", rep(1:5, each=5)))
spp <- cbind(Species1=rep(c(4,6,5,3,2), each=5),
    Species2=c(rep(c(8,4,6), each=5), 4,4,2, rep(0,7)),
    Species3=rep(c(18,2,0,0,0), each=5))
rownames(spp) <- gr
## must add some noise to avoid perfect fit
spp[6, "Species1"] <- 7
spp[1, "Species3"] <- 17
spp

ol <- optilevels(spp[,"Species3"], gr)
ol[c("delta", "coef", "rank", "levels")]

## get the final factor level
gr1 <- gr
levels(gr1) <- ol$level[[length(ol$level)]]
table(gr, gr1)

## compare the models
o0 <- lm(spp[,"Species3"] ~ gr - 1)
o1 <- lm(spp[,"Species3"] ~ gr1 - 1)
data.frame(AIC(o0, o1), delta=AIC(o0, o1)$AIC - AIC(o0))
ol$delta # should be identical

## --- Proportions with Poisson distribution
## simulation
set.seed(123)
n <- 500 # number of observations
k <- 5 # number of habitat types
b <- c(-1, -0.2, -0.2, 0.5, 1)
names(b) <- LETTERS[1:k]
x <- replicate(k, exp(rnorm(n)))
x <- x / rowSums(x) # proportions
X <- model.matrix(~.-1, data=data.frame(x))
lam <- exp(drop(crossprod(t(X), b)))
y <- rpois(n, lam)

z <- optilevels(y, x, dist="poisson")

## best model refit
bestmodel(z)

## estimates
plogis(z$coef)
plogis(b)
## optimal classification
z$rank

## get the final matrix
x1 <- mefa4::groupSums(x, 2, z$levels[[length(z$levels)]])
head(x)
head(x1)

## compare the models
m0 <- glm(y ~ x - 1, family="poisson")
m1 <- glm(y ~ x1 - 1, family="poisson")
data.frame(AIC(m0, m1), delta=AIC(m0, m1)$AIC - AIC(m0))
z$delta # should be identical

if (FALSE) {
## dolina example with factor
data(dolina)
dolina$samp$stratum <- as.integer(dolina$samp$stratum)
y <- dolina$xtab[dolina$samp$method == "Q", "ppyg"]
x <- dolina$samp$mhab[dolina$samp$method == "Q"]
z <- scale(model.matrix(~ stratum + lmoist - 1,
    dolina$samp[dolina$samp$method == "Q",]))

## without additional covariates
dol1 <- optilevels(y, x, z=NULL, dist="poisson")
dol1$rank
summary(bestmodel(dol1))

## with additional covariates
dol2 <- optilevels(y, x, z, dist="poisson")
dol2$rank
summary(bestmodel(dol2))

## compare the two models
AIC(bestmodel(dol1), bestmodel(dol2))
}