Learn R Programming

opticut (version 0.1-3)

opticut: Optimal Binary Response Model

Description

The functions fits the multi-level response model for each species by finding the best binary partition based on model selection. Possibly controlling for modifying/confounding variables. The general algorithm is described in Kemencei et al. 2014.

Usage

opticut1(Y, X, Z, dist = "gaussian", sset=NULL, ...)

opticut(...) # S3 method for default opticut(Y, X, strata, dist = "gaussian", comb = c("rank", "all"), sset=NULL, cl = NULL, ...) # S3 method for formula opticut(formula, data, strata, dist = "gaussian", comb = c("rank", "all"), sset=NULL, cl = NULL, ...)

fix_levels(x, sep = "_") strata(object, ...) # S3 method for opticut strata(object, ...)

# S3 method for opticut bestmodel(object, which = NULL, ...) # S3 method for opticut bestpart(object, pos_only = FALSE, ...) # S3 method for opticut getMLE(object, which, vcov=FALSE, ...) # S3 method for opticut subset(x, subset=NULL, ...) # S3 method for opticut fitted(object, ...) # S3 method for opticut predict(object, gnew=NULL, xnew=NULL, ...)

wplot(x, ...) # S3 method for opticut1 wplot(x, cut, ylim = c(-1, 1), las=1, ylab = "Model weight * Association", xlab = "Partitions", theme, mar = c(5, 4, 4, 4) + 0.1, bty = "o", ...) # S3 method for opticut wplot(x, which = NULL, cut, sort, las = 1, ylab = "Model weight * Association", xlab = "Partitions", theme, mar = c(5, 4, 4, 4) + 0.1, bty = "o", ...) # S3 method for opticut plot(x, which = NULL, cut, sort, las, ylab = "Relative abundance", xlab = "Strata", show_I = TRUE, show_S = TRUE, hr = TRUE, tick = TRUE, theme, mar = c(5, 4, 4, 4) + 0.1, bty = "o", lower = 0, upper = 1, pos = 0, horizontal=TRUE, ...)

# S3 method for opticut1 print(x, cut, sort, digits, ...) # S3 method for opticut print(x, digits, ...) # S3 method for summary.opticut print(x, cut, sort, digits, ...) # S3 method for opticut summary(object, ...)

# S3 method for opticut as.data.frame(x, row.names = NULL, optional = FALSE, cut, sort, ...) # S3 method for summary.opticut as.data.frame(x, row.names = NULL, optional = FALSE, cut, sort, ...)

Value

opticut1 returns an object of class opticut1, that is a modified data frame with additional attributes.

opticut returns an object of class opticut, that is a list with the following components:

"call"

the function call.

"species"

a list of species specific opticut1 objects.

"X"

modifying variables as model matrix.

"Y"

response, single species vector or matrix.

"strata"

defines the partitions.

"nobs"

sample size.

"sset"

subset, if specified.

"nsplit"

number of binary splits considered.

"dist"

distribution.

"comb"

combination type.

"failed"

IDs for failed species models dropped from results list.

"collapse"

character used for combining partition labels.

fix_levels returns a factor with modified levels.

The strata method extracts the strata argument as factor. The method finds unique row combinations when custom matrix is supplied for strata.

The print and summary methods are called for their side effects. The summary shows the following information: best supported split, strength and sign of association, indicator potential (I), expected values (mu0, mu1), log likelihood ratio (logLR), and model weights(w).

The subset method subsets the species in the opticut object.

The plot method presents the contrasts by species and strata.

The wplot (weight plot) shows model weights for partitions.

bestpart returns a matrix with the best supported partitions for each species (samples and rows, species as columns).

bestmodel returns the best supported model for further manipulation (e.g. prediction). Note: custom distribution functions are designed to return only point estimates, thus the best model cannot be returned. In this case, use the best partition returned by bestpart to refit the model. getMLE returns a named list corresponding to the best supported model. The list has the following elements: coef is the Maximum Likelihood Estimate (MLE), vcov is the variance-covariance matrix for the MLE or NULL, dist is the distribution inherited from input object.

fitted returns expected values on the predictor scale for the observations as a matrix (number of observations by number of species). predict returns fitted values when both gnew

and xnew are NULL, or corresponding point predictions (expected values) on the predictor scale (available for comb = "rank" models only).

The coercion methods as.data.frame return a data frame.

Arguments

formula

two sided model formula, response species data (matrix, or possible a vector for single species case) in the left-hand side, model terms for modifying effects in the right-hand side (its structure depending on the underlying functions). For example, in the most basic Gaussian case it can be y ~ 1 (no modifying variables) or y ~ x (with modifying variables). Centering the modifying terms (or choosing the origin wisely) is generally recommended (especially for Gaussian distribution where linear predictors are additive on the response scale) because the relative abundance contrast is estimated at the origin (0).

data

an optional data frame, list or environment containing the variables in the model. If not found in data, the variables are taken from parent.frame(), typically the environment from which opticut is called.

strata

vector (usually a factor), unique values define partitions (must have at least 2 unique levels, empty levels are dropped). It can also be a matrix with rows as observations and binary partitions as columns.

dist

character or function, a distribution to fit. If character, it can follow one of these patterns: "family", or "family:link" when appropriate (there is a link argument in the underlying function, or the link can be specified via the family argument). See Details and Examples.

comb

character, how to define the binary partitions. "rank" uses rankComb, "all" uses allComb.

sset

an optional vector specifying a subset of observations (rows) to be used in the fitting process. NULL means no subset taken.

cl

a cluster object, or an integer for multiple cores in parallel computations (integer value for forking is ignored on Windows).

Y

numeric vector of observations for opticut1, vector or community matrix for opticut.default.

X

numeric, design matrix. Can be missing, in which case an intercept-only model is assumed.

Z

factor (must have at least 2 unique levels, this triggers rankComb), or a design matrix (custom matrix or as returned by allComb.

x, object

object to plot, print, summarize. For fix_levels it needs to be a factor.

cut

log likelihood ratio value to be used as a cut-off for showing species whose log likelihood ratio is not less than the cut-off.

sort

logical value indicating if species/partitions should be meaningfully sorted, the default is TRUE. It can take numeric value when only species (1) or partitions (2) are to be sorted (1:2 is equivalent to TRUE).

show_I

logical, if indicator potential (I) should be shown.

show_S

logical, if number of indicator species should be shown.

hr, tick

logical, if horizontal rules (hr) and ticks to the axis legends (tick) should be added. Default is TRUE for both.

theme

color theme as defined by occolors.

mar

numeric, graphical parameters for plot margin par.

ylab, xlab, las, ylim

graphical arguments, see plot. By default, las is 1 when horizontal = TRUE and 2 when horizontal = FALSE.

bty

Character, determines the type of box which is drawn around plots, see par.

lower, upper

numeric (between 0 and 1), lower is the minimum and upper is the maximum height for rectangles drawn in the plot. Both need to be in [0, 1] and higher cannot be smaller than lower.

pos

numeric, position of rectangles in the plot relative to the baseline. Value must be in the [-1, 1] range (below vs. above baseline).

horizontal

logical, plot orientation: species as rows (TRUE) or as columns (FALSE).

digits

numeric, number of significant digits in output.

which

numeric or character (can be a vector) defining a subset of species from the fitted object, or NULL (all species, default).

sep

a character string to separate the sub-strings in factor levels.

row.names

NULL or a character vector giving the row names for the data frame. Missing values are not allowed. See as.data.frame.

optional

logical. If TRUE, setting row names and converting column names (to syntactic names: see make.names) is optional. See as.data.frame.

pos_only

logical, best partition normally returns the original variable without recognizing the direction of the association. pos_only = TRUE returns values where negative associations are taken into account and 1 indicates strata of positive association. This is only important when comb is not "rank".

subset

logical, numeric, or character index indicating species to keep, missing values are not accepted. The default NULL returns the original object without subsetting.

vcov

logical, if variance-covariance matrix is to be returned.

gnew, xnew

new values for strata and modifiers (right-hand-side of formula) to predict for, or NULL. Predicting for new strata available for comb = "rank" models only.

...

other arguments passed to the underlying functions.

Warning

The use of the opticut1 function is generally discouraged: some of the internal checks are not guaranteed to flag issues when the formula-to-model-matrix translation is side-stepped (this is what is happening when the modifier variables are supplied as X argument in opticut1). Use the opticut with a single species instead.

Author

Peter Solymos <psolymos@gmail.com> and Ermias T. Azeria

Details

Currently available distributions:

"gaussian"

real valued continuous observations, e.g. biomass, uses lm of the stats package. Identity link is assumed. Centering modified variables is generally advised to avoid negative expected values when the response is nonnegative.

"poisson"

Poisson count data, uses glm of the stats package. Exponential (log) link is assumed.

"binomial"

presence-absence (detection-nondetection) type data, uses glm of the stats package. Logistic (logit) link is assumed.

"negbin"

overdispersed Negative Binomial count data, uses glm.nb of the MASS package. Exponential (log) link is assumed.

"beta"

continuous response in the unit interval (0-1), e.g. percent cover, uses betareg of the betareg package. Logistic (logit) link for the mean model is assumed.

"zip"

zero-inflated Poisson counts, indicative properties are tested as part of the abundance model, uses zeroinfl of the pscl package. Exponential (log) link is used for count based analysis, the second part of the dist argument following the colon is used as link function for the zero component (logistic link assumed).

"zinb"

zero-inflated Negative Binomial counts, indicative properties are tested as part of the abundance model, uses zeroinfl of the pscl package. The zero-inflation component refers to the probability of 0. Exponential (log) link is used for count based analysis, the second part of the dist argument following the colon is used as link function for the zero component (logistic link assumed).

"zip2"

zero-inflated Poisson counts, indicative properties are tested as part of the zero-model, uses zeroinfl of the pscl package. The zero-inflation component refers to the probability of 1 to be consistent with other methods regarding positive and negative effects. Logistic (logit) link is assumed for zero-nonzero based analysis, only symmetric link functions (logit, probit) allowed. Exponential (log) link is used for the count data part which cannot be changed.

"zinb2"

zero-inflated Negative Binomial counts, indicative properties are tested as part of the zero-model, uses zeroinfl of the pscl package. The zero-inflation component refers to the probability of 1 to be consistent with other methods regarding positive and negative effects. Logistic (logit) link is assumed for zero-nonzero based analysis, only symmetric link functions (logit, probit) allowed. Exponential (log) link is used for the count data part which cannot be changed.

"rsf"

presence-only data using resource selection functions (RSF) as explained in rsf in the ResourceSelection package, assuming global availability (m = 0). The "rsf" works only for single species using opticut1 because 'presence-only' type data cannot be kept in a single matrix-like object for multiple species. Intercept only model (i.e. no modifier variables in right-hand-side of the formula) is accepted for "rsf". Exponential (log) link is assumed.

"rspf"

presence-only data using resource selection probability functions (RSPF) as explained in rspf in the ResourceSelection package, assuming global availability (m = 0). The "rspf" works only for single species using opticut1 because 'presence-only' type data cannot be kept in a single matrix-like object for multiple species. Intercept only model is not accepted for "rspf", need to have at least one continuous modifier variable for identifiability (see Solymos & Lele 2016). Logistic (logit) link is assumed.

Custom distributions can be defined, see Examples. Note: not all downstream algorithms and methods work with custom distributions.

fix_levels is a utility function for replacing characters in factor levels that are identical to the value of the getOption("ocoptions")$collapse value. This case can lead to an error when specifying the strata argument, and the fix_levels can help.

References

Kemencei, Z., Farkas, R., Pall-Gergely, B., Vilisics, F., Nagy, A., Hornung, E. & Solymos, P. (2014): Microhabitat associations of land snails in forested dolinas: implications for coarse filter conservation. Community Ecology 15:180--186. <doi:10.1556/ComEc.15.2014.2.6>

Solymos, P. & Lele, S. R. (2016): Revisiting resource selection probability functions and single-visit methods: clarification and extensions. Methods in Ecology and Evolution 7:196--205. <doi:10.1111/2041-210X.12432>

See Also

allComb, and rankComb for partitioning algorithms.

beta2i for indicator potential (I) calculations in summaries.

bestmodel, bestpart, and uncertainty for manipulating fitted objects.

ocoptions on how to set some of the global options related to the presentation of the results in the package and how errors encountered during model fitting are handled.

multicut for multinomial-response model, optilevels for finding the optimal number of factor levels.

Examples

Run this code
## --- Gaussian
## simple example from Legendre 2013
## Indicator Species: Computation, in
## Encyclopedia of Biodiversity, Volume 4
## https://dx.doi.org/10.1016/B978-0-12-384719-5.00430-5
gr <- as.factor(paste0("X", rep(1:5, each=5)))
spp <- cbind(Species1=rep(c(4,6,5,3,2), each=5),
    Species2=c(rep(c(8,4,6), each=5), 4,4,2, rep(0,7)),
    Species3=rep(c(18,2,0,0,0), each=5))
rownames(spp) <- gr
## must add some noise to avoid perfect fit
spp[6, "Species1"] <- 7
spp[1, "Species3"] <- 17
spp

## all partitions
summary(ocall <- opticut(spp ~ 1, strata=gr, dist="gaussian", comb="all"))
summary(opticut(spp, strata=gr, dist="gaussian", comb="all")) # alternative

## rank based partitions
summary(ocrank <- opticut(spp ~ 1, strata=gr, dist="gaussian", comb="rank"))
summary(opticut(spp, strata=gr, dist="gaussian", comb="rank")) # alternative

## --- Binomial
## simulated binary data
set.seed(1234)
n <- 200
x0 <- sample(1:4, n, TRUE)
x1 <- ifelse(x0 <= 2, 1, 0)
x2 <- rnorm(n, 0.5, 1)
p1 <- plogis(-0.5 + 2*x1 + -0.8*x2)
Y1 <- rbinom(n, 1, p1)
p2 <- plogis(-0.1 + 2*ifelse(x0==4,1,0) + -0.8*x2)
Y2 <- rbinom(n, 1, p2)
p3 <- plogis(-0.1 + -0.8*x2)
Y3 <- rbinom(n, 1, p3)
Y <- cbind(SPP1=Y1, SPP2=Y2, SPP3=Y3)
X <- model.matrix(~x2)

## all partitions, single species
Z <- allComb(x0)
opticut1(Y1, X, Z, dist="binomial")

## rank based partitions, single species
opticut1(Y1, X, as.factor(x0), dist="binomial")

## all partitions, multiple species
(m1 <- opticut(Y ~ x2, strata=x0, dist="poisson", comb="all"))
summary(m1)
## show all species
summary(m1, cut=0)
## plot best partitions and indicator values
plot(m1)
## model weights for all species
wplot(m1)
## different ways of plotting weights for single species
wplot(m1$species[[1]])
wplot(m1, which = 1)

## rank based partitions, multiple species
summary(m2 <- opticut(Y ~ x2, strata=x0, dist="poisson", comb="rank"))
## subset results
summary(subset(m2, 1:2))

## best partition
head(bestpart(m2))

## best model
mods <- bestmodel(m2)
mods
## explore further
confint(mods[[1]])

## MLE and variance-covariance matrix (species 1)
getMLE(m2, which=1, vcov=TRUE)

## fitted values
head(fitted(m2))
## prediction for new data
head(predict(m2, gnew=x0, xnew=data.frame(x2=x2)))

if (FALSE) {
## --- Zero-inflated Negative Binomial
## dolina example
data(dolina)
## stratum as ordinal
dolina$samp$stratum <- as.integer(dolina$samp$stratum)
## filter species to speed up things a bit
Y <- dolina$xtab[,colSums(dolina$xtab > 0) >= 20]
## opticut results, note the cloglog link function
dol <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="zinb:cloglog")
summary(dol)
## vertical plot orientation
plot(dol, horizontal=FALSE, pos=1, upper=0.8)

## parallel computing comparisons
library(parallel)
cl <- makeCluster(2)
## sequential, all combinations (2^(K-1) - 1)
system.time(opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="zinb", comb="all", cl=NULL))
## sequential, rank based combinations (K - 1)
system.time(opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="zinb", comb="rank", cl=NULL))
## parallel, all combinations (2^(K-1) - 1)
system.time(opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="zinb", comb="all", cl=cl))
## parallel, rank based combinations (K - 1)
system.time(opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist="zinb", comb="rank", cl=cl))
stopCluster(cl)

## --- Customizing distributions
## we may want to expand the Zero-inflation component in a ZIP model
## see how the return value needs to be structured
fun <- function(Y, X, linkinv, zi_term, ...) {
    X <- as.matrix(X)
    mod <- pscl::zeroinfl(Y ~ X-1 | zi_term, dist = "poisson", ...)
    list(coef=coef(mod),
        logLik=logLik(mod),
        linkinv=mod$linkinv)
}
Xdol <- model.matrix(~ stratum + lmoist + method, data=dolina$samp)
## this fits the null model (i.e. no partitions added)
fun(Y[,"amin"], Xdol, zi_term=dolina$samp$method)
## now we can use dist=fun
opticut1(Y[,"amin"], Xdol, Z=dolina$samp$mhab,
    dist=fun, zi_term=dolina$samp$method)
dol2 <- opticut(Y ~ stratum + lmoist + method, data=dolina$samp,
    strata=dolina$samp$mhab, dist=fun, zi_term=dolina$samp$method)
summary(dol2)
}

## current collapse value
getOption("ocoptions")$collapse
## factor levels sometimes need to be manipulated
## before feeding it to opticut
fix_levels(as.factor(c("A b", "C d")), sep=":")
fix_levels(as.factor(c("A b", "C d")), sep="")

Run the code above in your browser using DataLab