brmsformula: Set up a model formula for use in the brms package

Description

Set up a model formula for use in the brms package, allowing the definition of (potentially non-linear) additive multilevel models for all parameters of the assumed response distribution.

Usage

brmsformula(formula, ..., nonlinear = NULL)

Arguments

formula

An object of class formula (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under 'Details'.

...

Additional formula objects to specify predictors of special model parts and auxiliary parameters. Formulas can either be named directly or contain names on their left-hand side. Currently, the following names are accepted: sigma (residual standard deviation of the gaussian and student families); shape (shape parameter of the Gamma, weibull, negbinomial and related zero-inflated / hurdle families); nu (degrees of freedom parameter of the student family); phi (precision parameter of the beta and zero_inflated_beta families); zi (zero-inflation probability); hu (hurdle probability). All auxiliary parameters are modeled on the log or logit scale so that they stay within their valid intervals after back-transformation.

nonlinear

An optional list of formulas, specifying linear models for non-linear parameters. If NULL (the default), formula is treated as an ordinary formula. If not NULL, formula is treated as a non-linear model and nonlinear should contain a formula for each non-linear parameter, with the parameter on the left-hand side and its linear predictor on the right-hand side. Alternatively, it can be a single formula with all non-linear parameters on the left-hand side (separated by a +) and a common linear predictor on the right-hand side. More information is given under 'Details'.

Value

An object of class brmsformula, which inherits from class formula but contains additional attributes.

Details

The formula argument accepts formulae of the following syntax:

response | addition ~ Pterms + (Gterms | group)

The Pterms part contains effects that are assumed to be the same across observations. We call them 'population-level' effects or (adopting frequentist vocabulary) 'fixed' effects. The optional Gterms part may contain effects that are assumed to vary across the grouping variables specified in group. We call them 'group-level' effects or (adopting frequentist vocabulary) 'random' effects, although the latter name is misleading in a Bayesian context (for more details type vignette("brms")). Multiple grouping factors, each with multiple group-level effects, are possible. Instead of | you may use || in grouping terms to prevent correlations from being modeled. Alternatively, it is possible to model different group-level terms of the same grouping factor as correlated (even across different formulae, e.g. in non-linear models) by using |<ID>| instead of |. All group-level terms sharing the same ID will be modeled as correlated. If, for instance, one specifies the terms (1+x|2|g) and (1+z|2|g) somewhere in the formulae passed to brmsformula, correlations between the corresponding group-level effects will be estimated.

Smoothing terms can be modeled using the s and t2 functions of the mgcv package in the Pterms part of the model formula. This makes it possible to fit generalized additive mixed models (GAMMs) with brms. The implementation is similar to that used in the gamm4 package. For more details on this model class see gam and gamm.

The Pterms and Gterms parts may contain two non-standard effect types, namely monotonic and category specific effects, which can be specified using terms of the form monotonic() and cse() respectively. The latter can only be applied in ordinal models and is explained in more detail in the package's vignette (type vignette("brms")). The former effect type is explained here.
A monotonic predictor must either be integer valued or an ordered factor, which is the first difference from an ordinary continuous predictor. More importantly, predictor categories (or integers) are not assumed to be equidistant with respect to their effect on the response variable. Instead, the distance between adjacent predictor categories (or integers) is estimated from the data and may vary across categories. This is realized by parameterizing as follows: one parameter takes care of the direction and size of the effect, similar to an ordinary regression parameter, while an additional parameter vector estimates the normalized distances between consecutive predictor categories. A main application of monotonic effects is ordinal predictors, which can be modeled this way without (falsely) treating them as continuous or as unordered categorical predictors.

A further special feature of the syntax is the optional addition term, which may contain multiple terms of the form fun(variable) separated by +, each providing special information on the response variable. fun can be replaced with either se, weights, disp, trials, cat, cens, or trunc. Their meanings are explained below.

For families gaussian and student, it is possible to specify standard errors of the observations, which makes it possible to perform meta-analysis. Suppose that the variable yi contains the effect sizes from the studies and sei the corresponding standard errors. Then, fixed and random effects meta-analyses can be conducted using the formulae yi | se(sei) ~ 1 and yi | se(sei) ~ 1 + (1|study), respectively, where study is a variable uniquely identifying every study. If desired, meta-regression can be performed via yi | se(sei) ~ 1 + mod1 + mod2 + (1|study) or yi | se(sei) ~ 1 + mod1 + mod2 + (1 + mod1 + mod2|study), where mod1 and mod2 represent moderator variables.

For all families, weighted regression may be performed using weights in the addition part.
Internally, this is implemented by multiplying the log-posterior values of each observation by their corresponding weights. Suppose that variable wei contains the weights and that yi is the response variable. Then, formula yi | weights(wei) ~ predictors implements a weighted regression.

The addition argument disp (short for dispersion) serves a similar purpose to weights. However, it has a different implementation and is less general, as it is only usable for the families gaussian, student, lognormal, Gamma, weibull, and negbinomial. For the families gaussian, student, and lognormal, the residual standard deviation sigma is multiplied by the values given in disp, so that higher values lead to lower weights. Conversely, for the families Gamma, weibull, and negbinomial, the parameter shape is multiplied by the values given in disp. As shape can be understood as a precision parameter (inverse of the variance), higher values will lead to higher weights in this case.

For families binomial and zero_inflated_binomial, addition should contain a variable indicating the number of trials underlying each observation. In lme4 syntax, we may write for instance cbind(success, n - success), which is equivalent to success | trials(n) in brms syntax. If the number of trials is constant across all observations (say 10), we may also write success | trials(10).

For all ordinal families, addition may contain a term cat(number) to specify the number of categories (e.g., cat(7)). If not given, the number of categories is calculated from the data.

With the exception of categorical and ordinal families, left, right, and interval censoring can be modeled through y | cens(censored) ~ predictors. The censoring variable (named censored in this example) should contain the values 'left', 'none', 'right', and 'interval' (or equivalently -1, 0, 1, and 2) to indicate that the corresponding observation is left censored, not censored, right censored, or interval censored.
For interval censored data, a second variable (let's call it y2) has to be passed to cens. In this case, the formula has the structure y | cens(censored, y2) ~ predictors. For interval censored data, the lower bounds are given in y and the upper bounds in y2. Intervals are assumed to be open on the left and closed on the right: (y, y2].

With the exception of categorical and ordinal families, the response distribution can be truncated using the trunc function in the addition part. If the response variable is truncated between, say, 0 and 100, we can specify this via yi | trunc(lb = 0, ub = 100) ~ predictors. Instead of numbers, variables in the data set can also be passed, allowing for varying truncation points across observations. Defining only one of the two arguments in trunc leads to one-sided truncation.
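The weighting mechanism described above (each observation's log-posterior contribution multiplied by its weight) can be illustrated with a minimal base R sketch. This is not brms code: the data, the weights, and the fixed values of mu and sigma are hypothetical, and only the Gaussian log-likelihood part of the log-posterior is shown.

```r
# Hypothetical data: three observations and their weights (illustrative names).
yi  <- c(1.2, 0.4, 2.1)
wei <- c(1, 2, 0.5)

# Assumed parameter values at which the log-likelihood is evaluated.
mu <- 1
sigma <- 1

# Pointwise Gaussian log-likelihood contributions.
loglik <- dnorm(yi, mean = mu, sd = sigma, log = TRUE)

# Weighted total: every contribution is multiplied by its weight.
weighted_loglik <- sum(wei * loglik)

# A weight of 2 acts like observing the second data point twice,
# and a weight of 0.5 halves the contribution of the third one.
duplicated_loglik <-
  sum(dnorm(c(1.2, 0.4, 0.4), mean = mu, sd = sigma, log = TRUE)) +
  0.5 * dnorm(2.1, mean = mu, sd = sigma, log = TRUE)

all.equal(weighted_loglik, duplicated_loglik)
```

Under these assumptions, integer weights behave like replicated observations, while fractional weights down-weight an observation's influence.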

Multiple addition terms may be specified at the same time using the + operator, for instance formula = yi | se(sei) + cens(censored) ~ 1 for a censored meta-analytic model.

For families gaussian and student, multivariate models may be specified using cbind notation. In brms 1.0.0, the multivariate 'trait' syntax was removed from the package as it repeatedly confused users, required much special case coding, and was hard to maintain. The new syntax is described below. Suppose that y1 and y2 are response variables and x is a predictor. Then cbind(y1,y2) ~ x specifies a multivariate model. The effects of all terms specified on the RHS of the formula are assumed to vary across response variables (this was not the case by default in brms < 1.0.0). For instance, two parameters will be estimated for x, one for the effect on y1 and another for the effect on y2. This is also true for group-level effects. When writing, for instance, cbind(y1,y2) ~ x + (1+x|g), group-level effects will be estimated separately for each response. To model these effects as correlated across responses, use the ID syntax (see above). For the present example, this would look as follows: cbind(y1,y2) ~ x + (1+x|2|g). Of course, you could also use any value other than 2 as ID. It is not yet possible to model terms as only affecting certain responses (and not others), but this will be implemented in the future.

Categorical models use the same syntax as multivariate models. As in most other implementations of categorical models, values of one category (the first in brms) are fixed to identify the model. Thus, all terms on the RHS of the formula correspond to K - 1 effects (K = number of categories), one for each non-fixed category. Group-level effects may be specified as correlated across categories using the ID syntax.

As of brms 1.0.0, zero-inflated and hurdle models are specified in the same way as their non-inflated counterparts.
However, they have additional auxiliary parameters (named zi and hu respectively) modeling the zero-inflation / hurdle probability depending on which model you choose. These parameters can also be affected by predictors in the same way as the response variable itself. See the end of the Details section for information on how to accomplish that.

Parameterization of the population-level intercept

The population-level intercept (if incorporated) is estimated separately and not as part of the population-level parameter vector b. As a consequence, priors on the intercept also have to be specified separately (see set_prior for more details). Furthermore, to increase sampling efficiency, the fixed effects design matrix X is centered around its column means X_means if the intercept is incorporated. This leads to a temporary bias in the intercept equal to <X_means, b>, where <,> is the scalar product. The bias is corrected after fitting the model, but be aware that you are effectively defining a prior on the temporary intercept of the centered design matrix, not on the real intercept.

This behavior can be avoided by using the reserved (and internally generated) variable intercept. Instead of y ~ x, you may write y ~ 0 + intercept + x. This way, priors can be defined on the real intercept directly. In addition, the intercept is then treated as an ordinary fixed effect, and thus priors defined on b will also apply to it. Note that this parameterization may be a bit less efficient than the default parameterization discussed above.

Formula syntax for non-linear models

Using the nonlinear argument, it is possible to specify non-linear models in brms. Contrary to what the name might suggest, nonlinear should not contain the non-linear model itself but rather information on the non-linear parameters. The non-linear model itself is specified within the formula argument.
Suppose that we want to predict the response y through the predictor x, where x is linked to y through y = alpha - beta * lambda^x, with parameters alpha, beta, and lambda. This is certainly a non-linear model, which is defined via formula = y ~ alpha - beta * lambda^x (addition arguments can be added in the same way as for ordinary formulas). Now we have to tell brms the names of the non-linear parameters and specify a (linear mixed) model for each of them using the nonlinear argument. Let's say we just want to estimate those three parameters with no further covariates or random effects. Then we can write nonlinear = alpha + beta + lambda ~ 1 or equivalently (and more flexibly) nonlinear = list(alpha ~ 1, beta ~ 1, lambda ~ 1). This can, of course, be extended. If we have another predictor z and observations nested within the grouping factor g, we may write for instance nonlinear = list(alpha ~ 1, beta ~ 1 + z + (1|g), lambda ~ 1). The formula syntax described above applies here as well. In this example, we are using z and g only for the prediction of beta, but we might also use them for the other non-linear parameters (provided that the resulting model is still scientifically reasonable).

Non-linear models may not be uniquely identified and / or may show bad convergence. For this reason it is mandatory to specify priors on the non-linear parameters. For instructions on how to do that, see set_prior.

Formula syntax for predicting auxiliary parameters

It is also possible to predict auxiliary parameters of the response distribution such as the residual standard deviation sigma in gaussian models or the hurdle probability hu in hurdle models. The syntax closely resembles that of a non-linear parameter, for instance sigma ~ x + s(z) + (1+x|g). All auxiliary parameters currently supported by brmsformula have to be positive (a negative standard deviation or precision parameter doesn't make any sense) or are bounded between 0 and 1 (for zero-inflated / hurdle probabilities).
However, linear predictors can be positive or negative, and thus the log link (for positive parameters) or the logit link (for probability parameters) is used to ensure that auxiliary parameters are within their valid intervals. This implies that effects for auxiliary parameters are estimated on the log / logit scale and one has to apply the inverse link function to get to the effects on the original scale.
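The back-transformation mentioned above can be sketched in base R. The coefficient values and variable names below are hypothetical and do not come from a fitted brms model. The sketch only shows how exp() (inverse of the log link) and plogis() (inverse of the logit link) map linear predictors onto the valid range of sigma and zi.

```r
# Hypothetical coefficients of a model 'sigma ~ x' on the log scale.
b_sigma_intercept <- -0.5
b_sigma_x <- 0.3
x <- c(0, 1, 2)

# Linear predictor for sigma: may be negative.
eta_sigma <- b_sigma_intercept + b_sigma_x * x

# Inverse log link: the resulting sigma is always positive.
sigma <- exp(eta_sigma)

# On the original scale the effect of x is multiplicative:
# a one-unit increase in x multiplies sigma by exp(b_sigma_x).
ratio <- sigma[2] / sigma[1]

# Hypothetical linear predictor values for a zero-inflation probability zi.
eta_zi <- c(-2, 0, 2)

# Inverse logit link: the resulting zi lies strictly between 0 and 1.
zi <- plogis(eta_zi)

print(sigma)
print(zi)
```

An effect estimated on the log scale is therefore a multiplicative change on the original scale, whereas the size of an effect on the logit scale depends on where on the probability scale it is evaluated.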

Examples


# multilevel model with smoothing terms
brmsformula(y ~ x1*x2 + s(z) + (1+x1|g1) + (1|g2))

# additionally predict 'sigma'
brmsformula(y ~ x1*x2 + s(z) + (1+x1|g1) + (1|g2), 
            sigma ~ x1 + (1|g2))
            
# use the shorter alias 'bf'
(formula1 <- brmsformula(y ~ x + (x|g)))
(formula2 <- bf(y ~ x + (x|g)))
# will be TRUE
identical(formula1, formula2)

# incorporate censoring
bf(y | cens(censor_variable) ~ predictors)

# define a non-linear model
bf(y ~ a1 - a2^x, nonlinear = list(a1 ~ 1, a2 ~ x + (x|g)))

# correlated group-level effects across parameters
bf(y ~ a1 - a2^x, nonlinear = list(a1 ~ 1 + (1|2|g), a2 ~ x + (x|2|g)))

# define a multivariate model
bf(cbind(y1, y2) ~ x * z + (1|g))

# define a zero-inflated model 
# also predicting the zero-inflation part
bf(y ~ x * z + (1+x|ID1|g), zi ~ x + (1|ID1|g))

# specify a predictor as monotonic
bf(y ~ mono(x) + more_predictors)

# specify a predictor as category specific
# for ordinal models only
bf(y ~ cse(x) + more_predictors)

# add a category specific group-level intercept
bf(y ~ cse(x) + (cse(1)|g))
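The centering of the design matrix described under 'Parameterization of the population-level intercept' can be checked numerically without brms. The following sketch uses simulated data and ordinary least squares via lm as a stand-in for the Bayesian fit, so it only illustrates the algebra: the intercept of the centered model differs from the real intercept by the scalar product of X_means and b. All variable names and the simulated data are assumptions made for this illustration.

```r
set.seed(1)

# Simulated design matrix with non-zero column means (hypothetical data).
n <- 50
X <- cbind(x1 = rnorm(n, mean = 2), x2 = rnorm(n, mean = -1))
y <- 1 + 0.5 * X[, "x1"] - 2 * X[, "x2"] + rnorm(n)

# Fit on the original design matrix: yields the real intercept.
fit_raw <- lm(y ~ X)

# Center X around its column means and refit.
X_means <- colMeans(X)
Xc <- sweep(X, 2, X_means)
fit_centered <- lm(y ~ Xc)

# The slopes are unaffected by centering.
b <- coef(fit_raw)[-1]

# The temporary intercept of the centered model is biased by <X_means, b>;
# subtracting the scalar product recovers the real intercept.
temp_intercept <- coef(fit_centered)[1]
real_intercept <- temp_intercept - sum(X_means * b)

all.equal(unname(real_intercept), unname(coef(fit_raw)[1]))
```

In this sketch a prior placed on temp_intercept would refer to the expected response at the predictor means, not at zero, which is the distinction the text above draws.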
