Uses rstan
with pre-compiled Stan code, or cmdstan
to get posterior draws of parameters from a binary logistic or proportional odds semiparametric ordinal logistic model. The Stan code internally using the qr decompositon on the design matrix so that highly collinear columns of the matrix do not hinder the posterior sampling. The parameters are transformed back to the original scale before returning results to R. Design matrix columns are centered before running Stan, so Stan diagnostic output will have the intercept terms shifted but the results of blrm()
for intercepts are for the original uncentered data. The only prior distributions for regression betas are normal with mean zero. Priors are specified on contrasts so that they can be specified on a meaningful scale and so that more complex patterns can be imposed. Parameters that are not involved in any contrasts in pcontrast
or npcontrast
have non-informative priors. Contrasts are automatically converted to the QR space used in Stan code.
blrm(
formula,
ppo = NULL,
cppo = NULL,
data = environment(formula),
subset,
na.action = na.delete,
priorsdppo = rep(100, pppo),
iprior = 0,
conc = 1/(0.8 + 0.35 * max(k, 3)),
ascale = 1,
psigma = 1,
rsdmean = if (psigma == 1) 0 else 1,
rsdsd = 1,
normcppo = FALSE,
pcontrast = NULL,
npcontrast = NULL,
backend = c("rstan", "cmdstan"),
iter = 2000,
warmup = iter/2,
chains = 4,
refresh = 0,
progress = if (refresh > 0) "stan-progress.txt" else "",
x = TRUE,
y = TRUE,
loo = n
an rms
fit object of class blrm
, rmsb
, rms
that also contains rstan
or cmdstanr
results under the name rstan
. In the rstan
results, which are also used to produce diagnostics, the intercepts are shifted because of the centering of columns of the design matrix done by blrm()
. With method='optimizing'
a class-less list is return with these elements: coefficients
(MLEs), beta
(non-intercept parameters on the QR decomposition scale), deviance
(-2 log likelihood), return_code
(see rstan::optimizing()
), and, if you specified hessian=TRUE
to blrm()
, the Hessian matrix. To learn about the scaling of orthogonalized QR design matrix columns, look at the xqrsd
object in the returned object. This is the vector of SDs for all the columns of the transformed matrix. The returned element sampling_time
is the elapsed time for running posterior samplers, in seconds. This will be just a little more than the time for running one CPU core for one chain.
a R formula object that can use rms
package enhancements such as the restricted interaction operator
formula specifying the model predictors for which proportional odds is not assumed
a function that if present causes a constrained partial PO model to be fit. The function specifies the values in the Gamma vector in Peterson and Harrell (1990) equation (6). Sometimes to make posterior sampling better behaved, the function should be scaled and centered. This is done by wrapping cppo
in a function that scales the cppo
result before returning the vector value, when normcppo
is TRUE
. The default normalization is based on the mean and standard deviation of the function values over the distribution of observed Y. For getting predicted values and estimates post-blrm()
, cppo
must not reference any functions that are not available at such later times.
a data frame; defaults to using objects from the calling environment
a logical vector or integer subscript vector specifying which subset of data whould be used
default is na.delete
to remove missings and report on them
vector of prior standard deviations for non-proportional odds parameters. The last element is the only one for which the SD corresponds to the original data scale. This only applies to the unconstrained PPO model.
specifies whether to use a Dirichlet distribution for the cell probabilities, which induce a more complex prior distribution for the intercepts (iprior=0
, the default), non-informative priors (iprior=1
) directly on the intercept parameters, or to directly use a t-distribution with 3 d.f. and scale parameter ascale
(iprior=2
).
the Dirichlet distribution concentration parameter for the prior distribution of cell probabilities at covariate means. The default is the reciprocal of 0.8 + 0.35 max(k, 3) where k is the number of Y categories. The default is chosen to make the posterior mean of the intercepts more closely match the MLE. For optimizing, the concentration parameter is always 1.0 when iprior=0
to obtain results very close to the MLE for providing the posterior mode.
scale parameter for the t-distribution for priors for the intercepts if iprior=2
, defaulting to 1.0
defaults to 1 for a half-t distribution with 4 d.f., location parameter rsdmean
and scale parameter rsdsd
. Set psigma=2
to use the exponential distribution.
the assumed mean of the prior distribution of the standard deviation of random effects. When psigma=2
this is the mean of an exponential distribution and defaults to 1. When psigma=1
this is the mean of the half-t distribution and defaults to zero.
applies only to psigma=1
and is the scale parameter for the half t distribution for the SD of random effects, defaulting to 1.
set to TRUE
to modify the cppo
function automatically centering and scaling the result
a list specifying contrasts that are to be given Gaussian prior distributions. The predictor combinations specified in pcontrast
are run through rms::gendata()
so that contrasts are specified in units of original variables, and unspecified variables are set to medians or modes as saved by rms::datadist()
. Thanks to Stan
, putting priors on combinations and transformations of model parameters has the same effect of putting different priors on the original parameters without figuring out how to do that. The syntax used here allows specification of differences, double differences (e.g., interactions or nonlinearity), triple differences (e.g., to put contraints on nonlinear interactions), etc. The requested predictor combinations must be named so they may be referred to inside contrast
. The syntax is pcontrast=list(..., contrast=expression(...), mu=, sd=, weights=, expand=)
. ...
denotes one or more list()
s with predictor combinations, and each list()
must be named, e.g., pcontrast=list(c1=list(sex='female'), c2=list(sex='male'))
to set up for a female - male
contrast specified as contrast=expression(c1 - c2)
. The c1 - c2
subtraction will operate on the design matrices generated by the covariate settings in the list()
s. For weights, expand
see rms::Xcontrast()
and rms::contrast.rms()
. mu
is a vector of prior means associated with the rows of the stacked contrasts, and sd
is a corresponding vector of Gaussian prior SDs. When mu
is not given it defaults to 0.0, and sd
defaults to 100.0. Values of mu
and/or sd
are repeated to the number of contrasts if they are of length 1. Full examples are given here.
like pcontrast
but applies to the non-proportional odds submodel in ppo
. Priors for the amount of departure from proportional odds are isolated from the priors of the "main effects" in formula
. The mean and standard deviation for the non-PO contrasts are on the scale of Z*tau before cppo
is applied. If cppo
picks off a single condition, i.e., death is the highest level of Y and you want a special effect of treatment on death, then cppo
will be something like function(y) y == 4
and the contrast prior will be on the scale of the additional treatment effect for death. If cppo
is more of a continuous function you will have to take into account the values of that function when figuring the prior mean and SD. For example, if y ranges from 10-90 and cppo
is sqrt(y)
, and you want to specify a prior on the log odds ratio for y=10 vs. y=90 you'll need to divide the prior standard deviation in npcontrast
by sqrt(90) - sqrt(10)
.
set to cmdstan
to use cmdstan
through the R cmdstanr
package instead of the default rstan
. You can also specify this with a global option rmsb.backend
.
number of posterior samples per chain for rstan::sampling()
to run, counting warmups
number of warmup iterations to discard. Default is iter
/2.
number of separate chains to run
see rstan::sampling()
and cmdstanr::sample()
. The default is 0, indicating that no progress notes are output. If refresh > 0
and progress
is not ''
, progress output will be appended to file progress
. The default file name is 'stan-progress.txt'
.
see refresh
. Defaults to ''
if refresh = 0
. Note: If running interactively but not under RStudio, rstan
will open a browser window for monitoring progress.
set to FALSE
to not store the design matrix in the fit. x=TRUE
is needed if running blrmStats
for example.
set to FALSE
to not store the response variable in the fit
set to FALSE
to not run loo
and store its result as object loo
in the returned object. loo
defaults to FALSE
if the sample size is greater than 1000, as loo
requires the per-observation likelihood components, which creates a matrix N times the number of posterior draws.
set to a file name to run rstan
pairs
or, if backend='cmdstan'
bayesplot::mcmc_pairs
and store the resulting png plot there. Set to TRUE
instead to directly plot these diagnostics. The default is not to run pair plots.
set to 'optimizing'
to run the Stan optimizer and not do posterior sampling, 'both'
(the default) to run both the optimizer and posterior sampling, or 'sampling'
to run only the posterior sampling and not compute posterior modes. Running optimizing
is a way to obtain maximum likelihood estimates and allows one to quickly study the effect of changing the prior distributions. When method='optimizing'
is used the result returned is not a standard blrm()
object but is instead the parameter estimates, -2 log likelihood, and optionally the Hession matrix (if you specify hessian=TRUE
in ...; not available with cmdstan
). When method='both'
is used, rstan::sampling()
and rstan::optimizing()
are both run, and parameter estimates (posterior modes) from optimizing
are stored in a matrix param
in the fit object, which also contains the posterior means and medians, and other results from optimizing
are stored in object opt
in the blrm()
fit object. When random effects are present, method
is automatically set to 'sampling'
as maximum likelihood estimates without marginalizing over the random effects do not make sense. When you specify method='optimizing'
specify iprior=
to get regular MLEs in which no prior is put on the intercepts.
intial value for optimization. The default is the rstan
default 'random'
. Frequently specifying init=0
will benefit when the number of distinct Y categories grows or when using ppo
hence 0 is the default for that.
initial value for sampling, defaults to inito
set to TRUE
to return the Stan data list and not run the model
a file name for a saveRDS
-created file containing or to contain the saved fit object. If file
is specified and the file does not exist, it will be created right before the fit object is returned, less the large rstan
object. If the file already exists, its stored md5
hash string datahash
fit object component is retrieved and compared to that of the current rstan
inputs. If the data to be sent to rstan
, the priors, and all sampling and optimization options and stan code are identical, the previously stored fit object is immediately returned and no new calculatons are done.
set to TRUE
to output timing and progress information to /tmp/debug.txt
a list containing parameters to pass to rstan::sampling()
or to the rcmdstan
sample
function, other than these arguments: iter, warmup, chains, refresh, init
which are already arguments to blrm
set to TRUE
to show Stan optimizer output
passed to rstan::optimizing()
or the rcmdstan
optimizing function. The seed
parameter is a popular example.
Frank Harrell and Ben Goodrich
The partial proportional odds model of Peterson and Harrell (1990) is implemented, and is invoked when the user specifies a second model formula as the ppo
argument. This formula has no left-hand-side variable, and has right-side variables that are a subset of those in formula
specifying for which predictors the proportional odds assumption is relaxed.
The Peterson and Harrell (1990) constrained partial proportional odds is also implemented, and is usually preferred to the above unconstrained PPO model as it adds a vector of coefficients instead of a matrix of coefficients. In the constrained PPO model the user provides a function cppo
that computes a score for all observed values of the dependent variable. For example with a discrete ordinal outcome cppo
may return a value of 1.0 for a specific value of Y and zero otherwise. That will result in a departure from the proportional odds assumption for just that one level of Y. The value returned by cppo
at the lowest Y value is never used in any case.
blrm()
also handles single-level hierarchical random effects models for the case when there are repeated measurements per subject which are reflected as random intercepts, and a different experimental model that allows for AR(1) serial correlation within subject. For both setups, a cluster
term in the model signals the existence of subject-specific random effects.
When using the cmdstan
backend, cmdstanr
will need to compile the Stan code once per computer, only recompiling the code when the Stan source code changes. By default the compiled code is stored in directory .rmsb
under your home directory. Specify options(rmsbdir=)
to specify a different location. You should specify rmsbdir
to be in a project-specific location if you want to archive code for old projects.
If you want to run MCMC sampling even when no inputs or Stan code have changed, i.e., to use a different random number seed for the sampling process, remove the file
before running blrm
.
Set options(rmsbmsg=FALSE)
to suppress certain information messages.
print.blrm()
, blrmStats()
, stanDx()
, stanGet()
, coef.rmsb()
, vcov.rmsb()
, print.rmsb()
, coef.rmsb()
if (FALSE) {
getHdata(titanic3)
dd <- datadist(titanic3); options(datadist='dd')
f <- blrm(survived ~ (rcs(age, 5) + sex + pclass)^2, data=titanic3)
f # model summary using print.blrm
coef(f) # compute posterior mean parameter values
coef(f, 'median') # compute posterior median values
stanDx(f) # print basic Stan diagnostics
s <- stanGet(f) # extract rstan object from fit
plot(s, pars=f$betas) # Stan posteriors for beta parameters
stanDxplot(s) # Stan diagnostic plots by chain
blrmStats(f) # more details about predictive accuracy measures
ggplot(Predict(...)) # standard rms output
summary(f, ...) # invokes summary.rms
contrast(f, ...) # contrast.rms computes HPD intervals
plot(nomogram(f, ...)) # plot nomogram using posterior mean parameters
# Fit a random effects model to handle multiple observations per
# subject ID using cmdstan
# options(rmsb.backend='cmdstan')
f <- blrm(outcome ~ rcs(age, 5) + sex + cluster(id), data=mydata)
}
Run the code above in your browser using DataLab