posterior_traj: Estimate the subject-specific or marginal longitudinal trajectory

Description

This function allows us to generate an estimated longitudinal trajectory (either subject-specific, or by marginalising over the distribution of the group-specific parameters) based on draws from the posterior predictive distribution.

Usage

posterior_traj(object, m = 1, newdata = NULL, newdataLong = NULL,
  newdataEvent = NULL, interpolate = TRUE, extrapolate = FALSE,
  control = list(), last_time = NULL, prob = 0.95, ids,
  dynamic = TRUE, scale = 1.5, draws = NULL, seed = NULL,
  return_matrix = FALSE, ...)

Arguments

object

A fitted model object returned by the stan_jm modelling function. See stanreg-objects.

Integer specifying the number or name of the submodel

newdata

Deprecated: please use newdataLong instead. Optionally, a data frame in which to look for variables with which to predict. If omitted, the model matrix is used. If newdata is provided and any variables were transformed (e.g. rescaled) in the data used to fit the model, then these variables must also be transformed in newdata. This only applies if variables were transformed before passing the data to one of the modeling functions and not if transformations were specified inside the model formula.

newdataLong, newdataEvent

Optionally, a data frame (or in the case of newdataLong this can be a list of data frames) in which to look for variables with which to predict. If omitted, the model matrices are used. If new data is provided, then two options are available. Either one can provide observed covariate and outcome data, collected up to some time t, and use this data to draw new individual-specific coefficients (i.e. individual-level random effects). This is the default behaviour when new data is provided, determined by the argument dynamic = TRUE, and requiring both newdataLong and newdataEvent to be specified. Alternatively, one can specify dynamic = FALSE, and then predict using just covariate data, by marginalising over the distribution of the group-specific coefficients; in this case, only newdataLong needs to be specified and it only needs to be a single data frame with the covariate data for the predictions for the one longitudinal submodel.

interpolate

A logical specifying whether to interpolate the estimated longitudinal trajectory in between the observation times. This can be used to achieve a smooth estimate of the longitudinal trajectory across the entire follow up time. If TRUE then the interpolation can be further controlled using the control argument.

extrapolate

A logical specifying whether to extrapolate the estimated longitudinal trajectory beyond the time of the last known observation time. If TRUE then the extrapolation can be further controlled using the control argument.

control

A named list with parameters controlling the interpolation or extrapolation of the estimated longitudinal trajectory when either interpolate = TRUE or extrapolate = TRUE. The list can contain one or more of the following named elements:

ipoints: a positive integer specifying the number of discrete time points at which to calculate the estimated longitudinal response for interpolate = TRUE. These time points are evenly spaced starting at 0 and ending at the last known observation time for each individual. The last observation time for each individual is taken to be either: the event or censoring time if no new data is provided; the time specified in the "last_time" column if provided in the new data (see Details section below); or the time of the last longitudinal measurement if new data is provided but no "last_time" column is included. The default is 15.
epoints: a positive integer specifying the number of discrete time points at which to calculate the estimated longitudinal response for extrapolate = TRUE. These time points are evenly spaced between the last known observation time for each individual and the extrapolation distance specifed using either edist or eprop. The default is 15.
eprop: a positive scalar between 0 and 1 specifying the amount of time across which to extrapolate the longitudinal trajectory, represented as a proportion of the total observed follow up time for each individual. For example specifying eprop = 0.2 means that for an individual for whom the latest of their measurement, event or censoring times was 10 years, their estimated longitudinal trajectory will be extrapolated out to 12 years (i.e. 10 + (0.2 * 10)). The default value is 0.2.
edist: a positive scalar specifying the amount of time across which to extrapolate the longitudinal trajectory for each individual, represented in units of the time variable time_var (from fitting the model). This cannot be specified if eprop is specified.

last_time

A scalar, character string, or NULL. This argument specifies the last known survival time for each individual when conditional predictions are being obtained. If newdataEvent is provided and conditional survival predictions are being obtained, then the last_time argument can be one of the following: (i) a scalar, this will use the same last time for each individual in newdataEvent; (ii) a character string, naming a column in newdataEvent in which to look for the last time for each individual; (iii) NULL, in which case the default is to use the time of the latest longitudinal observation in newdataLong. If newdataEvent is NULL then the last_time argument cannot be specified directly; instead it will be set equal to the event or censoring time for each individual in the dataset that was used to estimate the model. If standardised survival probabilities are requested (i.e. standardise = TRUE) then conditional survival probabilities are not allowed and therefore the last_time argument is ignored.

prob

A scalar between 0 and 1 specifying the width to use for the uncertainty interval (sometimes called credible interval) for the predicted mean response and the prediction interval for the predicted (raw) response. For example prob = 0.95 (the default) means that the 2.5th and 97.5th percentiles will be provided. Only relevant when return_matrix is FALSE.

ids

An optional vector specifying a subset of subject IDs for whom the predictions should be obtained. The default is to predict for all individuals who were used in estimating the model or, if newdata is specified, then all individuals contained in newdata.

dynamic

A logical that is only relevant if new data is provided via the newdata argument. If dynamic = TRUE, then new group-specific parameters are drawn for the individuals in the new data, conditional on their longitudinal biomarker data contained in newdata. These group-specific parameters are then used to generate individual-specific survival probabilities for these individuals. These are often referred to as "dynamic predictions" in the joint modelling context, because the predictions can be updated each time additional longitudinal biomarker data is collected on the individual. On the other hand, if dynamic = FALSE then the survival probabilities will just be marginalised over the distribution of the group-specific coefficients; this will mean that the predictions will incorporate all uncertainty due to between-individual variation so there will likely be very wide credible intervals on the predicted survival probabilities.

scale

A scalar, specifying how much to multiply the asymptotic variance-covariance matrix for the random effects by, which is then used as the "width" (ie. variance-covariance matrix) of the multivariate Student-t proposal distribution in the Metropolis-Hastings algorithm. This is only relevant when newdataEvent is supplied and dynamic = TRUE, in which case new random effects are simulated for the individuals in the new data using the Metropolis-Hastings algorithm.

draws

An integer indicating the number of MCMC draws to return. The default is to set the number of draws equal to 200, or equal to the size of the posterior sample if that is less than 200.

seed

An optional seed to use.

return_matrix

A logical. If TRUE then a draws by nrow(newdata) matrix is returned which contains all the actual simulations or draws from the posterior predictive distribution. Otherwise if return_matrix is set to FALSE (the default) then a data frame is returned, as described in the Value section below.

...

Other arguments passed to posterior_predict, for example draws, re.form, seed, etc.

Value

When return_matrix = FALSE, a data frame of class predict.stanjm. The data frame includes a column for the median of the posterior predictions of the mean longitudinal response (yfit), a column for each of the lower and upper limits of the uncertainty interval corresponding to the posterior predictions of the mean longitudinal response (ci_lb and ci_ub), and a column for each of the lower and upper limits of the prediction interval corresponding to the posterior predictions of the (raw) longitudinal response. The data frame also includes columns for the subject ID variable, and each of the predictor variables. The returned object also includes a number of attributes.

When return_matrix = TRUE, the returned object is the same as that described for posterior_predict.

Details

The posterior_traj function acts as a wrapper to the posterior_predict function, but allows predictions to be easily generated at time points that are interpolated and/or extrapolated between time zero (baseline) and the last known survival time for the individual, thereby providing predictions that correspond to a smooth estimate of the longitudinal trajectory (useful for the plotting via the associated plot.predict.stanjm method). In addition it returns a data frame by default, whereas the posterior_predict function returns a matrix; see the Value section below for details. Also, posterior_traj allows predictions to only be generated for a subset of individuals, via the ids argument.

Examples

Run this code

# NOT RUN {
  # Run example model if not already loaded
  if (!exists("example_jm")) example(example_jm)
  
  # Obtain subject-specific predictions for all individuals 
  # in the estimation dataset
  pt1 <- posterior_traj(example_jm, interpolate = FALSE, extrapolate = FALSE)
  head(pt1)
  
  # Obtain subject-specific predictions only for a few selected individuals
  pt2 <- posterior_traj(example_jm, ids = c(1,3,8))
  
  # If we wanted to obtain subject-specific predictions in order to plot the 
  # longitudinal trajectories, then we might want to ensure a full trajectory 
  # is obtained by interpolating and extrapolating time. We can then use the 
  # generic plot function to plot the subject-specific predicted trajectories
  # for the first three individuals. Interpolation and extrapolation is 
  # carried out by default.
  pt3 <- posterior_traj(example_jm)
  head(pt3) # predictions at additional time points compared with pt1 
  plot(pt3, ids = 1:3)
  
  # If we wanted to extrapolate further in time, but decrease the number of 
  # discrete time points at which we obtain predictions for each individual, 
  # then we could specify a named list in the 'control' argument
  pt4 <- posterior_traj(example_jm, control = list(ipoints = 10, epoints = 10, eprop = 0.5))
  
  # If we have prediction data for a new individual, and we want to
  # estimate the longitudinal trajectory for that individual conditional
  # on this new data (perhaps extrapolating forward from our last
  # longitudinal measurement) then we can do that. It requires drawing
  # new individual-specific parameters, based on the full likelihood,
  # so we must supply new data for both the longitudinal and event 
  # submodels. These are sometimes known as dynamic predictions.
  ndL <- pbcLong[pbcLong$id == 8, , drop = FALSE]
  ndE <- pbcSurv[pbcSurv$id == 8, , drop = FALSE]
  ndL$id <- "new_subject" # new id can't match one used in training data
  ndE$id <- "new_subject"
  pt5 <- posterior_traj(example_jm, 
                        newdataLong = ndL,
                        newdataEvent = ndE)
                        
  # By default it is assumed that the last known survival time for 
  # the individual is the time of their last biomarker measurement,
  # but if we know they survived to some later time then we can
  # condition on that information using the last_time argument
  pt6 <- posterior_traj(example_jm, 
                        newdataLong = ndL,
                        newdataEvent = ndE, 
                        last_time = "futimeYears")
  
  # Alternatively we may want to estimate the marginal longitudinal
  # trajectory for a given set of covariates. To do this, we can pass
  # the desired covariate values in a new data frame (however the only
  # covariate in our fitted model was the time variable, year). To make sure  
  # that we marginalise over the random effects, we need to specify an ID value
  # which does not correspond to any of the individuals who were used in the
  # model estimation and specify the argument dynamic=FALSE.
  # The marginal prediction is obtained by generating subject-specific 
  # predictions using a series of random draws from the random 
  # effects distribution, and then integrating (ie, averaging) over these. 
  # Our marginal prediction will therefore capture the between-individual 
  # variation associated with the random effects.
  nd <- data.frame(id = rep("new1", 11), year = (0:10 / 2))
  pt7 <- posterior_traj(example_jm, newdataLong = nd, dynamic = FALSE)
  head(pt7)  # note the greater width of the uncertainty interval compared 
             # with the subject-specific predictions in pt1, pt2, etc
  
  # Alternatively, we could have estimated the "marginal" trajectory by 
  # ignoring the random effects (ie, assuming the random effects were set 
  # to zero). This will generate a predicted longitudinal trajectory only
  # based on the fixed effect component of the model. In essence, for a 
  # linear mixed effects model (ie, a model that uses an identity link 
  # function), we should obtain a similar point estimate ("yfit") to the
  # estimates obtained in pt5 (since the mean of the estimated random effects
  # distribution will be approximately 0). However, it is important to note that
  # the uncertainty interval will be much more narrow, since it completely
  # ignores the between-individual variability captured by the random effects.
  # Further, if the model uses a non-identity link function, then the point
  # estimate ("yfit") obtained only using the fixed effect component of the
  # model will actually provide a biased estimate of the marginal prediction.
  # Nonetheless, to demonstrate how we can obtain the predictions only using 
  # the fixed effect component of the model, we simply specify 're.form = NA'. 
  # (We will use the same covariate values as used in the prediction for 
  # example for pt5).
  pt8 <- posterior_traj(example_jm, newdataLong = nd, dynamic = FALSE, 
                        re.form = NA)
  head(pt8)  # note the much narrower ci, compared with pt5
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab