ltrcrrf: Fit an LTRC relative risk forest

Description

An implementation of the random forest algorithm that uses LTRC rpart trees (LTRCART) as base learners for left-truncated right-censored (LTRC) survival data with time-invariant covariates. It also handles (left-truncated) right-censored survival data with time-varying covariates.

Usage

ltrcrrf(
  formula,
  data,
  id,
  ntree = 100L,
  mtry = NULL,
  nodesize = max(ceiling(sqrt(nrow(data))), 15),
  bootstrap = c("by.sub", "by.root", "by.node", "by.user", "none"),
  samptype = c("swor", "swr"),
  sampfrac = 0.632,
  samp = NULL,
  na.action = "na.omit",
  stepFactor = 2,
  trace = TRUE,
  nodedepth = NULL,
  nsplit = 10L,
  ntime
)
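
The default for nodesize in the signature above depends on the number of rows in data. The following minimal sketch evaluates that expression for two row counts; the helper name default_nodesize is hypothetical and is not part of the package.

```r
# Hypothetical helper: the default nodesize expression from the Usage
# section, written as a function of the number of rows n = nrow(data).
default_nodesize <- function(n) max(ceiling(sqrt(n)), 15)

default_nodesize(100)   # sqrt(100) = 10, so the lower bound of 15 applies
default_nodesize(400)   # sqrt(400) = 20, which exceeds 15
```

In other words, the square-root rule only takes effect once data has more than 225 rows; below that the default is 15.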

Value

An object of class ltrcrrf, a subclass of rfsrc.

Arguments

formula: a formula object, with the response being a Surv object, with form
data: a data frame containing n rows of left-truncated right-censored observations. For time-varying data, this should be a data frame containing pseudo-subject observations based on the Andersen-Gill reformulation.
id: variable name of subject identifiers. If this is present, it will be searched for in the data data frame. Each group of rows in data with the same subject id represents the covariate path through time of a single subject. If not specified, the algorithm then assumes data contains left-truncated and right-censored survival data with time-invariant covariates.
ntree: an integer, the number of trees to grow for the forest. ntree = 100L is set by default.
mtry: number of input variables randomly sampled as candidates at each node for random forest like algorithms. The default mtry is tuned by tune.ltrcrrf.
nodesize: an integer, forest average terminal node size.
bootstrap: bootstrap protocol. (1) If id is present, the choices are "by.sub" (the default), which bootstraps subjects, and "by.root", which bootstraps pseudo-subjects. Both can be done with or without replacement (by default sampling is without replacement; see the option samptype below). (2) If id is not specified, the default is "by.root", which bootstraps the data by sampling with or without replacement; if "by.node" is chosen, data is bootstrapped with replacement at each node while growing the tree. Regardless of the presence of id, if "none" is chosen, the data is not bootstrapped at all and the full data set is used in every individual tree. If "by.user" is chosen, the bootstrap specified by samp is used.
samptype: choices are swor (sampling without replacement) and swr (sampling with replacement). The default action here is sampling without replacement.
sampfrac: a fraction, determining the proportion of subjects to draw without replacement when samptype = "swor". The default value is 0.632. To be more specific, if id is present, 0.632 * N of subjects with their pseudo-subject observations are drawn without replacement (N denotes the number of subjects); otherwise, 0.632 * n is the requested size of the sample.
samp: bootstrap specification when bootstrap = "by.user". Array of dim n x ntree specifying how many times each record appears in each bootstrap sample.
na.action: action taken if the data contains NA’s. The default "na.omit" removes the entire record if any of its entries is NA (for x-variables this applies only to those specifically listed in formula). See function rfsrc for other available options.
stepFactor: at each iteration, mtry is inflated (or deflated) by this value, used when mtry is not specified (see tune.ltrcrrf). The default value is 2.
trace: whether to print the progress of the search of the optimal value of mtry if mtry is not specified (see tune.ltrcrrf). trace = TRUE is set by default.
nodedepth: maximum depth to which a tree should be grown. The default behaviour is that this parameter is ignored.
nsplit: a non-negative integer value for the number of random splits to consider for each candidate splitting variable. This significantly increases speed. When zero or NULL, the algorithm uses much slower deterministic splitting where all possible splits are considered. nsplit = 10L by default.
ntime: an integer value used for survival to constrain ensemble calculations to a grid of ntime time points. Alternatively, if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations (note that the constrained time points used will be the observed event times closest to the user-supplied time points). If no value is specified, the default action is to use all observed event times. Further details can be found in rfsrc.
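
As an illustration of the samp argument described above, the following sketch builds an n x ntree matrix of inbag counts using base R only. The sizes n and ntree are hypothetical, and the choice of drawing n records with replacement for every tree is an assumption made for the example, not a requirement stated in this documentation.

```r
# Hypothetical sizes, chosen only for illustration.
n <- 8L        # number of records (rows of data)
ntree <- 5L    # number of trees in the forest

set.seed(1)
# Column b holds, for each record, how often it appears in bootstrap sample b.
samp <- vapply(
  seq_len(ntree),
  function(b) tabulate(sample.int(n, size = n, replace = TRUE), nbins = n),
  integer(n)
)

dim(samp)       # n rows, ntree columns
colSums(samp)   # every bootstrap sample contains n draws in total
```

A matrix of this shape would be supplied as samp = samp together with bootstrap = "by.user".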

Details

This function extends the relative risk forest algorithm (Ishwaran et al. 2004) to fit left-truncated and right-censored data, which also allows for time-varying covariates. The implementation builds on the fast C code from rfsrc.
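
For time-varying covariates, data is expected in the pseudo-subject (Andersen-Gill) layout, with one row per interval over which a subject's covariates are constant. The sketch below shows what such a layout might look like; the data frame toy and its values are invented for illustration, and the column names simply mirror those used in the Examples section.

```r
# Hypothetical toy data: two subjects, one time-varying covariate (chol).
toy <- data.frame(
  ID    = c(1, 1, 1, 2, 2),
  Start = c(0, 5, 12, 3, 9),     # left end of each interval (subject 2 enters at t = 3)
  Stop  = c(5, 12, 20, 9, 15),   # right end of each interval
  Event = c(0, 0, 1, 0, 0),      # event indicator, 1 only on the row where the event occurs
  chol  = c(210, 230, 260, 180, 175)
)

# Within a subject, consecutive intervals should join up without gaps or overlaps.
contiguous <- vapply(
  split(toy, toy$ID),
  function(d) all(d$Start[-1] == d$Stop[-nrow(d)]),
  logical(1)
)
contiguous
```

Each group of rows sharing an ID traces one subject's covariate path through time, which is what the id argument identifies.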

References

Andersen, P. and Gill, R. (1982). Cox’s regression model for counting processes, a large sample study. Annals of Statistics, 10:1100-1120.

Ishwaran, H., Blackstone, E. H., Pothier, C., and Lauer, M. S. (2004). Relative risk forests for exercise heart rate recovery as a predictor of mortality. Journal of the American Statistical Association, 99(1):591–600.

Fu, W. and Simonoff, J.S. (2016). Survival trees for left-truncated and right-censored data, with application to time-varying covariate data. Biostatistics, 18(2):352–369.

Examples

#### Example with time-varying data pbcsample
library(survival)     # provides Surv()
library(LTRCforests)  # provides ltrcrrf(); pbcsample is assumed to ship with this package
Formula <- Surv(Start, Stop, Event) ~ age + alk.phos + ast + chol + edema
# Build an LTRCRRF forest (based on bootstrapping subjects without replacement)
# on the time-varying data by specifying id:
LTRCRRFobj <- ltrcrrf(formula = Formula, data = pbcsample, id = ID, stepFactor = 3,
                      ntree = 10L)