augment: A fast and general method for building augmented data

Description

A fast and general method for reshaping standard longitudinal data into a new structure called 'augmented'. This format is suitable for analysis under a multi-state framework using the msm package.

Usage

augment(
  data,
  data_key,
  n_events,
  pattern,
  state = list("IN", "OUT", "DEAD"),
  t_start,
  t_end,
  t_cens,
  t_death,
  t_augmented,
  more_status,
  check_NA = FALSE,
  convert = FALSE,
  verbose = TRUE
)

Arguments

data

A data.table or data.frame object in longitudinal format where each row represents an observation in which the exact starting and ending time of the process are known and recorded. If data is a data.frame, then augment internally casts it to a data.table.

data_key

A keying variable which augment uses to define a key for data. This represents the subject ID (see setkey).

n_events

An integer variable indicating the progressive (monotonic) event number of a given ID. augment always checks whether n_events is monotonically increasing within the provided data_key and stops the execution if the check fails (see 'Details'). If missing, augment quickly creates a variable named "n_events".

pattern

Either an integer, a factor or a character with 2 or 3 unique values which provides the ID status at the end of the study. pattern has a predefined structure. When 2 values are detected, they must be in the format: 0 = "alive", 1 = "dead". When 3 values are detected, then the format must be: 0 = "alive", 1 = "dead during a transition", 2 = "dead after a transition has ended" (see 'Details').

state

A list of exactly three possible states which a subject can reach. state has a predefined structure as follows: IN, OUT, DEAD (see 'Details').

t_start

The starting time of an observation. It can be passed as date, integer, or numeric format.

t_end

The ending time of an observation. It can be passed as date, integer, or numeric format.

t_cens

The censoring time of the study. This is the date up to which each ID is observed, if still active in the cohort.

t_death

The exact death time of a subject ID. If t_death is missing, t_cens is assumed to contain both censoring and death times and a warning is raised.

t_augmented

A variable indicating the name of the new time variable of the process in the augmented format. If t_augmented is missing, then the default name 'augmented' is assumed and the corresponding new variable is added to data. t_augmented is cast to integer or to numeric depending on whether t_start is a date or a difftime, respectively. The suffix '_int' or '_num' is pasted to t_augmented and a new variable is computed accordingly. This is done because msm cannot correctly deal with date or difftime variables. Both variables are positioned before t_start.

more_status

A variable which marks further transitions besides the default ones given by state. more_status can be a factor or a character (see 'Details'). If missing, augment ignores it.

check_NA

If TRUE, then arguments data_key, n_events, pattern, t_start and t_end are checked for missing data and, if any are found, the function stops with an error. Default is FALSE because augment is not intended for running consistency checks beyond what is mandatory, and because the procedure is computationally expensive and could cause memory overhead for very large datasets. Argument more_status is the only one for which augment always checks for missing data and, again, stops with an error if it finds any.

convert

If TRUE, then the returned object is automatically converted to the class data.frame. This is done in place and comes at very low cost in both running time and memory consumption (see setDF).

verbose

If FALSE, all information produced by print, cat and message is suppressed. Default is TRUE.
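As a rough illustration of what check_NA = TRUE guards against, the following base R sketch scans a set of columns for missing values and reports the offending ones. The helper find_na_columns and the toy data are hypothetical and not part of the package; the check that augment performs internally may differ.

```r
# Hypothetical helper (not part of the package): names of columns containing NA
find_na_columns <- function(data, cols) {
  has_na <- vapply(data[cols], anyNA, logical(1))
  names(has_na)[has_na]
}

# Toy data; column names follow 'Examples', values are made up
toy <- data.frame(
  subj       = c(1L, 1L, 2L),
  adm_number = c(1L, 2L, NA),
  dateIN     = as.Date(c("2010-01-05", "2010-02-10", NA))
)

na_cols <- find_na_columns(toy, c("subj", "adm_number", "dateIN"))
if (length(na_cols) > 0) {
  message("Missing values found in: ", paste(na_cols, collapse = ", "))
}
```

Running a check of this kind before calling augment keeps check_NA at its cheaper default while still catching missing data early.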

Value

An augmented format dataset of class data.table, or data.frame when convert is TRUE, where each row represents a specific transition for a given subject. The dataset is returned after the following variables have been computed:

augmented

The new timing variable for the process when looking at transitions. If t_augmented is missing, then augment creates augmented by default. The function builds it directly from t_start and t_end, and thus it inherits their class. In particular, if t_start is in a date format, then augment computes a new variable cast as integer and names it augmented_int. If t_start is in a difftime format, then augment computes a new variable cast as numeric and names it augmented_num.

status

A status flag which contains the states as specified in state. augment automatically checks whether argument pattern has 2 or 3 unique values and computes the correct structure of a given subject as reported in the vignette. The variable is cast as character.

status_num

The corresponding integer version of status.

n_status

A mix of status and n_events cast as character. This becomes useful when a multi-state model on the progression of the process needs to be implemented.

If more_status is passed, then augment computes some more variables. They mimic the meaning of status, status_num, and n_status but they account for the more complex structure defined. They are: status_exp, status_exp_num, and n_status_exp.
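The integer and numeric versions of the time variable described above can be pictured with plain base R casts. This is only a sketch of the general idea: the exact calls and time origin used inside augment are not stated here, and the values below are made up.

```r
# A Date is stored as the number of days since 1970-01-01,
# so casting it to integer yields a day count
d <- as.Date(c("1970-01-10", "1970-02-01"))
augmented_int <- as.integer(d)

# as.numeric() on a difftime drops the class and keeps the value
# expressed in the object's own units (days here)
elapsed <- as.difftime(c(1.5, 3), units = "days")
augmented_num <- as.numeric(elapsed)
```

Plain integer or numeric times of this kind are what a model-fitting function can consume when it does not handle date or difftime classes.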

Details

For the data to be processed, the event sequence must be monotonically increasing within each subject. augment checks this whether or not n_events is supplied. First, the data are efficiently ordered with the setkey function, using data_key as the primary key and t_start as the secondary key. Second, augment checks the monotonicity of n_events; if the check fails, it stops with an error and returns the subjects given by data_key for whom the condition is not met. If n_events is missing, augment internally computes the progression number under the name n_events and runs the same procedure.
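The ordering and monotonicity check can be sketched with data.table as follows. This is an illustration under stated assumptions, not the code augment runs: the toy values are made up, the column names are borrowed from 'Examples', and treating "monotonic" as strictly increasing is an assumption of this sketch.

```r
library(data.table)

# Toy data; subject 2 has event numbers that disagree with the date order
dt <- data.table(
  subj       = c(2L, 1L, 1L, 2L),
  adm_number = c(1L, 1L, 2L, 2L),
  dateIN     = as.Date(c("2010-03-01", "2010-01-05", "2010-02-10", "2010-01-20"))
)

# Sort by subject (primary key) and then by starting time (secondary key)
setkey(dt, subj, dateIN)

# Flag subjects whose event number does not increase along the sorted rows
check <- dt[, .(ok = all(diff(adm_number) > 0)), by = subj]
bad_ids <- check[ok == FALSE, subj]

# Fallback when no event number is supplied: count rows within each subject
dt[, n_events := seq_len(.N), by = subj]
```

Here bad_ids holds the subjects that would need fixing before the data can be reshaped, while n_events shows the counter that would be generated from the sorted rows.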

Attention needs to be paid to argument pattern. Integer values can be 0 and 1 if only two statuses are defined, and they must correspond to the statuses 'alive' and 'dead'. If three values are defined, then they must be 0, 1 and 2 if pattern is an integer, or 'alive', 'dead inside a transition' and 'dead outside a transition' if pattern is either a character or a factor. The order matters: it is not possible to specify 0 as 'dead', for instance.
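A quick way to confirm that an integer-coded pattern uses the expected codes is sketched below. The function check_pattern is hypothetical and not part of the package. It can only verify that the codes are 0, 1 or 0, 1, 2; it cannot verify that 0 really means 'alive'.

```r
# Hypothetical validator (not part of the package) for an integer-coded pattern
check_pattern <- function(pattern) {
  values <- sort(unique(pattern))
  n <- length(values)
  # Accept exactly the codes 0..1 or 0..2, in that order
  n %in% 2:3 && all(values == seq_len(n) - 1L)
}

two_ok    <- check_pattern(c(0L, 1L, 1L, 0L))  # codes 0 and 1
three_ok  <- check_pattern(c(0L, 2L, 1L))      # codes 0, 1 and 2
shifted   <- check_pattern(c(1L, 2L))          # codes do not start at 0
```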

When passing a list of states, the order is important: the first element must be the state corresponding to the starting time (e.g. 'IN', inside the hospital), the second element must correspond to the ending time (e.g. 'OUT', outside the hospital), and the third element is the absorbing state (e.g. 'DEAD').
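To make the role of the state order concrete, the base R sketch below expands each observation into one row at its starting time and one row at its ending time. It is a simplified picture, not augment's algorithm: the terminal row for censoring or death is deliberately left out, the toy values are made up, and both the integer coding by position in state and the "number plus status" label format are assumptions.

```r
state <- list("IN", "OUT", "DEAD")

# Toy longitudinal data: one subject, two observations
long <- data.frame(
  subj     = c(1L, 1L),
  n_events = c(1L, 2L),
  t_start  = c(0, 10),
  t_end    = c(4, 15)
)

# Two rows per observation: state[[1]] at t_start, state[[2]] at t_end
aug <- data.frame(
  subj      = rep(long$subj, each = 2),
  n_events  = rep(long$n_events, each = 2),
  augmented = as.vector(rbind(long$t_start, long$t_end)),
  status    = rep(c(state[[1]], state[[2]]), times = nrow(long)),
  stringsAsFactors = FALSE
)

# Integer version of status; coding by position in 'state' is an assumption
aug$status_num <- match(aug$status, unlist(state))

# Event number combined with status; the exact format is an assumption
aug$n_status <- paste(aug$n_events, aug$status)
```

Swapping the first two elements of state would attach the wrong label to every starting and ending time, which is why the order matters.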

more_status allows managing multiple transitions besides those already specified in state. In particular, if the corresponding observation is a standard admission which adds no information beyond what is in state, then more_status must be set to 'df', which stands for 'Default' (see 'Examples' or run ?hosp and look at the variable 'rehab_it'). In general, it is good practice to fully specify each transition with a short, self-explanatory label so that the current transition can be identified quickly.

References

Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical Software, 38(8), 1-29. URL https://www.jstatsoft.org/v38/i08/.

Dowle, M., Srinivasan, A., Short, T., Lianoglou, S., with contributions from Saporta, R. and Antonyan, E. (2016). data.table: Extension of data.frame. R package version 1.9.6. URL https://github.com/Rdatatable/data.table/wiki

Examples

# NOT RUN {
# loading data
data( hosp )

# 1.
# augmenting hosp
hosp_augmented = augment( data = hosp, data_key = subj, n_events = adm_number,
                          pattern = label_3, t_start = dateIN, t_end = dateOUT,
                          t_cens = dateCENS )

# 2.
# augmenting hosp by passing more information regarding transitions
# with argument more_status
hosp_augmented_more = augment( data = hosp, data_key = subj, n_events = adm_number,
                               pattern = label_3, t_start = dateIN, t_end = dateOUT,
                               t_cens = dateCENS, more_status = rehab_it )

# 3.
# augmenting hosp and returning a data.frame
hosp_augmented = augment( data = hosp, data_key = subj, n_events = adm_number,
                          pattern = label_3, t_start = dateIN, t_end = dateOUT,
                          t_cens = dateCENS, convert = TRUE )
class( hosp_augmented )

# }
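The third example relies on the in-place conversion performed when convert = TRUE. The mechanism can be seen in isolation with setDF from data.table, applied here to a made-up table rather than to the output of augment.

```r
library(data.table)

# Toy table standing in for an augmented dataset
dt <- data.table(id = 1:3, x = c(0.1, 0.2, 0.3))

# setDF() changes the class by reference: no copy and no reassignment needed
setDF(dt)
class(dt)
```

Because the conversion happens by reference, the same object is simply relabelled as a data.frame, which is why the cost is negligible even for large datasets.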
