sim_mHMM: Simulate data using a multilevel hidden Markov model

Description

sim_mHMM simulates data for multiple subjects, for which the data have either categorical or continuous (i.e., normally distributed) observations that follow a hidden Markov model (HMM) with a multilevel structure. The multilevel structure implies that each subject is allowed to have its own set of parameters, and that the parameters at the subject level (level 1) are tied together by a population distribution at level 2 for each of the corresponding parameters. The shape of the population distribution for each of the parameters is a normal distribution. In addition to (natural and/or unexplained) heterogeneity between subjects, the subjects parameters can also depend on a covariate.

Usage

sim_mHMM(
  n_t,
  n,
  data_distr = "categorical",
  gen,
  gamma,
  emiss_distr,
  start_state = NULL,
  xx_vec = NULL,
  beta = NULL,
  var_gamma = 0.1,
  var_emiss = NULL,
  return_ind_par = FALSE,
  m,
  n_dep,
  q_emiss,
  log_scale = FALSE
)

Value

The following components are returned by the function sim_mHMM:

states: A matrix containing the simulated hidden state sequences, with one row per hidden state per subject. The first column indicates subject id number. The second column contains the simulated hidden state sequence, consecutively for all subjects. Hence, the id number is repeated over the rows (with the number of repeats equal to the length of the simulated hidden state sequence T for each subject).
obs: A matrix containing the simulated observed outputs, with one row per simulated observation per subject. The first column indicates subject id number. The second column contains the simulated observation sequence, consecutively for all subjects. Hence, the id number is repeated over rows (with the number of repeats equal to the length of the simulated observation sequence T for each subject).
gamma: A list containing n elements with the simulated subject specific transition probability matrices gamma. Only returned if return_ind_par is set to TRUE.
emiss_distr: A list containing n elements with the simulated subject specific emission probability matrices emiss_distr. Only returned if return_ind_par is set to TRUE.

Arguments

n_t

Numeric vector with length 1 denoting the length of the observed sequence to be simulated for each subject. To only simulate subject specific transition probability matrices gamma and emission distributions (and no data), set t to 0.

n

Numeric vector with length 1 denoting the number of subjects for which data is simulated.

data_distr

A character vector of length 1 denoting the distribution adopted for the data given the hidden states. It can take the values 'categorical', 'continuous', or 'count', standing for the for categorical observations following a Multinomial logit, continuous observations following a normal distribution, and count observations following a Poisson distribution, correspondingly.

gen

List containing the following elements denoting the general model properties:

m: numeric vector with length 1 denoting the number of hidden states
n_dep: numeric vector with length 1 denoting the number of dependent variables
q_emiss: only to be specified if the data represents categorical data. Numeric vector with length n_dep denoting the number of observed categories for the categorical emission distribution for each of the dependent variables.

gamma

A matrix with m rows and m columns containing the average population transition probability matrix used for simulating the data. That is, the probability to switch from hidden state i (row i) to hidden state j (column j).

emiss_distr

A list with n_dep elements containing the average population emission distribution(s) of the observations given the hidden states for each of the dependent variables. If data_distr = 'categorical', each element is a matrix with m rows and q_emiss[k] columns for each of the k in n_dep emission distribution(s). That is, the probability of observing category q (column q) in state i (row i). If data_distr = 'continuous', each element is a matrix with m rows and 2 columns; the first column denoting the mean of state i (row i) and the second column denoting the standard deviation of state i (row i) of the Normal distribution. If data_distr = 'count', each element is a matrix with m rows and 1 column; the only column denoting the logmean of state i (row i) of the lognormal distribution used as prior for the Poisson emissions (note: by default the logmeans should be specified in the natural scale; see argument log_scale below, unless covariates are used to predict a count emission distribution).

start_state

Optional numeric vector with length 1 denoting in which state the simulated state sequence should start. If left unspecified, the simulated state for time point 1 is sampled from the initial state distribution (which is derived from the transition probability matrix gamma).

xx_vec

List of 1 + n_dep vectors containing the covariate(s) to predict the transition probability matrix gamma and/or (specific) emission distribution(s) emiss_distr using the regression parameters specified in beta (see below). The first element in the list xx_vec is used to predict the transition matrix. Subsequent elements in the list are used to predict the emission distribution of (each of) the dependent variable(s). This means that the covariate used to predict gamma and emiss_distr can either be the same covariate, different covariates, or a covariate for certain elements and none for the other. At this point, it is only possible to use one covariate for both gamma and emiss_distr. For all elements in the list, the number of observations in the vectors should be equal to the number of subjects to be simulated n. If xx_vec is omitted completely, xx_vec defaults to NULL, resembling no covariates at all. Specific elements in the list can also be left empty (i.e., set to NULL) to signify that either the transition probability matrix or (one of) the emission distribution(s) is not predicted by covariates.

beta

List of 1 + n_dep matrices containing the regression parameters to predict gamma and/or emiss_distr in combination with xx_vec using (Multinomial logistic) regression. The first matrix is used to predict the transition probability matrix gamma. The subsequent matrices are used to predict the emission distribution(s) emiss_distr of the dependent variable(s). For gamma and categorical emission distributions, one regression parameter is specified for each element in gamma and emiss_distr, with the following exception. The first element in each row of gamma and/or emiss_distr is used as reference category in the Multinomial logistic regression. As such, no regression parameters can be specified for these parameters. Hence, the first element in the list beta to predict gamma consist of a matrix with the number of rows equal to m and the number of columns equal to m - 1. For categorical emission distributions, the subsequent elements in the list beta to predict emiss_distr consist of a matrix with the number of rows equal to m and the number of columns equal to q_emiss[k] - 1 for each of the k in n_dep emission distribution(s). See details for more information. For continuous and count emission distributions, the subsequent elements in the list beta consist of a matrix with the number of rows equal to m and 1 column.

Note that if beta is specified, xx_vec has to be specified as well. If beta is omitted completely, beta defaults to NULL, resembling no prediction of gamma and emiss_distr using covariates. One of the elements in the list can also be left empty (i.e., set to NULL) to signify that either the transition probability matrix or a specific emission distribution is not predicted by covariates. If covariates are used to predict a count emission distribution (data_distr = 'count'), then the logarithmic scale should be used for the inputs emiss_distr and var_emiss in addition to beta, and also set log_scale = TRUE.

var_gamma

Either a numeric vector with length 1 or a matrix of (m by m - 1) elements denoting the amount of variance between subjects in the transition probability matrix. Note that the value(s) correspond to the variance of the parameters of the Multinomial distribution (i.e., the intercepts of the regression equation of the Multinomial distribution used to sample the transition probability matrix), see details below. Also note that if only one variance value is provided, it will be adopted for the complete transition probability matrix, hence the variance is assumed fixed across all components. The default equals 0.1, which corresponds to little variation between subjects. If one wants to simulate data from exactly the same HMM for all subjects, var_gamma should be set to 0. Note that if data for only 1 subject is simulated (i.e., n = 1), var_gamma is set to 0.

var_emiss

Either a numeric vector with length n_dep or a list of n_dep matrices denoting the amount of variance between subjects in the emission distribution(s). When opting for a list of matrices: if data_distr = 'categorical', each element of the list is a matrix with m rows and q_emiss[k] - 1 columns for each of the k in n_dep emission distribution(s) between subject variances; if data_distr = 'continuous', each element is a matrix with m rows and 1 column; the single column denoting the variance between the subjects' means of state i (row i) of the normal distribution used as prior for the Normal emissions. If data_distr = 'count', each element is a matrix with m rows and 1 column; the single column denoting the logvariance between the subjects' logmeans of state i (row i) of the lognormal distribution used as prior for the Poisson emissions (note: by default the logvariances should be specified in the natural scale; see argument log_scale below).

For categorical data, this value corresponds to the variance of the parameters of the Multinomial distribution (i.e., the intercepts of the regression equation of the Multinomial distribution used to sample the components of the emission distribution), see details below. For continuous data, this value corresponds to the variance in the mean of the emission distribution(s) across subjects. For count data, it corresponds to the variance in the logmean of the emission distribution(s) across subjects and by should be specified in the natural scale (see argument log_scale below). Note that if only one variance value provided for each emission distribution, the variance is assumed fixed across states (and, for the categorical distribution, categories within a state) within an emission distribution. The default equals 0.1, which corresponds to little variation between subjects given categorical observations. If one wants to simulate data from exactly the same HMM for all subjects, var_emiss should be set to a vector of 0's. Note that if data for only 1 subject is simulated (i.e., n = 1), var_emiss is set to a vector of 0's.

return_ind_par

A logical scalar. Should the subject specific transition probability matrix gamma and emission probability matrix emiss_distr be returned by the function (return_ind_par = TRUE) or not (return_ind_par = FALSE). The default equals return_ind_par = FALSE.

m

The argument m is deprecated; please specify using the input parameter gen.

n_dep

The argument n_dep is deprecated; please specify using the input parameter gen.

q_emiss

The argument q_emiss is deprecated; please specify using the input parameter gen (only to be specified when simulating categorical data).

log_scale

A logical scalar. If data_distr = 'count', should emiss_distr (i.e., the sample average logmeans) and var_emiss (i.e., the between subject logvariances) of the lognormal prior adopted for the count be specified in the logarithmic scale (log_scale = TRUE) or the natural (i.e., real positive numbers) scale (log_scale = FALSE). The default equals log_scale = FALSE. Note that if covariates beta are used to predict the count emission distribution, then the logarithmic scale should be used for the inputs emiss_distr, beta, and var_emiss, and also set log_scale = TRUE.

Details

In simulating the data, having a multilevel structure means that the parameters for each subject are sampled from the population level distribution of the corresponding parameter. The user specifies the population distribution for each parameter: the average population transition probability matrix and its variance, and the average population emission distribution and its variance. For now, the variance of the mean population parameters is assumed fixed for all components of the transition probability matrix and for all components of the emission distribution.

One can simulate multivariate data. That is, the hidden states depend on more than 1 observed variable simultaneously. The distributions of multiple dependent variables for multivariate data are assumed to be independent, and all distributions for one dataset have to be of the same type (either categorical or continuous).

Note that the subject specific initial state distributions (i.e., the probability of each of the states at the first time point) needed to simulate the data are obtained from the stationary distributions of the subject specific transition probability matrices gamma.

beta: As the first element in each row of gamma is used as reference category in the Multinomial logistic regression, the first matrix in the list beta used to predict transition probability matrix gamma has a number of rows equal to m and the number of columns equal to m - 1. The first element in the first row corresponds to the probability of switching from state one to state two. The second element in the first row corresponds to the probability of switching from state one to state three, and so on. The last element in the first row corresponds to the probability of switching from state one to the last state. The same principle holds for the second matrix in the list beta used to predict categorical emission distribution(s) emiss_distr: the first element in the first row corresponds to the probability of observing category two in state one. The second element in the first row corresponds to the probability of observing category three is state one, and so on. The last element in the first row corresponds to the probability of observing the last category in state one.

Note that when simulating count data (data_distr = 'count'), by default the emission distribution parameters emiss_distr and var_emiss are specified in the natural (positive real numbers) scale (as opposite to the logarithmic scale). If the user wants to manually specify these values on the logarithmic scale, please set the argument log_scale = TRUE. Also note that if covariates are used to predict a count emission distribution, then the logarithmic scale should be used for the inputs emiss_distr, beta, and var_emiss, and also set log_scale = TRUE.

Examples

Run this code


## Examples on univariate categorical data
# simulating data for 10 subjects with each 100 categorical observations
n_t     <- 100
n       <- 10
m       <- 3
n_dep   <- 1
q_emiss <- 4
gamma   <- matrix(c(0.8, 0.1, 0.1,
                    0.2, 0.7, 0.1,
                    0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0, 0.0,
                             0.1, 0.1, 0.8, 0.0,
                             0.0, 0.0, 0.1, 0.9), nrow = m, ncol = q_emiss, byrow = TRUE))
data1 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = 1, var_emiss = 1)
head(data1$obs)
head(data1$states)

# including a covariate to predict (only) the transition probability matrix gamma
beta      <- rep(list(NULL), 2)
beta[[1]] <- matrix(c(0.5, 1.0,
                     -0.5, 0.5,
                      0.0, 1.0), byrow = TRUE, ncol = 2)
xx_vec      <- rep(list(NULL),2)
xx_vec[[1]] <-  c(rep(0,5), rep(1,5))
data2 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, beta = beta, xx_vec = xx_vec,
                  var_gamma = 1, var_emiss = 1)


# simulating subject specific transition probability matrices and emission distributions only
n_t <- 0
n <- 5
m <- 3
n_dep   <- 1
q_emiss <- 4
gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0, 0.0,
                             0.1, 0.1, 0.8, 0.0,
                             0.0, 0.0, 0.1, 0.9), nrow = m, ncol = q_emiss, byrow = TRUE))
data3 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = 1, var_emiss = 1)
data3

data4 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = .5, var_emiss = .5)
data4


## Example on multivariate continuous data
# simulating multivariate continuous data
n_t     <- 100
n       <- 10
m       <- 3
n_dep   <- 2

gamma   <- matrix(c(0.8, 0.1, 0.1,
                    0.2, 0.7, 0.1,
                    0.2, 0.2, 0.6), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c( 50, 10,
                              100, 10,
                              150, 10), nrow = m, byrow = TRUE),
                    matrix(c(5, 2,
                             10, 5,
                             20, 3), nrow = m, byrow = TRUE))

data_cont <- sim_mHMM(n_t = n_t, n = n, data_distr = 'continuous',
                      gen = list(m = m, n_dep = n_dep),
                      gamma = gamma, emiss_distr = emiss_distr,
                      var_gamma = .5, var_emiss = c(5^2, 0.2^2))

head(data_cont$states)
head(data_cont$obs)


## Example on multivariate count data without covariates
n_t     <- 200     # Number of observations on the dependent variable
m       <- 3        # Number of hidden states
n_dep   <- 2        # Number of dependent variables
n_subj  <- 30        # Number of subjects

gamma   <- matrix(c(0.9, 0.05, 0.05,
                    0.2, 0.7, 0.1,
                    0.2,0.3, 0.5), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c(20,
                             10,
                             5), nrow = m, byrow = TRUE),
                    matrix(c(50,
                             3,
                             20), nrow = m, byrow = TRUE))

# Define between subject variance to use on the simulating function:
# here, the variance is varied over states within the dependent variable.
var_emiss <- list(matrix(c(5.0, 3.0, 1.5), nrow = m),
                  matrix(c(5.0, 5.0, 5.0), nrow = m))

# Simulate count data:
data_count <- sim_mHMM(n_t = n_t,
                       n = n_subj,
                       data_distr = "count",
                       gen = list(m = m, n_dep = n_dep),
                       gamma = gamma,
                       emiss_distr = emiss_distr,
                       var_gamma = 0.1,
                       var_emiss = var_emiss,
                       return_ind_par = TRUE)


## Example on multivariate count data with covariates
# Simulate data with one covariate for each count dependent variable

n_t     <- 200     # Number of observations on the dependent variable
m       <- 3        # Number of hidden states
n_dep   <- 3        # Number of dependent variables
n_subj  <- 30        # Number of subjects

gamma   <- matrix(c(0.9, 0.05, 0.05,
                    0.2, 0.7, 0.1,
                    0.2,0.3, 0.5), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c(20,
                             10,
                             5), nrow = m, byrow = TRUE),
                    matrix(c(15,
                             2,
                             5), nrow = m, byrow = TRUE),
                    matrix(c(50,
                             3,
                             20), nrow = m, byrow = TRUE))

# Since we have covariates, we specify inputs in the log scale:
emiss_distr_log <- lapply(emiss_distr, function(e) log(e))

# Define list of vectors of covariate values
set.seed(42)
xx_vec <- c(list(NULL),rep(list(rnorm(n_subj,mean = 0, sd = 0.1)),3))

# Define object beta with regression coefficients for the three dependent variables
beta      <- rep(list(NULL), n_dep+1)
beta[[2]] <- matrix(c(1,-1,0), byrow = TRUE, ncol = 1)
beta[[3]] <- matrix(c(2,0,-2), byrow = TRUE, ncol = 1)
beta[[4]] <- matrix(c(-1,3,1), byrow = TRUE, ncol = 1)

# Calculate logvar to use on the simulating function:
logvar <- var_to_logvar(gen = list(m = m, n_dep = n_dep),
                        emiss_mu = emiss_distr,
                        var_emiss =  list(rep(4, m),
                                          rep(2, m),
                                          rep(5, m)),
                        byrow = FALSE)

# Put logvar in the right format:
logvar <- lapply(logvar, function(q) matrix(q, nrow = m))

# Simulate count data:
data_count <- sim_mHMM(n_t = n_t,
                       n = n_subj,
                       data_distr = "count",
                       gen = list(m = m, n_dep = n_dep),
                       gamma = gamma,
                       emiss_distr = emiss_distr_log,
                       xx_vec = xx_vec,
                       beta = beta,
                       var_gamma = 0.1,
                       var_emiss = logvar,
                       return_ind_par = TRUE,
                       log_scale = TRUE)

head(data_count$states)
head(data_count$obs)