SimulateRegression: Data simulation for multivariate regression

Description

Simulates data with outcome(s) and predictors, where only a subset of the predictors actually contributes to the definition of the outcome(s).

Usage

SimulateRegression(
  n = 100,
  pk = 10,
  xdata = NULL,
  family = "gaussian",
  q = 1,
  theta = NULL,
  nu_xy = 0.2,
  beta_abs = c(0.1, 1),
  beta_sign = c(-1, 1),
  continuous = TRUE,
  ev_xy = 0.7
)

Value

A list with:

xdata: input or simulated predictor data.
ydata: simulated outcome data.
beta: matrix of true beta coefficients used to generate outcomes in ydata from predictors in xdata.
theta: binary matrix indicating the predictors from xdata contributing to the definition of each of the outcome variables in ydata.
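
The relationship between the returned beta and theta matrices can be sketched in base R. This is a minimal illustration, assuming theta is simply the support (nonzero pattern) of beta; the small beta matrix and the object name contributors are hypothetical and are not produced by SimulateRegression.

```r
# Hypothetical coefficients: 4 predictors (rows), 2 outcomes (columns).
# matrix() fills column by column, so the first four values are outcome 1.
beta <- matrix(c(
  0.5, 0, -0.3, 0,
  0, 0, 0.8, 0.2
), nrow = 4, ncol = 2)

# Assumed link between the two outputs: theta flags the nonzero entries of beta
theta <- ifelse(beta != 0, yes = 1, no = 0)

# Indices of the predictors contributing to each outcome
contributors <- lapply(seq_len(ncol(theta)), function(j) which(theta[, j] == 1))
```

With the actual function output, the same lapply() call applied to simul$theta would list the contributing predictors for each outcome.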

Arguments

n: number of observations in the simulated dataset. Not used if xdata is provided.
pk: number of predictor variables. Only a subset of these variables contributes to the outcome definition (see argument nu_xy). Not used if xdata is provided.
xdata: optional data matrix for the predictors with variables as columns and observations as rows. Only a subset of these variables contributes to the outcome definition (see argument nu_xy).
family: type of regression model. Possible values include "gaussian" for continuous outcome(s) or "binomial" for binary outcome(s).
q: number of outcome variables.
theta: binary matrix with as many rows as predictors and as many columns as outcomes. A nonzero entry on row \(i\) and column \(j\) indicates that predictor \(i\) contributes to the definition of outcome \(j\).
nu_xy: vector of length q with expected proportion of predictors contributing to the definition of each of the q outcomes.
beta_abs: vector defining the absolute values of the nonzero regression coefficients. If continuous=FALSE, beta_abs is the set of possible absolute values. If continuous=TRUE, beta_abs gives the range of possible absolute values. Note that if family="binomial", the regression coefficients are re-scaled to ensure that the desired concordance statistic can be achieved (see argument ev_xy), so they may fall outside this range.
beta_sign: vector of possible signs for regression coefficients. Possible inputs are: 1 for positive coefficients, -1 for negative coefficients, or c(-1, 1) for both positive and negative coefficients.
continuous: logical indicating whether to sample regression coefficients from a uniform distribution between the minimum and maximum values in beta_abs (if continuous=TRUE) or from proposed values in beta_abs (if continuous=FALSE).
ev_xy: vector of length q with expected goodness of fit measures for each of the q outcomes. If family="gaussian", the vector contains expected proportions of variance in each of the q outcomes that can be explained by the predictors. If family="binomial", the vector contains expected concordance statistics (i.e. area under the ROC curve) given the true probabilities.
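
To make the roles of nu_xy, beta_abs, beta_sign, continuous and ev_xy concrete, the following base R sketch mimics the kind of data-generating process these arguments describe for a single Gaussian outcome. It is an illustration under stated assumptions (independent standard normal predictors, Bernoulli sampling of contributing predictors, noise variance chosen from the sample variance of the signal) and not the package's internal implementation. All object names are hypothetical.

```r
# Illustrative sketch only: NOT the internal code of SimulateRegression
set.seed(1)
n <- 1000
p <- 15
nu_xy <- 0.2
beta_abs <- c(0.1, 1)
beta_sign <- c(-1, 1)
ev_xy <- 0.7

# Assumption: independent standard normal predictors
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Each predictor contributes to the outcome with probability nu_xy
theta <- rbinom(p, size = 1, prob = nu_xy)
if (sum(theta) == 0) theta[1] <- 1 # hypothetical guard against an empty model

# Analogue of continuous = TRUE: uniform magnitudes within range(beta_abs),
# with signs drawn from beta_sign
magnitudes <- runif(p, min = min(beta_abs), max = max(beta_abs))
signs <- sample(beta_sign, size = p, replace = TRUE)
beta <- theta * magnitudes * signs

# Noise variance chosen so that var(signal) / var(outcome) is close to ev_xy
signal <- as.vector(x %*% beta)
sigma2 <- var(signal) * (1 - ev_xy) / ev_xy
y <- signal + rnorm(n, sd = sqrt(sigma2))

# Estimated proportion of explained variance, to compare with ev_xy
r2 <- summary(lm(y ~ x))$r.squared
```

For the analogue of continuous = FALSE, the magnitudes would instead be drawn from the listed values, for example with sample(beta_abs, size = p, replace = TRUE).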

Examples

# \donttest{
## Independent predictors

# Univariate continuous outcome
set.seed(1)
simul <- SimulateRegression(pk = 15)
summary(simul)

# Univariate binary outcome
set.seed(1)
simul <- SimulateRegression(pk = 15, family = "binomial")
table(simul$ydata)

# Multiple continuous outcomes
set.seed(1)
simul <- SimulateRegression(pk = 15, q = 3)
summary(simul)


## Blocks of correlated predictors

# Simulation of predictor data
set.seed(1)
xsimul <- SimulateGraphical(pk = rep(5, 3), nu_within = 0.8, nu_between = 0, v_sign = -1)
Heatmap(cor(xsimul$data),
  legend_range = c(-1, 1),
  col = c("navy", "white", "darkred")
)

# Simulation of outcome data
simul <- SimulateRegression(xdata = xsimul$data)
print(simul)
summary(simul)


## Choosing expected proportion of explained variance

# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 1000, pk = 15, q = 3, ev_xy = c(0.9, 0.5, 0.2))
summary(simul)

# Comparing with estimated proportion of explained variance
summary(lm(simul$ydata[, 1] ~ simul$xdata))
summary(lm(simul$ydata[, 2] ~ simul$xdata))
summary(lm(simul$ydata[, 3] ~ simul$xdata))


## Choosing expected concordance (AUC)

# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 500, pk = 10,
  family = "binomial", ev_xy = 0.9
)

# Comparing with estimated concordance
fitted <- glm(simul$ydata ~ simul$xdata,
  family = "binomial"
)$fitted.values
Concordance(observed = simul$ydata, predicted = fitted)
# }
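
The concordance statistic referred to in ev_xy can also be computed by hand, which helps when checking the last example. The sketch below assumes that Concordance() estimates the area under the ROC curve, as the description of ev_xy suggests. The helper name auc_by_ranks is hypothetical and is not part of the package; it uses the rank-sum (Mann-Whitney) formulation with average ranks for ties.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
auc_by_ranks <- function(observed, predicted) {
  observed <- as.vector(observed)
  predicted <- as.vector(predicted)
  n1 <- sum(observed == 1) # number of cases
  n0 <- sum(observed == 0) # number of controls
  r <- rank(predicted) # average ranks are used for ties
  (sum(r[observed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data: 3 of the 4 case-control pairs are correctly ordered
toy_auc <- auc_by_ranks(
  observed = c(0, 0, 1, 1),
  predicted = c(0.1, 0.4, 0.35, 0.8)
)
```

Applied to the objects from the last example, auc_by_ranks(simul$ydata, fitted) should give a value comparable to the output of Concordance(), provided the assumption above holds.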
