gen.data: Generate simulated data

Description

Generate data for simulations under the generalized linear model and Cox model.

Usage

gen.data(
  n,
  p,
  k = NULL,
  rho = 0,
  family = c("gaussian", "binomial", "poisson", "cox"),
  beta = NULL,
  cortype = 1,
  snr = 10,
  censoring = TRUE,
  c = 1,
  scal,
  sigma = 1,
  seed = 1
)

Arguments

The number of observations.

The number of predictors of interest.

The number of nonzero coefficients in the underlying regression model. Can be omitted if beta is supplied.

rho

A parameter used to characterize the pairwise correlation in predictors. Default is 0.

family

The distribution of the simulated data. "gaussian" for gaussian data."binomial" for binary data. "poisson" for count data. "cox" for survival data.

beta

The coefficient values in the underlying regression model.

cortype

The correlation structure. cortype = 1 denotes the exponential structure, where the covariance matrix has $(i,j)$ entry equals $rho^{|i-j|}$. codecortype = 2 denotes the constant structure, where the $(i,j)$ entry of covariance matrix is $rho$ for every $i \neq j$ and 1 elsewhere. cortype = 3 denotes the moving average structure. Details can be found below.

snr

A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as as the variance of $x\beta$ divided by the variance of a gaussian noise: $\frac{Var(x\beta)}{\sigma^2}$. The gaussian noise $\epsilon$ is set with mean 0 and variance. The noise is added to the linear predictor $\eta$ = $x\beta$. Default is snr = 10. This option is invalid for cortype = 3.

censoring

Whether data is censored or not. Valid only for family = "cox". Default is TRUE.

The censoring rate. Default is 1.

scal

A parameter in generating survival time based on the Weibull distribution. Only used for the "cox" family.

sigma

A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance $\sigma^2$. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio. Valid only for cortype = 3.

seed

seed to be used in generating the random numbers.

Value

Design matrix of predictors.

Response variable.

Tbeta

The coefficients used in the underlying regression model.

Details

We generate an $n \times p$ random Gaussian matrix $X$ with mean 0 and a covariance matrix with an exponential structure or a constant structure. For the exponential structure, the covariance matrix has $(i,j)$ entry equals $rho^{|i-j|}$. For the constant structure, the $(i,j)$ entry of the covariance matrix is $rho$ for every $i \neq j$ and 1 elsewhere. For the moving average structure, For the design matrix $X$, we first generate an $n \times p$ random Gaussian matrix $\bar{X}$ whose entries are i.i.d. $\sim N(0,1)$ and then normalize its columns to the $\sqrt n$ length. Then the design matrix $X$ is generated with $X_j = \bar{X}_j + \rho(\bar{X}_{j+1}+\bar{X}_{j-1})$ for $j=2,\dots,p-1$.

For family = "gaussian" , the data model is $$Y = X \beta + \epsilon.$$ The underlying regression coefficient $\beta$ has uniform distribution [m, 100m], $m=5 \sqrt{2log(p)/n}.$

For family= "binomial", the data model is $$Prob(Y = 1) = \exp(X \beta + \epsilon)/(1 + \exp(X \beta + \epsilon)).$$ The underlying regression coefficient $\beta$ has uniform distribution [2m, 10m], $m = 5\sigma \sqrt{2log(p)/n}.$

For family = "poisson" , the data is modeled to have an exponential distribution: $$Y = Exp(\exp(X \beta + \epsilon)).$$

For family = "cox", the data model is $$T = (-\log(S(t))/\exp(X \beta))^{1/scal}.$$ The centering time is generated from uniform distribution $[0, c]$, then we define the censor status as $\delta = I\{T \leq C\}, R = min\{T, C\}$. The underlying regression coefficient $\beta$ has uniform distribution [2m, 10m], $m = 5\sigma \sqrt{2log(p)/n}.$ In the above models, $\epsilon \sim N(0, \sigma^2 ),$ where $\sigma^2$ is determined by the snr.

References

Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.

Examples

Run this code

# NOT RUN {
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
SNR <- 10
cortype <- 1
seed <- 10
Data <- gen.data(n, p, k, rho, family = "gaussian", cortype = cortype, snr = SNR, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lm.bss <- bess(x, y, method = "sequential")
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bess(x, y, type = "bsrr", method = "pgsection")
# }

Run the code above in your browser using DataLab