gen.data: Generate simulated data

Description

Generate data for simulations under the generalized linear model and Cox model.

Usage

gen.data(n, p, family, K, rho = 0, sigma = 1, beta = NULL, censoring = TRUE,
           c = 1, scal)

Value

A list with the following components: x, y, Tbeta.

x: Design matrix of predictors.
y: Response variable.
Tbeta: The coefficients used in the underlying regression model.

Arguments

n: The number of observations.
p: The number of predictors of interest.
family: The distribution of the simulated data: "gaussian" for Gaussian data, "binomial" for binary data, "cox" for survival data.
K: The number of nonzero coefficients in the underlying regression model.
rho: A parameter used to characterize the pairwise correlation in predictors. Default is 0.
sigma: A parameter used to control the signal-to-noise ratio. For linear regression, it is the error variance $\sigma^2$. For logistic regression and Cox's model, the larger the value of sigma, the higher the signal-to-noise ratio.
beta: The coefficient values in the underlying regression model.
censoring: Whether the data are censored. Default is TRUE.
c: The censoring rate. The censoring time is generated from the uniform distribution on [0, c] (see Details). Default is 1.
scal: A parameter in generating survival time based on the Weibull distribution. Only used for the "cox" family.

Author

Canhong Wen, Aijun Zhang, Shijie Quan, and Xueqin Wang.

Details

For the design matrix $X$, we first generate an $n \times p$ random Gaussian matrix $\bar{X}$ whose entries are i.i.d. $\sim N(0,1)$ and then normalize its columns to length $\sqrt{n}$. The design matrix $X$ is then generated with $X_j = \bar{X}_j + \rho(\bar{X}_{j+1}+\bar{X}_{j-1})$ for $j=2,\dots,p-1$.
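The construction above can be sketched in base R as follows. This is only an illustration, not the package's source: `sketch_design` is a hypothetical helper, and leaving the first and last columns equal to the corresponding columns of $\bar{X}$ is an assumption, since the text only defines $X_j$ for $j=2,\dots,p-1$.

```r
# Hypothetical helper (not part of BeSS) illustrating the design-matrix recipe.
sketch_design <- function(n, p, rho = 0) {
  # n x p matrix with i.i.d. N(0, 1) entries
  xbar <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # rescale every column to Euclidean length sqrt(n)
  xbar <- sweep(xbar, 2, sqrt(colSums(xbar^2)), "/") * sqrt(n)
  x <- xbar
  if (p >= 3) {
    j <- 2:(p - 1)
    # add rho times the two neighbouring columns to each interior column
    x[, j] <- xbar[, j] + rho * (xbar[, j + 1] + xbar[, j - 1])
  }
  # assumption: columns 1 and p are kept as in xbar
  x
}

set.seed(1)
x <- sketch_design(n = 100, p = 6, rho = 0.2)
dim(x)
```

With `rho = 0` the result is just the column-normalized Gaussian matrix; a larger `rho` makes neighbouring columns more strongly correlated.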

For the "gaussian" family, the data model is $$Y = X \beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2).$$ The nonzero entries of the underlying regression coefficient $\beta$ are drawn from the uniform distribution on $[m, 100m]$, where $m = 5\sqrt{2\log(p)/n}$.
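A minimal base-R sketch of the Gaussian model, under stated assumptions: the design matrix is a plain Gaussian stand-in, the $K$ nonzero positions are chosen at random (the text does not say how the support is selected), and the noise standard deviation is taken to be `sigma`, following the formula in this section.

```r
set.seed(2)
n <- 200; p <- 20; K <- 5; sigma <- 1

x <- matrix(rnorm(n * p), n, p)        # stand-in design matrix
m <- 5 * sqrt(2 * log(p) / n)          # lower end of the coefficient range

beta <- numeric(p)
support <- sample(p, K)                # assumption: random support of size K
beta[support] <- runif(K, min = m, max = 100 * m)

# Y = X beta + epsilon, epsilon ~ N(0, sigma^2)
y <- drop(x %*% beta) + rnorm(n, mean = 0, sd = sigma)
```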

For the "binomial" family, the data model is $$\mathrm{Prob}(Y = 1) = \frac{\exp(X \beta)}{1 + \exp(X \beta)}.$$ The nonzero entries of the underlying regression coefficient $\beta$ are drawn from the uniform distribution on $[2m, 10m]$, where $m = 5\sigma \sqrt{2\log(p)/n}$.
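The logistic model can be sketched the same way. As before, this is an illustration rather than the package's code: the design matrix is a Gaussian stand-in and the random choice of the $K$ nonzero positions is an assumption.

```r
set.seed(3)
n <- 200; p <- 20; K <- 5; sigma <- 1

x <- matrix(rnorm(n * p), n, p)            # stand-in design matrix
m <- 5 * sigma * sqrt(2 * log(p) / n)

beta <- numeric(p)
beta[sample(p, K)] <- runif(K, min = 2 * m, max = 10 * m)  # assumption: random support

# plogis(eta) = exp(eta) / (1 + exp(eta))
prob <- plogis(drop(x %*% beta))
y <- rbinom(n, size = 1, prob = prob)      # one Bernoulli draw per observation
```

Because $m$ scales with `sigma`, a larger `sigma` gives larger coefficients and therefore probabilities closer to 0 or 1, which is the sense in which it raises the signal-to-noise ratio for this family.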

For the "cox" family, the data model is $$T = \left(\frac{-\log S(t)}{\exp(X \beta)}\right)^{1/\mathrm{scal}}.$$ The censoring time $C$ is generated from the uniform distribution on $[0, c]$; the censoring status is then defined as $\delta = I\{T \le C\}$ and the observed time as $R = \min\{T, C\}$. The nonzero entries of the underlying regression coefficient $\beta$ are drawn from the uniform distribution on $[2m, 10m]$, where $m = 5\sigma \sqrt{2\log(p)/n}$.
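The survival-time recipe can be sketched in base R as below. Several choices here are assumptions made only for illustration: $S(t)$ is replaced by a Uniform(0, 1) draw (the usual inverse-transform device), the value `scal <- 10` is arbitrary, `c_max` is a hypothetical name standing in for the argument `c`, and the support of $\beta$ is chosen at random.

```r
set.seed(4)
n <- 200; p <- 20; K <- 5; sigma <- 1
scal  <- 10    # arbitrary illustrative value
c_max <- 1     # upper bound of the censoring time (argument `c` in the text)

x <- matrix(rnorm(n * p), n, p)            # stand-in design matrix
m <- 5 * sigma * sqrt(2 * log(p) / n)
beta <- numeric(p)
beta[sample(p, K)] <- runif(K, min = 2 * m, max = 10 * m)  # assumption: random support

u <- runif(n)                              # assumption: plays the role of S(t)
surv_time <- (-log(u) / exp(drop(x %*% beta)))^(1 / scal)  # T
cens_time <- runif(n, min = 0, max = c_max)                # C ~ U[0, c]

status   <- as.integer(surv_time <= cens_time)  # delta = I{T <= C}
obs_time <- pmin(surv_time, cens_time)          # R = min{T, C}
```

Here `status` equals 1 when the event time is observed and 0 when the observation is censored at `cens_time`.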

References

Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.

Examples

library(BeSS)

# Generate simulated data
n <- 500
p <- 20
K <- 10
sigma <- 1
rho <- 0.2
data <- gen.data(n, p, family = "gaussian", K, rho, sigma)

# Best subset selection
fit <- bess(data$x, data$y, family = "gaussian")
