Generate data for simulations under the generalized linear model and Cox model.
gen.data(
n,
p,
k = NULL,
rho = 0,
family = c("gaussian", "binomial", "poisson", "cox"),
beta = NULL,
cortype = 1,
snr = 10,
censoring = TRUE,
c = 1,
scal,
sigma = 1,
seed = 1
)
The number of observations.
The number of predictors of interest.
The number of nonzero coefficients in the underlying regression
model. Can be omitted if beta
is supplied.
A parameter used to characterize the pairwise correlation in
predictors. Default is 0
.
The distribution of the simulated data. "gaussian"
for
gaussian data."binomial"
for binary data. "poisson"
for count data. "cox"
for survival data.
The coefficient values in the underlying regression model.
The correlation structure. cortype = 1
denotes the exponential structure,
where the covariance matrix has \((i,j)\) entry equals \(rho^{|i-j|}\).
codecortype = 2 denotes the constant structure, where the \((i,j)\) entry of covariance
matrix is \(rho\) for every \(i \neq j\) and 1 elsewhere. cortype = 3
denotes the moving average
structure. Details can be found below.
A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as
as the variance of \(x\beta\) divided
by the variance of a gaussian noise: \(\frac{Var(x\beta)}{\sigma^2}\).
The gaussian noise \(\epsilon\) is set with mean 0 and variance.
The noise is added to the linear predictor \(\eta\) = \(x\beta\). Default is snr = 10
.
This option is invalid for cortype = 3
.
Whether data is censored or not. Valid only for family = "cox"
. Default is TRUE
.
The censoring rate. Default is 1
.
A parameter in generating survival time based on the Weibull distribution. Only used for the "cox
" family.
A parameter used to control the signal-to-noise ratio. For linear regression,
it is the error variance \(\sigma^2\). For logistic regression and Cox's model,
the larger the value of sigma, the higher the signal-to-noise ratio. Valid only for cortype = 3
.
seed to be used in generating the random numbers.
Design matrix of predictors.
Response variable.
The coefficients used in the underlying regression model.
We generate an \(n \times p\) random Gaussian matrix \(X\) with mean 0 and a covariance matrix with an exponential structure or a constant structure. For the exponential structure, the covariance matrix has \((i,j)\) entry equals \(rho^{|i-j|}\). For the constant structure, the \((i,j)\) entry of the covariance matrix is \(rho\) for every \(i \neq j\) and 1 elsewhere. For the moving average structure, For the design matrix \(X\), we first generate an \(n \times p\) random Gaussian matrix \(\bar{X}\) whose entries are i.i.d. \(\sim N(0,1)\) and then normalize its columns to the \(\sqrt n\) length. Then the design matrix \(X\) is generated with \(X_j = \bar{X}_j + \rho(\bar{X}_{j+1}+\bar{X}_{j-1})\) for \(j=2,\dots,p-1\).
For family = "gaussian"
, the data model is $$Y = X \beta +
\epsilon.$$
The underlying regression coefficient \(\beta\) has uniform distribution [m, 100m], \(m=5 \sqrt{2log(p)/n}.\)
For family= "binomial"
, the data model is $$Prob(Y = 1) = \exp(X
\beta + \epsilon)/(1 + \exp(X \beta + \epsilon)).$$
The underlying regression coefficient \(\beta\) has uniform distribution [2m, 10m], \(m = 5\sigma \sqrt{2log(p)/n}.\)
For family = "poisson"
, the data is modeled to have an exponential distribution: $$Y = Exp(\exp(X \beta +
\epsilon)).$$
For family = "cox"
, the data model is
$$T = (-\log(S(t))/\exp(X \beta))^{1/scal}.$$
The centering time is generated from uniform distribution \([0, c]\),
then we define the censor status as \(\delta = I\{T \leq C\}, R = min\{T, C\}\).
The underlying regression coefficient \(\beta\) has uniform distribution [2m, 10m], \(m = 5\sigma \sqrt{2log(p)/n}.\)
In the above models, \(\epsilon \sim N(0,
\sigma^2 ),\) where \(\sigma^2\) is determined by the snr
.
Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.
# NOT RUN {
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
SNR <- 10
cortype <- 1
seed <- 10
Data <- gen.data(n, p, k, rho, family = "gaussian", cortype = cortype, snr = SNR, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lm.bss <- bess(x, y, method = "sequential")
lambda.list <- exp(seq(log(5), log(0.1), length.out = 10))
lm.bsrr <- bess(x, y, type = "bsrr", method = "pgsection")
# }
Run the code above in your browser using DataLab