Best subset selection with a specified model size for generalized linear models and Cox's proportional hazard model.
bess.one(
x,
y,
family = c("gaussian", "binomial", "poisson", "cox"),
type = c("bss", "bsrr"),
s,
lambda = 0,
always.include = NULL,
screening.num = NULL,
normalize = NULL,
weight = NULL,
max.iter = 20,
group.index = NULL
)
x: Input matrix, of dimension n x p; each row is an observation vector.
y: The response variable, of n observations. For family = "binomial", y should be a factor with two levels. For family = "poisson", y should be a vector of positive integers. For family = "cox", y should be a two-column matrix with columns named time and status.
family: One of the following models: "gaussian", "binomial", "poisson", or "cox", depending on the response.
type: One of the two types of problems: type = "bss" for the best subset selection, and type = "bsrr" for the best subset ridge regression.
s: A specified model size.
lambda: A shrinkage parameter for "bsrr".
always.include: An integer vector containing the indices of variables that should always be included in the model.
screening.num: Users can pre-exclude some irrelevant variables according to maximum marginal likelihood estimators before fitting a model by passing an integer to screening.num; sure independence screening will then choose a set of variables of this size, and the active set updates are restricted to this subset.
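Conceptually, the screening step ranks predictors by their marginal association with the response and keeps only the top screening.num of them before the subset search begins. A minimal illustrative sketch for the Gaussian case (this is not the package's internal code; it uses marginal correlation as the screening statistic):

```r
set.seed(1)
n <- 100; p <- 500
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 3 * x[, 2] + rnorm(n)

screening.num <- 50
# rank predictors by absolute marginal correlation with y
keep <- order(abs(cor(x, y)), decreasing = TRUE)[1:screening.num]
# the active set updates would then be restricted to columns in `keep`
```

With strong signals on the first two columns, both survive the screen with high probability.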
normalize: Options for normalization. normalize = 0 for no normalization. normalize = 1 will only subtract the means of the columns of x. normalize = 2 for scaling the columns of x to have sqrt(n) norm. normalize = 3 for subtracting the means of the columns of x and y, and also normalizing the columns of x to have sqrt(n) norm. If normalize = NULL (the default), normalize will be set to 1 for "gaussian", 2 for "binomial" and "poisson", and 3 for "cox".
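The transformations behind these options can be reproduced by hand; a minimal sketch, assuming the scaling targets a sqrt(n) column norm (the convention in this family of packages):

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)
n <- nrow(x)

# centering only: subtract the column means of x
x1 <- sweep(x, 2, colMeans(x))

# scaling only (assumed convention): columns of x get sqrt(n) norm
x2 <- sweep(x, 2, sqrt(colSums(x^2) / n), "/")

# center, then rescale the centered columns to sqrt(n) norm
xc <- sweep(x, 2, colMeans(x))
x3 <- sweep(xc, 2, sqrt(colSums(xc^2) / n), "/")

round(colMeans(x1), 10)   # all zero after centering
round(colSums(x3^2), 10)  # each column has squared norm n
```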
weight: Observation weights. Default is 1 for each observation.
max.iter: The maximum number of iterations in the bess function. In most cases, only a few iterations are needed to guarantee convergence. Default is 20.
group.index: A vector of integers indicating which group each variable belongs to. Variables in the same group should be located in adjacent columns of x, and their corresponding entries in group.index should be the same. Denote the first group as 1, the second as 2, etc. If you do not fit a model with a group structure, set group.index = NULL. Default is NULL.
A list with class attribute 'bess' and named components:
The best fitting coefficients.
The best fitting intercept.
The best fitted model for type = "bss"; its class is "lm", "glm" or "coxph".
The training loss of the fitted model.
The model size.
The shrinkage parameter.
Type of the model.
The sample size.
Either "bss" or "bsrr".
Given a model size s, bess.one considers the best subset selection problem

min_beta -2 logL(beta), subject to ||beta||_0 = s,

where for type = "bsrr" a ridge penalty lambda ||beta||_2^2 is added to the objective. In the GLM case, logL is the log-likelihood function; in the Cox model, it is the log partial likelihood function.

The best subset selection problem is solved by the primal-dual active set algorithm; see Wen et al. (2017) for details. This algorithm utilizes an active set updating strategy via primal and dual variables, and fits the sub-model by exploiting the fact that their support sets are non-overlapping and complementary.
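For intuition, the active set update can be sketched for the linear model as a toy implementation; this is an illustrative sketch only, not the package's code, and the initialization and swapping rule are simplified assumptions:

```r
# Toy primal-dual active set iteration for min ||y - X b||^2
# subject to ||b||_0 = s (linear model only; illustrative sketch).
pdas_lm <- function(X, y, s, max.iter = 20) {
  n <- nrow(X); p <- ncol(X)
  A <- order(abs(crossprod(X, y)), decreasing = TRUE)[1:s]  # initial active set
  for (it in seq_len(max.iter)) {
    beta <- numeric(p)
    beta[A] <- qr.solve(X[, A, drop = FALSE], y)         # primal: least squares on A
    d <- as.numeric(crossprod(X, y - X %*% beta)) / n    # dual: residual correlations
    A.new <- order(abs(beta + d), decreasing = TRUE)[1:s]  # combine primal and dual
    if (setequal(A.new, A)) break                        # support set stabilized
    A <- A.new
  }
  beta
}

set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X %*% c(3, -2, 0, 0, 1.5, rep(0, 5)) + rnorm(100, sd = 0.1)
which(pdas_lm(X, y, s = 3) != 0)  # typically recovers columns 1, 2, 5
```

For active variables the dual d is near zero (residual is orthogonal to the active columns), and for inactive variables beta is zero, so |beta + d| compares primal magnitudes against dual violations, which is what drives the swap.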
Wen, C., Zhang, A., Quan, S. and Wang, X. (2020). BeSS: An R Package for Best Subset Selection in Linear, Logistic and Cox Proportional Hazards Models, Journal of Statistical Software, Vol. 94(4). doi:10.18637/jss.v094.i04.
# NOT RUN {
#-------------------linear model----------------------#
# Generate simulated data
n <- 200
p <- 20
k <- 5
rho <- 0.4
seed <- 10
Tbeta <- rep(0, p)
Tbeta[1:k * floor(p/k)] <- rep(1, k)
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
lm.bss <- bess.one(x, y, s = 5)
lm.bsrr <- bess.one(x, y, type = "bsrr", s = 5, lambda = 0.01)
coef(lm.bss)
coef(lm.bsrr)
print(lm.bss)
print(lm.bsrr)
summary(lm.bss)
summary(lm.bsrr)
pred.bss <- predict(lm.bss, newx = x_new)
pred.bsrr <- predict(lm.bsrr, newx = x_new)
#-------------------logistic model----------------------#
#Generate simulated data
Data <- gen.data(n, p, k, rho, family = "binomial", beta = Tbeta, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
logi.bss <- bess.one(x, y, family = "binomial", s = 5)
logi.bsrr <- bess.one(x, y, type = "bsrr", family = "binomial", s = 5, lambda = 0.01)
coef(logi.bss)
coef(logi.bsrr)
print(logi.bss)
print(logi.bsrr)
summary(logi.bss)
summary(logi.bsrr)
pred.bss <- predict(logi.bss, newx = x_new)
pred.bsrr <- predict(logi.bsrr, newx = x_new)
#-------------------poisson model----------------------#
Data <- gen.data(n, p, k, rho=0.3, family = "poisson", beta = Tbeta, seed = seed)
x <- Data$x[1:140, ]
y <- Data$y[1:140]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200]
poi.bss <- bess.one(x, y, family = "poisson", s=5)
poi.bsrr <- bess.one(x, y, type = "bsrr", family = "poisson", s = 5, lambda = 0.01)
coef(poi.bss)
coef(poi.bsrr)
print(poi.bss)
print(poi.bsrr)
summary(poi.bss)
summary(poi.bsrr)
pred.bss <- predict(poi.bss, newx = x_new)
pred.bsrr <- predict(poi.bsrr, newx = x_new)
#-------------------coxph model----------------------#
#Generate simulated data
Data <- gen.data(n, p, k, rho, family = "cox", beta = Tbeta, scal = 10)
x <- Data$x[1:140, ]
y <- Data$y[1:140, ]
x_new <- Data$x[141:200, ]
y_new <- Data$y[141:200, ]
cox.bss <- bess.one(x, y, family = "cox", s = 5)
cox.bsrr <- bess.one(x, y, type = "bsrr", family = "cox", s = 5, lambda = 0.01)
coef(cox.bss)
coef(cox.bsrr)
print(cox.bss)
print(cox.bsrr)
summary(cox.bss)
summary(cox.bsrr)
pred.bss <- predict(cox.bss, newx = x_new)
pred.bsrr <- predict(cox.bsrr, newx = x_new)
#----------------------High dimensional linear models--------------------#
# }
# NOT RUN {
data <- gen.data(n, p = 1000, k, family = "gaussian", seed = seed)
# Best subset selection with SIS screening
fit <- bess.one(data$x, data$y, screening.num = 100, s = 5)
# }
# NOT RUN {
#-------------------group selection----------------------#
beta <- rep(c(rep(1,2),rep(0,3)), 4)
Data <- gen.data(200, 20, 5, rho=0.4, beta = beta, seed =10)
x <- Data$x
y <- Data$y
group.index <- c(rep(1, 2), rep(2, 3), rep(3, 2), rep(4, 3),
rep(5, 2), rep(6, 3), rep(7, 2), rep(8, 3))
lm.group <- bess.one(x, y, s = 5, type = "bss", group.index = group.index)
lm.groupbsrr <- bess.one(x, y, type = "bsrr", s = 5, lambda = 0.01, group.index = group.index)
coef(lm.group)
coef(lm.groupbsrr)
print(lm.group)
print(lm.groupbsrr)
summary(lm.group)
summary(lm.groupbsrr)
pred.group <- predict(lm.group, newx = x_new)
pred.groupl0l2 <- predict(lm.groupbsrr, newx = x_new)
#-------------------include specified variables----------------------#
Data <- gen.data(n, p, k, rho, family = "gaussian", beta = Tbeta, seed = seed)
lm.bss <- bess.one(Data$x, Data$y, s = 5, always.include = 2)
# }
# NOT RUN {
#-------------------code demonstration in doi: 10.18637/jss.v094.i04----------------------#
Tbeta <- rep(0, 20)
Tbeta[c(1, 2, 5, 9)] <- c(3, 1.5, -2, -1)
data <- gen.data(n = 200, p = 20, family = "gaussian", beta = Tbeta,
rho = 0.2, seed = 123)
fit.one <- bess.one(data$x, data$y, s = 4, family = "gaussian")
print(fit.one)
summary(fit.one)
coef(fit.one, sparse = FALSE)
pred.one <- predict(fit.one, newdata = data$x)
bm.one <- fit.one$bestmodel
summary(bm.one)
# }