grp.criValues(X, y, group, criterion = c("gSIS", "gHOLP", "gAR2", "gDC"),
family = c("gaussian", "binomial", "poisson"), scale = c("standardize",
"normalize", "none"), norm = c("L1", "L2", "Linf"))
The default criterion is "gSIS", the default family is "gaussian", and the default scale is "standardize".
The norm argument gives the norm type for the "gSIS" or "gHOLP" screening criterion. See norm_vec for details. The default is "L1".
In more detail, let $X = (x_{11},x_{12},...,x_{1p_1},...,x_{j1},x_{j2},..., x_{jp_j},...,x_{J1},x_{J2},...,x_{Jp_J})$ be the grouped predictors, where $J$ is the number of groups and $p_j$ is the number of predictors in the $j$-th group.
For family = "gaussian", four criteria are available for computing the criterion values.
The first criterion is "gSIS", the grouped version of sure independence screening [SIS, Fan and Lv (2008)], defined as
$$\hat{w} = X^{T}y = (w_{11},w_{12},...,w_{1p_1},...,w_{j1},w_{j2},...,w_{jp_j},...,w_{J1},w_{J2},...,w_{Jp_J}).$$
For the $j$-th group, we take the norm of the vector $(w_{j1},w_{j2},...,w_{jp_j})$ and divide it by the group size $p_j$; the result is denoted $W_j$. The criterion values for all groups are then
$$\hat{W} = (W_1,...,W_J).$$
The available norm types are described in norm_vec.
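The gSIS computation described above can be sketched in a few lines of base R. This is an illustration only, not the grpss implementation: the helper name gsis_sketch is hypothetical, the L1 norm is assumed, and the columns of X are assumed to be standardized with scale().

```r
# Hypothetical helper, not part of grpss: group-wise SIS values with an L1 norm.
gsis_sketch <- function(X, y, group) {
  Xs <- scale(X)                # assumed preprocessing: standardize each column
  w <- drop(crossprod(Xs, y))   # w = X^T y, one entry per predictor
  # W_j = (L1 norm of the j-th block of w) / p_j
  tapply(abs(w), group, function(v) sum(v) / length(v))
}

set.seed(1)
X <- matrix(rnorm(20 * 6), nrow = 20)   # 20 observations, 6 predictors
y <- 2 * X[, 1] - X[, 2] + rnorm(20)    # only the first group carries signal
group <- rep(1:3, each = 2)             # 3 groups of size 2
W <- gsis_sketch(X, y, group)
print(W)
```

Each entry of W corresponds to one group, so the vector can be sorted to rank the groups.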
The second criterion is "gHOLP", a grouped version of the High-dimensional Ordinary Least-squares Projector [HOLP, Wang and Leng (2015)], defined as
$$\hat{\beta} = X^{T}(XX^{T})^{-1}y = (\beta_{11},\beta_{12},...,\beta_{1p_1},...,\beta_{j1},\beta_{j2},...,\beta_{jp_j},...,\beta_{J1},\beta_{J2},...,\beta_{Jp_J}).$$
The group structure is then incorporated in the same way as for "gSIS".
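A minimal base R sketch of the gHOLP formula follows. It is not the grpss implementation: gholp_sketch is a hypothetical helper, the L1 norm is assumed, and X is used unscaled with more predictors than observations so that $XX^{T}$ is invertible (centering the columns would make $XX^{T}$ singular).

```r
# Hypothetical helper, not part of grpss: group-wise HOLP values with an L1 norm.
gholp_sketch <- function(X, y, group) {
  # beta = X^T (X X^T)^{-1} y; solve() avoids forming the explicit inverse
  beta <- drop(crossprod(X, solve(tcrossprod(X), y)))
  # same grouping step as for gSIS: block norm divided by group size
  tapply(abs(beta), group, function(b) sum(b) / length(b))
}

set.seed(1)
n <- 10; p <- 3; J <- 8
X <- matrix(rnorm(n * p * J), nrow = n)             # 10 x 24, so X X^T is 10 x 10
y <- drop(X[, 1:3] %*% c(2, -1, 1.5)) + rnorm(n)    # signal in the first group
group <- rep(1:J, each = p)
B <- gholp_sketch(X, y, group)
print(B)
```

Using solve(A, y) rather than solve(A) %*% y is a common choice for numerical stability; either form matches the formula.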
The third criterion is "gAR2", the groupwise adjusted R-squared. The basic idea is to fit a linear model for each group separately and compute the adjusted R-squared, which measures the strength of the linear relationship between each group and the response. Note that in order to calculate the adjusted R-squared, the maximum group size $\max(p_j), j=1,...,J$ should not be larger than the sample size $n$.
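The groupwise adjusted R-squared idea can be illustrated with lm() and summary() from base R. The helper gar2_sketch is hypothetical and only mirrors the description above; it assumes every group is much smaller than the sample size.

```r
# Hypothetical helper, not part of grpss: adjusted R-squared of one linear model per group.
gar2_sketch <- function(X, y, group) {
  sapply(split(seq_len(ncol(X)), group), function(idx) {
    fit <- lm(y ~ X[, idx, drop = FALSE])   # regress y on the predictors of one group
    summary(fit)$adj.r.squared
  })
}

set.seed(1)
n <- 50
X <- matrix(rnorm(n * 6), nrow = n)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)   # only group 1 is related to y
group <- rep(1:3, each = 2)
A <- gar2_sketch(X, y, group)
print(A)
```

Groups with a larger adjusted R-squared explain more of the variation in the response on their own.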
The last criterion is "gDC", the grouped distance correlation. The distance correlation [Szekely, Rizzo and Bakirov (2007)] measures the dependence between two random variables or two random vectors. Thus, similar to the idea of "gAR2", we compute the distance correlation between each group and the response. It is worth pointing out that distance correlation can measure not only linear but also nonlinear relationships. However, it may take longer to compute because calculating the distance correlation involves three steps. The distance correlation has also been applied to screening individual variables, as in Li, Zhong and Zhu (2012).
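For readers who want to see what the sample distance correlation involves, here is a base R sketch of the standard estimator, broken into pairwise distances, double centering, and the final ratio. The helpers dcor_sketch and gdc_sketch are hypothetical names; how grpss itself computes the quantity is not stated here.

```r
# Hypothetical helper: sample distance correlation between x and y
# (each a vector or a matrix with one row per observation).
dcor_sketch <- function(x, y) {
  dcenter <- function(z) {
    d <- as.matrix(dist(as.matrix(z)))                    # step 1: pairwise Euclidean distances
    d - outer(rowMeans(d), colMeans(d), "+") + mean(d)    # step 2: double centering
  }
  A <- dcenter(x)
  B <- dcenter(y)
  # step 3: squared distance covariance over the product of distance standard deviations
  dcov2 <- max(mean(A * B), 0)   # guard against tiny negative rounding error
  sqrt(dcov2 / sqrt(mean(A * A) * mean(B * B)))
}

# Hypothetical helper: distance correlation between each group of columns and y.
gdc_sketch <- function(X, y, group) {
  sapply(split(seq_len(ncol(X)), group), function(idx) {
    dcor_sketch(X[, idx, drop = FALSE], y)
  })
}

set.seed(1)
n <- 40
X <- matrix(rnorm(n * 6), nrow = n)
y <- X[, 1]^2 + rnorm(n, sd = 0.1)   # nonlinear dependence on group 1
group <- rep(1:3, each = 2)
D <- gdc_sketch(X, y, group)
print(D)
```

The two n-by-n distance matrices are what make this criterion more expensive than the others as the sample size grows.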
For family = "binomial" or family = "poisson", a different screening criterion is used to measure the relationship between the response and the predictors in each group. The strength of this relationship is measured by Akaike's Information Criterion (AIC), defined as
$$AIC = -2 \cdot LogLikelihood + 2 \cdot npar,$$
where $LogLikelihood$ is the log-likelihood of a fitted generalized linear model, and $npar$ is the number of parameters in the fitted model. In this case, $npar$ is the number of variables within each group, i.e., $npar = p_j, j = 1,...,J$.
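The per-group AIC can be illustrated with glm() and AIC() from base R. The helper gaic_sketch is hypothetical and is not taken from grpss. One assumption to note: R's AIC() also counts the intercept, so it uses $p_j + 1$ parameters rather than $p_j$, which shifts every value by the same constant when all groups have the same size.

```r
# Hypothetical helper, not part of grpss: AIC of one GLM per group.
gaic_sketch <- function(X, y, group, family = binomial()) {
  sapply(split(seq_len(ncol(X)), group), function(idx) {
    fit <- glm(y ~ X[, idx, drop = FALSE], family = family)
    AIC(fit)   # -2 * logLik + 2 * (p_j + 1); the intercept is counted by R
  })
}

set.seed(1)
n <- 100
X <- matrix(rnorm(n * 6), nrow = n)
group <- rep(1:3, each = 2)
y <- rpois(n, lambda = exp(0.5 * X[, 1] - 0.5 * X[, 2]))   # counts driven by group 1
G <- gaic_sketch(X, y, group, family = poisson())
print(G)
```

A smaller AIC indicates a better-fitting model for that group.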
Note that the individual "SIS" and "HOLP" can be regarded as special cases of "gSIS" and "gHOLP", respectively, when each group has only one predictor.
Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129-1139.
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6), 2769-2794.
Wang, X. and Leng, C. (2015). High-dimensional Ordinary Least-squares Projector for screening variables. Journal of the Royal Statistical Society: Series B. To appear.
grpss
library(MASS)   # for mvrnorm
library(grpss)  # for grp.criValues
n <- 30 # sample size
p <- 3 # number of predictors in each group
J <- 50 # number of groups
group <- rep(1:J, each = p) # group indices
Sigma <- diag(p * J) # covariance matrix (identity)
X <- mvrnorm(n, seq(0, 5, length.out = p * J), Sigma)
beta <- runif(12, -2, 5) # coefficients of the first 12 predictors (first 4 groups)
y <- X %*% matrix(c(beta, rep(0, p * J - 12)), ncol = 1) + rnorm(n)
grp.criValues(X, y, group) # gSIS
grp.criValues(X, y, group, "gHOLP") # gHOLP
grp.criValues(X, y, group, "gAR2") # gAR2
grp.criValues(X, y, group, "gDC") # gDC