grp.criValues: Compute group screening criterion values

Description

Computes the screening criterion values for each group.

Usage

grp.criValues(X, y, group, criterion = c("gSIS", "gHOLP", "gAR2", "gDC"),
  family = c("gaussian", "binomial", "poisson"), scale = c("standardize",
  "normalize", "none"), norm = c("L1", "L2", "Linf"))

Arguments

X

A matrix of grouped predictors.

y

A numeric response vector.

group

A vector of group indices for each predictor. Numeric and consecutive group indices are recommended.

criterion

The group screening criterion. The default is "gSIS".

family

A description of the error distribution and link function to be used in the model. The default is "gaussian".

scale

The type of scaling of the predictors. The default is "standardize".

norm

The type of norm used by the "gSIS" or "gHOLP" screening criterion. See norm_vec for details. The default is "L1".

Value

A numeric matrix with two columns: the first column is the group index, and the second column is the screening criterion value of the corresponding group.

Details

In the group screening procedure, we first calculate values that measure the strength of the relationship between all the predictors in each group and the response. These values can be used to screen in the important grouped variables (equivalently, to remove the unimportant grouped variables), which reduces the dimension of the data from high or ultra-high to moderate or even small.

In more detail, let $X = (x_{11},x_{12},...,x_{1p_1},...,x_{j1},x_{j2},..., x_{jp_j},...,x_{J1},x_{J2},...,x_{Jp_J})$ be the grouped predictors, where $J$ is the number of groups and $p_j$ is the number of predictors in the $j$-th group.

When family = "gaussian", four criteria are available for calculating these values.

The first criterion is "gSIS", the grouped version of sure independence screening [SIS, Fan and Lv (2008)]. It starts from $$\hat{w} = X^{T}y = (w_{11},w_{12},...,w_{1p_1},...,w_{j1},w_{j2},..., w_{jp_j},...,w_{J1},w_{J2},...,w_{Jp_J}).$$ For the $j$-th group, we take the norm of the vector $(w_{j1},w_{j2},..., w_{jp_j})$ and divide it by the group size $p_j$; the result is denoted $W_j$. The criterion values for all groups are then $$\hat{W} = (W_1,...,W_J).$$ See norm_vec for details of the norm types.
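As a rough illustration of the gSIS computation described above, the following base R sketch forms $\hat{w} = X^{T}y$ and then averages the absolute values within each group. It assumes the L1 norm and standardized predictors; the simulated data and variable names are illustrative only, and this is not the package's internal code.

```r
# Illustrative sketch of gSIS with the L1 norm (assumed), base R only.
set.seed(1)
n <- 20; p <- 3; J <- 4
group <- rep(1:J, each = p)                      # group index of each column
X <- scale(matrix(rnorm(n * p * J), n, p * J))   # standardized predictors
y <- rnorm(n)

# Marginal statistics: w = X^T y, one entry per predictor
w <- drop(crossprod(X, y))

# W_j = (L1 norm of the j-th group's entries of w) / p_j
W <- sapply(split(w, group), function(v) sum(abs(v)) / length(v))

# Two-column layout: group index, criterion value
cbind(group = as.numeric(names(W)), value = unname(W))
```

Swapping `sum(abs(v))` for `sqrt(sum(v^2))` or `max(abs(v))` would give the L2 and Linf variants under the same assumptions.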

The second criterion is "gHOLP", the grouped version of the High-dimensional Ordinary Least-squares Projector [HOLP, Wang and Leng (2015)]. It starts from $$\hat{\beta} = X^{T}(XX^{T})^{-1}y = (\beta_{11},\beta_{12},...,\beta_{1p_1},..., \beta_{j1},\beta_{j2},...,\beta_{jp_j},...,\beta_{J1},\beta_{J2},...,\beta_{Jp_J})$$ and then proceeds in the same way as "gSIS" to incorporate the group structure.
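The HOLP estimate above can be sketched in a few lines of base R. This example assumes more predictors than observations, so that $XX^{T}$ is invertible, and again uses the L1 norm for the group step; the data and names are made up for illustration and do not reproduce the package's implementation.

```r
# Illustrative sketch of gHOLP with the L1 norm (assumed), base R only.
set.seed(2)
n <- 20; p <- 3; J <- 10                 # p * J = 30 columns > n = 20 rows
group <- rep(1:J, each = p)
# X is left uncentered here: centering the columns would make X X^T singular
X <- matrix(rnorm(n * p * J), n, p * J)
y <- rnorm(n)

# beta = X^T (X X^T)^{-1} y; solve(A, y) avoids forming the inverse explicitly
beta <- drop(crossprod(X, solve(tcrossprod(X), y)))

# Group step, same as for gSIS: norm of the group's entries divided by p_j
B <- sapply(split(beta, group), function(v) sum(abs(v)) / length(v))
B
```

A quick sanity check on this sketch is that `X %*% beta` reproduces `y` up to rounding error, which follows directly from the formula.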

The third criterion is "gAR2", the groupwise adjusted R-squared. The basic idea is to fit a linear model for each group separately and compute the adjusted R-squared, which measures how strongly the group is related to the response. Note that in order to calculate the adjusted R-squared, the maximum group size $\max(p_j),j=1,...,J$ should not be larger than the sample size $n$.
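The per-group fitting described above might look like the following base R sketch, which fits one `lm` per group and extracts `adj.r.squared` from its summary. The simulated data, in which only group 1 carries signal, and the variable names are assumptions for illustration.

```r
# Illustrative sketch of gAR2: one linear model per group, base R only.
set.seed(3)
n <- 30; p <- 3; J <- 5
group <- rep(1:J, each = p)
X <- matrix(rnorm(n * p * J), n, p * J)
y <- drop(X[, 1:3] %*% c(2, -1, 1.5)) + rnorm(n)   # signal in group 1 only

# For each group, regress y on that group's columns and keep adjusted R^2
ar2 <- sapply(split(seq_len(ncol(X)), group), function(idx) {
  fit <- lm(y ~ X[, idx, drop = FALSE])
  summary(fit)$adj.r.squared
})
ar2
```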

The last criterion is "gDC", the grouped distance correlation. The distance correlation [Szekely, Rizzo and Bakirov (2007)] measures the dependence between two random variables or two random vectors. Thus, similar to "gAR2", we compute the distance correlation between each group and the response. It is worth pointing out that distance correlation captures nonlinear as well as linear relationships. However, it may take longer to compute because calculating the distance correlation involves three steps. Distance correlation has also been applied to screen individual variables, as in Li, Zhong and Zhu (2012).
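To make the idea concrete, here is a minimal base R sketch of the sample distance correlation, applied group by group. The helper `dcor_sketch` is a hypothetical name introduced for this example; it follows the usual sample definition (pairwise distances, double centering, normalized average product) and is not taken from the package.

```r
# Illustrative sketch of gDC, base R only. dcor_sketch is a hypothetical helper.
dcor_sketch <- function(x, y) {
  # Double-center a distance matrix: subtract row and column means, add grand mean
  center <- function(D) {
    sweep(sweep(D, 1, rowMeans(D)), 2, colMeans(D)) + mean(D)
  }
  A <- center(as.matrix(dist(x)))   # centered pairwise distances of x
  B <- center(as.matrix(dist(y)))   # centered pairwise distances of y
  # Sample distance covariance normalized by the two distance variances
  sqrt(mean(A * B) / sqrt(mean(A * A) * mean(B * B)))
}

set.seed(4)
n <- 50; p <- 2; J <- 4
group <- rep(1:J, each = p)
X <- matrix(rnorm(n * p * J), n, p * J)
y <- X[, 1]^2 + X[, 2]^2 + rnorm(n, sd = 0.1)   # nonlinear signal in group 1

# Distance correlation between each group's columns and the response
dc <- sapply(split(seq_len(ncol(X)), group),
             function(idx) dcor_sketch(X[, idx, drop = FALSE], y))
dc
```

Each call builds two n-by-n distance matrices, which is one reason this criterion can be slower than the others for large samples.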

When family = "binomial" or family = "poisson", a different screening criterion is used to measure the relationship between the response and the predictors in each group. Here Akaike's Information Criterion (AIC) is used, defined as $$AIC = -2 \times LogLikelihood + 2 \times npar,$$ where $LogLikelihood$ is the log-likelihood of a fitted generalized linear model, and $npar$ is the number of parameters in the fitted model. In this case, $npar$ is the number of variables within each group, i.e., $npar = p_j, j = 1,...,J$.
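The AIC formula above can be illustrated with a short base R sketch that fits one logistic regression per group. It computes the value directly from `logLik` with $npar = p_j$, as in the text, rather than calling `AIC()`, which would also count the intercept. The simulated data and names are assumptions for illustration only.

```r
# Illustrative sketch of the AIC-based criterion for family = "binomial".
set.seed(5)
n <- 100; p <- 2; J <- 4
group <- rep(1:J, each = p)
X <- matrix(rnorm(n * p * J), n, p * J)
eta <- drop(X[, 1:2] %*% c(1.5, -1))     # signal in group 1 only
y <- rbinom(n, 1, plogis(eta))

# One GLM per group; AIC = -2 * logLik + 2 * npar with npar = p_j
aic <- sapply(split(seq_len(ncol(X)), group), function(idx) {
  fit <- glm(y ~ X[, idx, drop = FALSE], family = binomial())
  -2 * as.numeric(logLik(fit)) + 2 * length(idx)
})
aic
```

In this sketch a smaller AIC indicates a better-fitting group model. Replacing `binomial()` with `poisson()` and simulating counts would give the analogous Poisson case.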

Note that the individual "SIS" and "HOLP" criteria can be regarded as special cases of "gSIS" and "gHOLP", respectively, in which each group has only one predictor.

References

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society B, 70, 849-911.

Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129-1139.

Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6), 2769-2794.

Wang, X. and Leng, C. (2015). High-dimensional Ordinary Least-squares Projector for screening variables. Journal of the Royal Statistical Society: Series B. To appear.

Examples

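library(MASS)  # provides mvrnorm()
set.seed(1)    # for reproducibility
n <- 30  # sample size
p <- 3   # number of predictors in each group
J <- 50  # number of groups
group <- rep(1:J, each = p)  # group index of each predictor
Sigma <- diag(p * J)         # identity covariance matrix
X <- mvrnorm(n, seq(0, 5, length.out = p * J), Sigma)
beta <- runif(12, -2, 5)     # nonzero coefficients (first 4 groups)
# response as a numeric vector: only the first 12 predictors are active
y <- drop(X %*% c(beta, rep(0, p * J - 12))) + rnorm(n)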

grp.criValues(X,y,group)  # gSIS
grp.criValues(X,y,group,"gHOLP")  # gHOLP
grp.criValues(X,y,group,"gAR2")   # gAR2
grp.criValues(X,y,group,"gDC")    # gDC