mht: Multiple testing procedure for non-ordered variable selection

Description

Performs multiple hypotheses testing for variable selection in a linear model.

Usage

mht(data,Y,var_nonselect,alpha,sigma,maxordre,ordre,m,show,IT,maxq)

Arguments

data

Input matrix of dimension n * p; each of the n rows is an observation vector of p variables. The intercept should be included in the first column as (1,...,1). If not, it is added.
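
The intercept convention above can be sketched in base R. The helper add_intercept is hypothetical and not part of the mht package; it only illustrates what "the intercept is added if missing" could look like, assuming the check is simply whether the first column is all ones.

```r
# Hypothetical helper (not from the mht package): make sure the first
# column of the design matrix is the intercept column (1,...,1).
add_intercept <- function(data) {
  data <- as.matrix(data)
  has_intercept <- ncol(data) > 0 && all(data[, 1] == 1)
  if (has_intercept) data else cbind(1, data)
}

x <- matrix(rnorm(100 * 20), 100, 20)
x1 <- add_intercept(x)
dim(x1)                  # 100 rows, 21 columns
dim(add_intercept(x1))   # unchanged: the intercept is already there
```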

Y

Response variable of length n.

var_nonselect

Number of variables that do not undergo feature selection. They must be in the first columns of data. Default is 1: the selection is not performed on the intercept.

alpha

A user-supplied sequence of type I error levels. Default is (0.1,0.05).

sigma

Value of the variance if it is known; 0 otherwise. Default is 0.

maxordre

Number of variables to be ordered. Default is min(n/2-1,p/2-1).

ordre

Algorithm used to order the variables, one of ordre=c("bolasso","pval","pval_hd","FR"). "bolasso" uses the dyadic algorithm with the Bolasso technique (see dyadiqueordre), "pval" uses the p-values obtained with a regression on the full set of variables (only when p

m

Number of bootstrap iterations of the Lasso. Only used if the algorithm is set to "bolasso". Default is m=100.

show

Vector of logical values, show=(showordre,showtest,showresult). Default is (1,0,1). If showordre==TRUE, show the ordered variables at each step of the algorithm. If showtest==TRUE, show the number of regularization parameters tested, to track the progress of the dyadic algorithm; only used if the algorithm is set to "bolasso". If showresult==TRUE, show the value of the statistics and the estimated quantile at each step of the procedure.

IT

Number of simulations for the calculation of the quantile. Default is 1000.

maxq

Maximum number of multiple hypotheses tests to perform. Default is log(min(n,p)-1,2).
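
As a quick check of the default formulas stated for maxordre and maxq, the following base R sketch evaluates them for n=100 and p=20. It assumes n and p are the dimensions of data as passed; whether the package rounds these values or counts an added intercept column in p is not stated here.

```r
n <- 100
p <- 20

# Default for maxordre as documented: min(n/2-1, p/2-1)
maxordre_default <- min(n / 2 - 1, p / 2 - 1)   # 9

# Default for maxq as documented: log(min(n,p)-1, 2), a base-2 logarithm
maxq_default <- log(min(n, p) - 1, 2)           # log2(19), about 4.25
```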

Value

data

A list containing:

Y - Input response vector.
means.X - Vector of means of the input data matrix.
sigma.X - Vector of variances of the input data matrix.

coefficients

Matrix of the estimated coefficients. Each row concerns a specific user level alpha.

residuals

Matrix of the residuals. Each row concerns a specific user level alpha.

relevant_var

Set of the relevant variables. Each row concerns a specific user level alpha.

fitted.values

Matrix of the fitted values. Each column concerns a specific user level alpha.

ordre

Order obtained on the maxordre variables.

ordrebeta

The full order on all the variables.

kchap

Vector containing the length of the estimated set of relevant variables, for each value of alpha.

quantile

The estimated quantiles used in the second step of the procedure.

call

The call that produced this object.
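
The components listed above can be read from the returned object with the usual $ accessor. This is a minimal sketch that assumes the mht package is installed (the call is skipped otherwise) and reuses the simulated data from the Examples section; the seed is an arbitrary choice.

```r
set.seed(42)
x <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(2, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)

# Assumes the mht package is installed; skipped otherwise
if (requireNamespace("mht", quietly = TRUE)) {
  mod <- mht::mht(x, y, alpha = c(0.1, 0.05), maxordre = 15)
  mod$kchap          # size of the selected set, one value per alpha
  mod$relevant_var   # selected variables, one row per alpha
  mod$coefficients   # estimated coefficients, one row per alpha
}
```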

Details

mht is a two-step procedure that performs variable selection in high-dimensional linear models. The first step orders the variables, taking into account the vector of observations Y. The second step finds a cut-off between the relevant variables (high rank) and the irrelevant ones (low rank) through multiple hypotheses testing. Choose maxordre with care: the more variables there are to order, the harder it is for the algorithm to distinguish which noisy variable is more important than another. It is advised to limit maxordre to p/2 or n/2 if they are large. The parameter maxq can be useful for large values of n; it is advised to limit it to 5-6 in order to reduce computation time (for the calculation of the quantile).
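
To make the "order, then cut off" idea concrete, here is a simplified base R sketch. It is not the algorithm implemented in mht: the ordering uses absolute marginal correlation as a stand-in for the ordering methods, and the cut-off uses ordinary nested F tests at a single level alpha instead of the package's simulated quantiles. All names in it are illustrative assumptions.

```r
set.seed(1)
n <- 100
p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, 15))
y <- drop(x %*% beta + rnorm(n))

# Step 1 (stand-in): order variables by absolute marginal correlation with y
ord <- order(abs(cor(x, y)), decreasing = TRUE)

# Step 2 (stand-in): add variables in that order and stop at the first
# one whose nested F test is not significant at level alpha
alpha <- 0.05
k <- 0
for (j in seq_along(ord)) {
  small <- if (k == 0) lm(y ~ 1) else lm(y ~ x[, ord[seq_len(k)], drop = FALSE])
  big <- lm(y ~ x[, ord[seq_len(j)], drop = FALSE])
  pval <- anova(small, big)[2, "Pr(>F)"]
  if (pval > alpha) break
  k <- j
}

relevant <- sort(ord[seq_len(k)])   # variables ranked above the cut-off
relevant
```

Here k plays the role that kchap plays in the returned object: the number of variables kept before the cut-off.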

References

Multiple hypotheses testing for variable selection; F. Rohart 2011

Examples

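## Not run: 
library(mht)

# Simulate 100 observations of 20 variables; only the first 5 are relevant
x <- matrix(rnorm(100*20), 100, 20)
beta <- c(rep(2,5), rep(0,15))
y <- x %*% beta + rnorm(100)

# Order at most 15 variables, then test at levels 0.1 and 0.05
mod <- mht(x, y, alpha=c(0.1,0.05), maxordre=15)
mod
## End(Not run)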
