
ISBF (version 0.2.1)

isbfReg: Iterative Selection of Blocks of Features in Regression Estimation

Description

isbfReg performs regression estimation in the model Y = Xb + e, where the unknown parameter b is sparse, or sparse and constant by blocks. Y is a vector of size n, X an (n,p) matrix, b a vector of size p, and e is the noise vector.

When b is sparse, one can simply use isbfReg(X,Y); the method used is the Iterative Feature Selection procedure of Alquier (2008). When b is sparse and constant by blocks, one can use isbfReg(X,Y,K=...), where K is the expected maximal size of a block; the method used is the Iterative Selection of Blocks of Features (ISBF) procedure of Alquier (2010). Of course, one can always set K=p, but be careful: the computation time and the memory used are directly proportional to p*K.
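
As an illustration of the two calling modes (a minimal sketch; X, Y and the block length 10 are placeholders, not values taken from this page):

fit_sparse = isbfReg(X, Y)          # sparse b: Iterative Feature Selection (Alquier, 2008)
fit_blocks = isbfReg(X, Y, K = 10)  # block-sparse b: ISBF (Alquier, 2010), blocks of length at most 10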

Usage

isbfReg(X, Y, epsilon = 0.05, K = 1, impmin = 1/100, favgroups = 0, centX = TRUE, centY = TRUE, s = NULL, v = NULL)

Arguments

X
The data: the matrix of inputs. Size (n,p).
Y
The data: the vector of outputs. Size n.
epsilon
The confidence level. The theoretical guarantee in Alquier (2010) is that each iteration of the ISBF procedure gets closer to the true parameter b with probability at least 1-epsilon. When epsilon is very small, the procedure becomes very conservative; when epsilon is too large, there is a risk of overfitting. If not specified, epsilon = 0.05 (i.e. 5%).
K
The maximal length of the blocks checked in the iterations. If not specified, K=1; this means the procedure seeks a sparse (not constant by blocks) parameter b, as in Alquier (2008). One should take a larger K if b is really expected to be constant by blocks. If p is quite small (up to 1000), K=p is a reasonable choice. For larger values of p, keep in mind that the computation time and the memory used are directly proportional to p*K.
impmin
Criterion for the end of the iterations. When no iteration can provide an improvement of Xb larger than impmin, the algorithm stops. If not specified, impmin=1/100.
favgroups
In case of noisy input data, one may want to favor larger groups in order to stabilize the estimation. By default, this value is 0; take it larger for noisy input data.
centX
If TRUE, the function centers the variables in X before processing.
centY
If TRUE, the function centers the variable Y before processing.
s
The threshold used in the iterations. If not specified, the theoretical value of Alquier (2010) is used: s = sqrt(2*v*log(p*K/epsilon)); see the sketch after this argument list.
v
The variance of e, if it is known. If not specified, this parameter is VERY roughly estimated by var(Y)/2.
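
The default choices of v and s can be reproduced directly from the formulas above; a minimal sketch, assuming X, Y, K and epsilon are already defined:

p = ncol(X)
v = var(as.vector(Y)) / 2                # rough default estimate of the noise variance
s = sqrt(2 * v * log(p * K / epsilon))   # default threshold of Alquier (2010)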

Value

A list with the following components:

beta
The estimated parameter b.
s
The value of s.
impmin
The value of impmin.
K
The value of K.

References

P. Alquier, An Algorithm for Iterative Selection of Blocks of Features, Proceedings of ALT'10, 2010, M. Hutter, F. Stephan, V. Vovk and T. Zeugmann Eds., Lecture Notes in Artificial Intelligence, pp. 35-49, Springer.

P. Alquier, Iterative Feature Selection in Least Square Regression Estimation, Annales de l'IHP, B (Proba. Stat.), 2008, vol. 44, no. 1, pp. 47-88.

Examples

# generating data: n = 50 observations, p = 100 features
X = matrix(data = rnorm(5000), nrow = 50, ncol = 100)
# true parameter b: sparse and constant by blocks (one block of -3, of length 30)
b = c(rep(0, 50), rep(-3, 30), rep(0, 20))
# Gaussian noise with standard deviation 0.3
e = rnorm(50, 0, 0.3)
Y = X %*% b + e

# call of isbfReg (v is the noise variance: 0.3^2 = 0.09)
A = isbfReg(X, Y, K = 100, v = 0.09)

# visualization of the results
plot(b)
lines(A$beta,col="red")
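
# one may also inspect the other returned components
A$s   # the threshold actually used in the iterations
A$K   # the maximal block length

# illustrative extension: compare with a purely sparse fit (K = 1, as in Alquier 2008)
B = isbfReg(X, Y, v = 0.09)
lines(B$beta, col = "blue")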
