sfpl.kNN.fit: SFPLM regularised fit using kNN estimation

Description

This function fits a sparse semi-functional partial linear model (SFPLM). It employs a penalised least-squares regularisation procedure, integrated with nonparametric kNN estimation using Nadaraya-Watson weights.

The procedure uses an objective criterion (criterion) to select both the number of nearest neighbours (k.opt) and the regularisation parameter (lambda.opt).

Usage

sfpl.kNN.fit(x, z, y, semimetric = "deriv", q = NULL, knearest = NULL,
min.knn = 2, max.knn = NULL, step = NULL, range.grid = NULL, 
kind.of.kernel = "quad", nknot = NULL, lambda.min = NULL, lambda.min.h = NULL, 
lambda.min.l = NULL, factor.pn = 1, nlambda = 100, lambda.seq = NULL, 
vn = ncol(z), nfolds = 10, seed = 123, criterion = "GCV", penalty = "grSCAD", 
max.iter = 1000)

Value

call: The matched call.
fitted.values: Estimated scalar response.
residuals: Differences between y and fitted.values.
beta.est: Estimate of $\beta_0$ when the optimal tuning parameters lambda.opt, k.opt and vn.opt are used.
indexes.beta.nonnull: Indexes of the non-zero $\hat{\beta}_{j}$.
k.opt: Selected number of nearest neighbours.
lambda.opt: Selected value of lambda.
IC: Value of the criterion function used to select lambda.opt, k.opt and vn.opt.
vn.opt: Selected value of vn.
...

Arguments

x: Matrix containing the observations of the functional covariate (functional nonparametric component), collected by row.
z: Matrix containing the observations of the scalar covariates (linear component), collected by row.
y: Vector containing the scalar response.
semimetric: Semi-metric function. Only "deriv" and "pca" are implemented. By default semimetric="deriv".
q: Order of the derivative (if semimetric="deriv") or number of principal components (if semimetric="pca"). The default values are 0 and 2, respectively.
knearest: Vector of positive integers containing the candidate values from which the number of nearest neighbours k.opt is selected. If knearest=NULL, then knearest <- seq(from = min.knn, to = max.knn, by = step).
min.knn: A positive integer that represents the minimum value in the sequence for selecting the number of nearest neighbours k.opt. This value should be less than the sample size. The default is 2.
max.knn: A positive integer that represents the maximum value in the sequence for selecting the number of nearest neighbours k.opt. This value should be less than the sample size. The default is max.knn <- n%/%5.
step: A positive integer used to construct the sequence of k-nearest neighbours as follows: min.knn, min.knn + step, min.knn + 2*step, min.knn + 3*step,.... The default value for step is step<-ceiling(n/100).
range.grid: Vector of length 2 containing the endpoints of the grid at which the observations of the functional covariate x are evaluated (i.e. the range of the discretisation). If range.grid=NULL, then range.grid=c(1,p) is considered, where p is the discretisation size of x (i.e. ncol(x)).
kind.of.kernel: The type of kernel function used. Currently, only the Epanechnikov kernel ("quad") is available.
nknot: Positive integer indicating the number of interior knots for the B-spline expansion of the functional covariate. The default value is (p - order.Bspline - 1)%/%2.
lambda.min: The smallest value for lambda (i.e. the lower endpoint of the sequence in which lambda.opt is selected), as a fraction of lambda.max. The default is lambda.min.l if the sample size is larger than factor.pn times the number of linear covariates and lambda.min.h otherwise.
lambda.min.h: The lower endpoint of the sequence in which lambda.opt is selected if the sample size is smaller than factor.pn times the number of linear covariates. The default is 0.05.
lambda.min.l: The lower endpoint of the sequence in which lambda.opt is selected if the sample size is larger than factor.pn times the number of linear covariates. The default is 0.0001.
factor.pn: Positive integer used to set lambda.min. The default value is 1.
nlambda: Positive integer indicating the number of values in the sequence from which lambda.opt is selected. The default is 100.
lambda.seq: Sequence of values in which lambda.opt is selected. If lambda.seq=NULL, then the programme builds the sequence automatically using lambda.min and nlambda.
vn: Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is vn=ncol(z), resulting in the individual penalisation of each scalar covariate.
nfolds: Number of cross-validation folds (used when criterion="k-fold-CV"). Default is 10.
seed: Seed for the random number generator, set to ensure reproducible results (used when criterion="k-fold-CV"). The default is 123.
criterion: The criterion used to select the tuning and regularisation parameters: k.opt and lambda.opt (also vn.opt if needed). Options include "GCV", "BIC", "AIC", or "k-fold-CV". The default setting is "GCV".
penalty: The penalty function applied in the penalised least-squares procedure. Currently, only "grLasso" and "grSCAD" are implemented. The default is "grSCAD".
max.iter: Maximum number of iterations allowed across the entire path. The default value is 1000.
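
The documented defaults for the nearest-neighbour grid and for lambda.min can be reproduced by hand. The following is a minimal sketch in base R, assuming a sample of n = 160 observations and 7 scalar covariates (the sizes used in the Examples section). The object names n and pn are illustrative, and this is not the package's internal code.

```r
# Illustrative only: mimics the documented defaults, not the package internals.
n  <- 160   # assumed sample size
pn <- 7     # assumed number of scalar covariates, i.e. ncol(z)

# Default sequence of candidate numbers of nearest neighbours.
min.knn  <- 2
max.knn  <- n %/% 5           # integer division
step     <- ceiling(n / 100)
knearest <- seq(from = min.knn, to = max.knn, by = step)

# Default lower endpoint of the lambda sequence (as a fraction of lambda.max).
factor.pn    <- 1
lambda.min.h <- 0.05
lambda.min.l <- 0.0001
lambda.min   <- if (n > factor.pn * pn) lambda.min.l else lambda.min.h

knearest
lambda.min
```

Passing knearest or lambda.min explicitly overrides these defaults.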

Author

German Aneiros Perez german.aneiros@udc.es

Silvia Novo Diaz snovo@est-econ.uc3m.es

Details

The sparse semi-functional partial linear model (SFPLM) is given by the expression: $$ Y_i = Z_{i1}\beta_{01} + \dots + Z_{ip_n}\beta_{0p_n} + m(X_i) + \varepsilon_i,\ \ \ i = 1, \dots, n, $$ where $Y_i$ denotes a scalar response, $Z_{i1}, \dots, Z_{ip_n}$ are real random covariates, and $X_i$ is a functional random covariate valued in a semi-metric space $\mathcal{H}$. In this equation, $\mathbf{\beta}_0 = (\beta_{01}, \dots, \beta_{0p_n})^{\top}$ and $m(\cdot)$ represent a vector of unknown real parameters and an unknown smooth real-valued function, respectively. Additionally, $\varepsilon_i$ is the random error.

In this function, the SFPLM is fitted using a penalised least-squares approach. The approach involves transforming the SFPLM into a linear model by extracting from $Y_i$ and $Z_{ij}$ ($j = 1, \ldots, p_n$) the effect of the functional covariate $X_i$ using functional nonparametric regression (for details, see Ferraty and Vieu, 2006). This transformation is achieved using kNN estimation with Nadaraya-Watson weights.
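
As a rough illustration of this transformation, the sketch below builds kNN Nadaraya-Watson weights from a matrix of pairwise distances and uses them to remove the effect of the functional covariate from the response and the scalar covariates. It is a simplified sketch, not the package's implementation. The helper names (epan, knn_weights), the plain Euclidean distance between discretised curves (used here in place of the "deriv" or "pca" semi-metrics), the local bandwidth taken as the distance to the k-th nearest neighbour, and the simulated data are all assumptions made for the example.

```r
# Kernel supported on [0, 1]; the normalising constant is omitted
# because it cancels in the Nadaraya-Watson ratio.
epan <- function(u) ifelse(u >= 0 & u <= 1, 1 - u^2, 0)

# Row-normalised kNN weights from an n x n matrix D of pairwise distances.
knn_weights <- function(D, k) {
  n <- nrow(D)
  W <- matrix(0, n, n)
  for (i in seq_len(n)) {
    # Local bandwidth: distance from curve i to its k-th nearest neighbour
    # (position k + 1 after sorting, since D[i, i] = 0 comes first).
    h <- sort(D[i, ])[k + 1]
    w <- epan(D[i, ] / h)
    W[i, ] <- w / sum(w)
  }
  W
}

# Simulated data (assumed for illustration): one curve per row of X.
set.seed(1)
n    <- 50
grid <- seq(0, 1, length.out = 30)
a    <- runif(n)
X    <- t(sapply(a, function(ai) sin(2 * pi * ai * grid)))
Z    <- cbind(rnorm(n), rnorm(n))
y    <- drop(Z %*% c(1, 0)) + a^2 + rnorm(n, sd = 0.1)

D <- as.matrix(dist(X))      # Euclidean distance between discretised curves
W <- knn_weights(D, k = 5)

# Transformed response and covariates: the smoothed parts are subtracted.
y_tilde <- y - drop(W %*% y)
Z_tilde <- Z - W %*% Z
```

Each row of W sums to one, so W %*% y is a local weighted average of the responses whose curves lie closest to the i-th curve.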

An approximate linear model is then obtained: $$\widetilde{\mathbf{Y}}\approx\widetilde{\mathbf{Z}}\mathbf{\beta}_0+\mathbf{\varepsilon},$$ and the penalised least-squares procedure is applied to this model by minimising $$ \mathcal{Q}\left(\mathbf{\beta}\right)=\frac{1}{2}\left(\widetilde{\mathbf{Y}}-\widetilde{\mathbf{Z}}\mathbf{\beta}\right)^{\top}\left(\widetilde{\mathbf{Y}}-\widetilde{\mathbf{Z}}\mathbf{\beta}\right)+n\sum_{j=1}^{p_n}\mathcal{P}_{\lambda_{j_n}}\left(|\beta_j|\right), \quad (1) $$ where $\mathbf{\beta} = (\beta_1, \ldots, \beta_{p_n})^{\top}, \ \mathcal{P}_{\lambda_{j_n}}(\cdot)$ is a penalty function (specified in the argument penalty) and $\lambda_{j_n} > 0$ is a tuning parameter. To reduce the number of tuning parameters $\lambda_{j_n}$ to be selected for each sample, we consider $\lambda_{j_n} = \lambda \widehat{\sigma}_{\beta_{0,j,OLS}}$, where $\beta_{0,j,OLS}$ denotes the OLS estimate of $\beta_{0j}$ and $\widehat{\sigma}_{\beta_{0,j,OLS}}$ is its estimated standard deviation. Both $\lambda$ and $k$ (in the kNN estimation) are selected using the objective criterion specified in the argument criterion.
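
To make criterion (1) concrete, the sketch below evaluates it for a given coefficient vector using the SCAD penalty with the conventional shape parameter a = 3.7. It is a plain, ungrouped illustration under those assumptions, whereas the penalty argument of the function offers the group versions "grLasso" and "grSCAD". The function names (scad, Q) and the toy data are hypothetical and are not part of the package.

```r
# SCAD penalty evaluated at t >= 0 (a = 3.7 is an assumed, conventional value).
scad <- function(t, lambda, a = 3.7) {
  ifelse(t <= lambda,
         lambda * t,
         ifelse(t <= a * lambda,
                (2 * a * lambda * t - t^2 - lambda^2) / (2 * (a - 1)),
                lambda^2 * (a + 1) / 2))
}

# Objective (1): half the residual sum of squares plus n times the penalty.
# lambda_j may be a single value or one value per coefficient.
Q <- function(beta, Y_tilde, Z_tilde, lambda_j) {
  res <- Y_tilde - drop(Z_tilde %*% beta)
  0.5 * sum(res^2) + length(Y_tilde) * sum(scad(abs(beta), lambda_j))
}

# Toy transformed data (assumed): only the first covariate is relevant.
set.seed(1)
Z_tilde <- matrix(rnorm(40), 20, 2)
Y_tilde <- 2 * Z_tilde[, 1] + rnorm(20, sd = 0.1)

Q(c(2, 0), Y_tilde, Z_tilde, lambda_j = 0.1)   # near the true coefficients
Q(c(0, 0), Y_tilde, Z_tilde, lambda_j = 0.1)   # all coefficients set to zero
```

The first value should be much smaller than the second, since the sparse vector close to the truth fits the data well and pays a penalty only on its single non-zero coefficient.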

Finally, after estimating $\mathbf{\beta}_0$ by minimising (1), we address the estimation of the nonlinear function $m(\cdot)$. For this, we again employ the kNN procedure with Nadaraya-Watson weights to smooth the partial residuals $Y_i - \mathbf{Z}_i^{\top}\widehat{\mathbf{\beta}}$.

For further details on the estimation procedure of the sparse SFPLM, see Aneiros et al. (2015).

Remark: Setting lambda.seq to $0$ yields the non-penalised estimation of the model, i.e. the OLS estimation. Using lambda.seq with a value $\neq 0$ is advisable when the presence of irrelevant variables is suspected.

References

Aneiros, G., Ferraty, F., Vieu, P. (2015) Variable selection in partial linear regression with functional covariate. Statistics, 49, 1322-1347, https://doi.org/10.1080/02331888.2014.998675.

Examples

library(fsemipar)  #Package assumed to provide sfpl.kNN.fit and the Tecator data.
data("Tecator")
y<-Tecator$fat
X<-Tecator$absor.spectra
z1<-Tecator$protein       
z2<-Tecator$moisture

#Quadratic, cubic and interaction effects of the scalar covariates.
z.com<-cbind(z1,z2,z1^2,z2^2,z1^3,z2^3,z1*z2)
train<-1:160

#SFPLM fit. 
ptm <- proc.time()
fit <- sfpl.kNN.fit(y = y[train], x = X[train, ], z = z.com[train, ], q = 2,
  max.knn = 20, lambda.min.l = 0.01, criterion = "BIC",
  range.grid = c(850, 1050), nknot = 20, max.iter = 5000)
proc.time()-ptm

#Results
fit
names(fit)
