Fits an Auxiliary Linear Variance Model (ALVM) to estimate the error variances of a heteroskedastic linear regression model.
Usage

alvm.fit(
mainlm,
M = NULL,
model = c("cluster", "spline", "linear", "polynomial", "basic", "homoskedastic"),
varselect = c("none", "hettest", "cv.linear", "cv.cluster", "qgcv.linear",
"qgcv.cluster"),
lambda = c("foldcv", "qgcv"),
nclust = c("elbow.swd", "elbow.mwd", "elbow.both", "foldcv"),
clustering = NULL,
polypen = c("L2", "L1"),
d = 2L,
solver = c("auto", "quadprog", "quadprogXT", "roi", "osqp"),
tsk = NULL,
tsm = NULL,
constol = 1e-10,
cvoption = c("testsetols", "partitionres"),
nfolds = 5L,
reduce2homosked = TRUE,
...
)
Value

An object of class "alvm.fit", containing the following:

coef.est: a vector of parameter estimates, \(\hat{\gamma}\)

var.est: a vector of estimates \(\hat{\omega}\) of the error variances for all observations

method: a character corresponding to the model argument

ols: the lm object corresponding to the original linear regression model

fitinfo: a list containing four named objects: Msq (the elementwise square of the annihilator matrix \(M\)), L (the linear predictor matrix \(L\)), clustering (a list object with results of the clustering procedure), and gam.object, an object of class "gam" (see gamObject). The last two are set to NA unless the clustering ALVM or the thin-plate spline ALVM is used, respectively

hyperpar: a named list of hyperparameter values, lambda, nclust, tsk, and d, and tuning methods, lambdamethod and nclustmethod. Values corresponding to unused hyperparameters are set to NA

selectinfo: a list containing two named objects, varselect (the value of the eponymous argument) and selectedcols (a numeric vector with column indices of \(X\) that were selected, with 1 denoting the intercept column)

pentype: a character corresponding to the polypen argument

solver: a character corresponding to the solver argument (or specifying the QP solver actually used, if solver was set to "auto")

constol: a double corresponding to the constol argument
Arguments

mainlm: Either an object of class "lm" (e.g., generated by lm), or a list of two objects: a response vector and a design matrix. The objects are assumed to be in that order unless they are given the names "X" and "y" to distinguish them. A design matrix passed in a list must begin with a column of ones if an intercept is to be included in the linear model, and should not contain factors, as all columns are treated 'as is'. For tests that use ordinary least squares residuals, one can also pass a vector of residuals in the list, which should either be the third object or be named "e". (A sketch of the list interface follows.)
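For instance, the list interface might be used as follows. This is a minimal sketch; it assumes the skedastic package (to which alvm.fit belongs) is attached, as do the later sketches on this page.

library(skedastic)
# Pass the response and design matrix directly instead of an "lm" object;
# the names "y" and "X" make the ordering unambiguous
X <- cbind(1, mtcars$wt, mtcars$qsec, mtcars$am)  # intercept column first
y <- mtcars$mpg
fit <- alvm.fit(mainlm = list("y" = y, "X" = X), model = "linear")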
M: An \(n\times n\) annihilator matrix. If NULL (the default), this will be calculated from the mainlm object.
model: A character corresponding to the type of ALVM to be fitted: "cluster" for the clustering ALVM, "spline" for the thin-plate spline ALVM, "linear" for the linear ALVM, "polynomial" for the penalised polynomial ALVM, "basic" for the basic or naive ALVM, and "homoskedastic" for the homoskedastic error variance estimator, \(e'e/(n-p)\).
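The "homoskedastic" option is simply the usual OLS residual variance estimate; a quick sketch verifying the formula:

mtcars_lm <- lm(mpg ~ wt + qsec + am, data = mtcars)
e <- residuals(mtcars_lm)
n <- nrow(model.matrix(mtcars_lm))
p <- ncol(model.matrix(mtcars_lm))
sum(e^2) / (n - p)            # e'e/(n - p)
summary(mtcars_lm)$sigma^2    # the same quantity, via summary.lm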
varselect: Either a character indicating how variable selection should be conducted, or an integer vector giving indices of columns of the predictor matrix (model.matrix of mainlm) to select. The vector must include 1L for the intercept to be selected. If a character, it must be one of the following (a usage sketch follows this list):

"none": No variable selection is conducted

"hettest": Variable selection is conducted by applying a heteroskedasticity test with each feature in turn serving as the 'deflator' variable

"cv.linear": Variable selection is conducted by best subset selection on the auxiliary linear variance model (linear specification), using squared-error loss computed under \(K\)-fold cross-validation

"cv.cluster": Variable selection is conducted by best subset selection on the auxiliary linear variance model (clustering specification), using squared-error loss computed under \(K\)-fold cross-validation

"qgcv.linear": Variable selection is conducted by best subset selection on the auxiliary linear variance model (linear specification), using squared-error loss computed under quasi-generalised cross-validation

"qgcv.cluster": Variable selection is conducted by best subset selection on the auxiliary linear variance model (clustering specification), using squared-error loss computed under quasi-generalised cross-validation
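As illustrations (assuming mtcars_lm from the sketch above), variable selection can be requested by method name or by explicit column indices:

# Best subset selection under quasi-generalised cross-validation
fit1 <- alvm.fit(mtcars_lm, model = "linear", varselect = "qgcv.linear")
# Explicit selection of the intercept and the first predictor column
fit2 <- alvm.fit(mtcars_lm, model = "linear", varselect = c(1L, 2L))
fit2$selectinfo$selectedcols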
lambda: Either a double of length 1 indicating the value of the penalty hyperparameter \(\lambda\), or a character specifying the tuning method for choosing \(\lambda\): "foldcv" for \(K\)-fold cross-validation (the default) or "qgcv" for quasi-generalised cross-validation. This argument is ignored if model is neither "polynomial" nor "spline".
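For example (a sketch; the fixed value 0.1 is arbitrary):

# Penalised polynomial ALVM with lambda fixed vs. tuned
fitA <- alvm.fit(mtcars_lm, model = "polynomial", lambda = 0.1)
fitB <- alvm.fit(mtcars_lm, model = "polynomial", lambda = "qgcv")
fitB$hyperpar$lambda  # value selected by quasi-generalised cross-validation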
nclust: Either an integer of length 1 indicating the number of clusters \(n_c\), or a character specifying the tuning method for choosing \(n_c\): "elbow.swd" for the elbow method using a sum of within-cluster distances criterion, "elbow.mwd" for the elbow method using a maximum within-cluster distances criterion, "elbow.both" for the rounded average of the results of the previous two, and "foldcv" for \(K\)-fold cross-validation. This argument is ignored if model is not "cluster".
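For example (a sketch):

# Clustering ALVM with n_c fixed at 3 vs. tuned by the elbow method
fitC <- alvm.fit(mtcars_lm, model = "cluster", nclust = 3L)
fitD <- alvm.fit(mtcars_lm, model = "cluster", nclust = "elbow.both")
fitD$hyperpar$nclust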
clustering: A list object of class "doclust". If set to NULL (the default), such an object is generated; ignored if model is not "cluster".
A character, either "L2"
or "L1"
, indicating
whether an \(L_2\) norm penalty (ridge regression) or
\(L_1\) norm penalty (LASSO) should be used with the polynomial model.
This argument is ignored if model
is not "polynomial"
.
d: An integer specifying the degree of polynomial to use in the penalised polynomial ALVM; defaults to 2L. Ignored if model is other than "polynomial". Setting d to 1L is not identical to setting model to "linear", because the linear ALVM does not have a penalty term in the objective function.
solver: A character indicating which Quadratic Programming solver function to use to estimate \(\gamma\). The options are "quadprog", corresponding to solve.QP.compact from the quadprog package; "quadprogXT", corresponding to buildQP from the quadprogXT package; "roi", corresponding to the qpoases solver implemented in ROI_solve from the ROI package with the ROI.plugin.qpoases add-on; and "osqp", corresponding to solve_osqp from the osqp package. Alternatively, the user can specify "auto" (the default), in which case the function will select the solver that seems to work best for the chosen model.
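For example, to override the automatic choice (a sketch):

fitE <- alvm.fit(mtcars_lm, model = "linear", solver = "quadprog")
fitE$solver  # records the QP solver actually used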
tsk: An integer corresponding to the basis dimension k to be passed to the s function of the mgcv package for fitting of a thin-plate spline ALVM; see choose.k in mgcv for more details about choosing the parameter and its default. Ignored if model is not "spline".
tsm: An integer corresponding to the order m of the penalty to be passed to the s function of the mgcv package for fitting of a thin-plate spline ALVM. If left as the default (NULL), it is set to 2, corresponding to second-derivative penalties for a cubic spline.
constol: A double corresponding to the boundary value for the constraint on error variances. The error variances must of course be non-negative, but setting the constraint boundary to 0 can produce zero estimates, which in turn yield infinite weights in Feasible Weighted Least Squares. The boundary value should thus be positive, but small enough not to bias estimation of very small variances. Defaults to 1e-10.
A character, either "testsetols"
or
"partitionres"
, indicating how to obtain the observed response
values for each test fold when performing \(K\)-fold cross-validation
on an ALVM. The default technique, "testsetols"
, entails fitting
a linear regression model to the test fold of observations from the
original response vector \(y\) and predictor matrix \(X\). The
squared residuals from this regression are the observed
responses that are predicted from the trained model to compute the
cross-validated squared error loss function. Under the other technique,
"partitionres"
, the squared residuals from the full
linear regression model are partitioned into training and test folds and
the squared residuals in the test fold are the observed responses that
are predicted for computation of the cross-validated loss.
nfolds: An integer specifying the number of folds \(K\) to use for cross-validation, if the \(\lambda\) and/or \(n_c\) hyperparameters are to be tuned using cross-validation. Defaults to 5L. One must ensure that each test fold contains at least \(p+1\) observations if the "testsetols" technique is used with cross-validation, so that there are enough degrees of freedom to fit a linear model to the test fold.
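For instance, combining these two options (a sketch):

# Tune n_c by 10-fold CV, using partitioned residuals for the test folds
fitF <- alvm.fit(mtcars_lm, model = "cluster", nclust = "foldcv",
                 nfolds = 10L, cvoption = "partitionres")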
reduce2homosked: A logical indicating whether the homoskedastic error variance estimator \(e'e/(n-p)\) should be used if the variable selection procedure does not select any variables. Defaults to TRUE.
...: Other arguments that can be passed to (non-exported) helper functions, namely (a sketch of passing these through ... follows this list):

greedy, a logical passed to the functions implementing best subset selection, indicating whether or not to use a greedy search rather than an exhaustive search for the best subset. Defaults to FALSE, but is coerced to TRUE unconditionally if \(p>9\).

distmetric, a character specifying the distance metric to use in computing distances for the clustering algorithm. Corresponds to the method argument of dist and defaults to "euclidean".

linkage, a character specifying the linkage rule to use in agglomerative hierarchical clustering. Corresponds to the method argument of hclust and defaults to "complete".

nclust2search, an integer vector specifying the values of \(n_c\) to try when tuning \(n_c\) by cross-validation. Defaults to 1L:20L.

alpha, a double specifying the significance level threshold to use when applying a heteroskedasticity test for the purpose of feature selection in an ALVM; defaults to 0.1.

testname, a character corresponding to the name of a function that performs a heteroskedasticity test. The function must either be one that takes a deflator argument or breusch_pagan. Defaults to evans_king.
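For example, the test and significance level used by varselect = "hettest" can be passed through ... (a sketch; breusch_pagan is one of the documented options):

fitG <- alvm.fit(mtcars_lm, model = "linear", varselect = "hettest",
                 testname = "breusch_pagan", alpha = 0.05)
fitG$selectinfo$selectedcols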
Details

The ALVM model equation is $$e \circ e = (M \circ M) L \gamma + u,$$ where \(e\) is the Ordinary Least Squares residual vector, \(M\) is the annihilator matrix \(M = I - X(X'X)^{-1}X'\), \(L\) is a linear predictor matrix, \(u\) is a random error vector, \(\gamma\) is a \(p\)-vector of unknown parameters, and \(\circ\) denotes the Hadamard (elementwise) product. The construction of \(L\) depends on the method used to model or estimate the assumed heteroskedastic function \(g(\cdot)\), a continuous, differentiable function that is linear in \(\gamma\) and by which the error variances \(\omega_i\) of the main linear model are related to the predictors \(X_{i\cdot}\). This method has been developed as part of the author's doctoral research project.
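To make the notation concrete, the building blocks of the model equation can be computed directly from an lm fit. This is an illustrative sketch only: the identity matrix used for \(L\) is an assumption made here for simplicity, not necessarily the construction any particular ALVM uses.

mtcars_lm <- lm(mpg ~ wt + qsec + am, data = mtcars)
X <- model.matrix(mtcars_lm)
n <- nrow(X)
M <- diag(n) - X %*% solve(crossprod(X)) %*% t(X)  # annihilator matrix
esq <- residuals(mtcars_lm)^2   # response e o e: squared OLS residuals
Lmat <- diag(n)                 # assumed L, for illustration only
D <- (M * M) %*% Lmat           # ALVM design matrix (M o M) L
# gamma is then estimated by quadratic programming, subject to the
# constraint that the fitted variances stay above constol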
Depending on the model used, the estimation method could be Inequality-Constrained Least Squares or Inequality-Constrained Ridge Regression. However, these are both special cases of Quadratic Programming. Therefore, all of the models are fitted using Quadratic Programming.
Several techniques are available for feature selection within the model. The LASSO-type model handles feature selection via a shrinkage penalty; for this reason, if the user calls the polynomial model with the \(L_1\)-norm penalty, it is not necessary to specify a variable selection method, since this is handled automatically. Another feature selection technique is to use a heteroskedasticity test that tests for heteroskedasticity linked to a particular predictor variable (the 'deflator'). This test can be conducted with each feature in turn serving as the deflator; those features for which the null hypothesis of homoskedasticity is rejected at a specified significance level alpha are selected. A third feature selection technique is best subset selection, where the model is fitted with all possible subsets of features. The models are scored in terms of some metric, and the best-performing subset of features is selected. The metric could be squared-error loss computed under \(K\)-fold cross-validation or under quasi-generalised cross-validation. (The 'quasi-' prefix refers to the fact that generalised cross-validation is, properly speaking, only applicable to a linear fitting method, as defined by Hastie et al. (2009); ALVMs are not linear fitting methods due to the inequality constraint.) Since best subset selection requires fitting \(2^{p-1}\) models (where \(p-1\) is the number of candidate features), it is infeasible for large \(p\). A greedy search technique can therefore be used as an alternative, where one begins with a null model and adds the feature that leads to the best improvement in the metric, stopping when no new feature leads to an improvement.
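A greedy best subset search can be requested through the greedy argument described above (a sketch):

fitH <- alvm.fit(mtcars_lm, model = "linear", varselect = "cv.linear",
                 greedy = TRUE)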
The polynomial and thin-plate spline ALVMs have a penalty hyperparameter \(\lambda\) that must either be specified or tuned. \(K\)-fold cross-validation or quasi-generalised cross-validation can be used for tuning. The clustering ALVM has a hyperparameter \(n_c\), the number of clusters into which to group the observations (where error variances are assumed to be equal within each cluster). \(n_c\) can be specified or tuned. The available tuning methods are an elbow method (using a sum of within-cluster distances criterion, a maximum within-cluster distance criterion, or a combination of the two) and \(K\)-fold cross-validation.
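For instance (a sketch; nclust2search is the pass-through argument documented above):

# Tune lambda for the spline ALVM by quasi-generalised CV, and tune n_c
# for the clustering ALVM by K-fold CV over a restricted search range
fitS <- alvm.fit(mtcars_lm, model = "spline", lambda = "qgcv")
fitT <- alvm.fit(mtcars_lm, model = "cluster", nclust = "foldcv",
                 nclust2search = 1L:10L)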
See Also

anlvm.fit, avm.ci
Examples

mtcars_lm <- lm(mpg ~ wt + qsec + am, data = mtcars)
myalvm <- alvm.fit(mtcars_lm, model = "cluster")
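The components documented under Value can then be inspected, e.g.:

myalvm$var.est   # estimated error variances, one per observation
myalvm$coef.est  # estimated gamma
myalvm$hyperpar$nclust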