(Robustly) sequence groups of candidate predictors according to their predictive content and find the optimal model along the sequence.
grplars(x, ...)
# S3 method for formula
grplars(formula, data, ...)
# S3 method for data.frame
grplars(x, y, ...)
# S3 method for default
grplars(
x,
y,
sMax = NA,
assign,
fit = TRUE,
s = c(0, sMax),
crit = c("BIC", "PE"),
splits = foldControl(),
cost = rmspe,
costArgs = list(),
selectBest = c("hastie", "min"),
seFactor = 1,
ncores = 1,
cl = NULL,
seed = NULL,
model = TRUE,
...
)
rgrplars(x, ...)
# S3 method for formula
rgrplars(formula, data, ...)
# S3 method for data.frame
rgrplars(x, y, ...)
# S3 method for default
rgrplars(
x,
y,
sMax = NA,
assign,
centerFun = median,
scaleFun = mad,
regFun = lmrob,
regArgs = list(),
combine = c("min", "euclidean", "mahalanobis"),
const = 2,
prob = 0.95,
fit = TRUE,
s = c(0, sMax),
crit = c("BIC", "PE"),
splits = foldControl(),
cost = rtmspe,
costArgs = list(),
selectBest = c("hastie", "min"),
seFactor = 1,
ncores = 1,
cl = NULL,
seed = NULL,
model = TRUE,
...
)
If fit is FALSE, an integer vector containing the indices of the sequenced predictor groups.
Else if crit is "PE", an object of class "perrySeqModel" (inheriting from class "perryTuning", see perryTuning). It contains information on the prediction error criterion, and includes the final model as component finalModel.
Otherwise an object of class "grplars" (inheriting from class "seqModel") with the following components:
active
an integer vector containing the sequence of predictor groups.
s
an integer vector containing the steps for which submodels along the sequence have been computed.
coefficients
a numeric matrix in which each column contains the regression coefficients of the corresponding submodel along the sequence.
fitted.values
a numeric matrix in which each column contains the fitted values of the corresponding submodel along the sequence.
residuals
a numeric matrix in which each column contains the residuals of the corresponding submodel along the sequence.
df
an integer vector containing the degrees of freedom of the submodels along the sequence (i.e., the number of estimated coefficients).
robust
a logical indicating whether a robust fit was computed.
scale
a numeric vector giving the robust residual scale estimates for the submodels along the sequence (only returned for a robust fit).
crit
an object of class "bicSelect" containing the BIC values and indicating the final model (only returned if argument crit is "BIC" and argument s indicates more than one step along the sequence).
muX
a numeric vector containing the center estimates of the predictor variables.
sigmaX
a numeric vector containing the scale estimates of the predictor variables.
muY
numeric; the center estimate of the response.
sigmaY
numeric; the scale estimate of the response.
x
the matrix of candidate predictors (if model is TRUE).
y
the response (if model is TRUE).
assign
an integer vector giving the predictor group to which each predictor variable belongs.
w
a numeric vector giving the data cleaning weights (only returned for a robust fit).
call
the matched function call.
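The relationship between the components active, s, df and crit can be illustrated with a minimal base R sketch. It is not the package's implementation: the data, the grouping in assign and the sequence in active are hypothetical, the submodels are plain least-squares fits via lm.fit, and the BIC formula used here is one common variant chosen for illustration.

```r
# Sketch only: nested least-squares fits along a given group sequence,
# with the final model picked by a simple BIC. All inputs are made up.
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 6), n, 6)
assign <- c(1, 1, 2, 2, 3, 3)   # hypothetical grouping: three groups of two
y <- drop(x[, 1:2] %*% c(2, -2)) + rnorm(n)

active <- c(1, 3, 2)            # hypothetical sequence of predictor groups
steps <- 0:length(active)       # step 0 is the intercept-only model

# degrees of freedom: intercept plus all variables in the included groups
df <- sapply(steps, function(k) 1 + sum(assign %in% active[seq_len(k)]))

# BIC of the submodel at each step (illustrative formula)
bic <- sapply(steps, function(k) {
  vars <- which(assign %in% active[seq_len(k)])
  X <- cbind(1, x[, vars, drop = FALSE])
  rss <- sum(lm.fit(X, y)$residuals^2)
  n * log(rss / n) + log(n) * ncol(X)
})

best <- steps[which.min(bic)]   # step of the selected submodel
```

Each step adds a whole group, so df grows by the size of the group entering the model rather than by one.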
x
a matrix or data frame containing the candidate predictors.
...
additional arguments to be passed down.
formula
a formula describing the full model.
data
an optional data frame, list or environment (or object coercible to a data frame by as.data.frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which grplars or rgrplars is called.
y
a numeric vector containing the response.
sMax
an integer giving the number of predictor groups to be sequenced. If it is NA (the default), predictor groups are sequenced as long as there are twice as many observations as expected predictor variables (number of predictor groups times the average number of predictor variables per group).
assign
an integer vector giving the predictor group to which each predictor variable belongs.
fit
a logical indicating whether to fit submodels along the sequence (TRUE, the default) or to simply return the sequence (FALSE).
s
an integer vector of length two giving the first and last step along the sequence for which to compute submodels. The default is to start with a model containing only an intercept (step 0) and iteratively add all groups along the sequence (step sMax). If the second element is NA, predictor groups are added to the model as long as there are twice as many observations as predictor variables. If only one value is supplied, it is recycled.
crit
a character string specifying the optimality criterion to be used for selecting the final model. Possible values are "BIC" for the Bayes information criterion and "PE" for resampling-based prediction error estimation.
splits
an object giving data splits to be used for prediction error estimation (see perry).
cost
a cost function measuring prediction loss (see perry for some requirements). The default is to use the root trimmed mean squared prediction error for a robust fit and the root mean squared prediction error otherwise (see cost).
costArgs
a list of additional arguments to be passed to the prediction loss function cost.
selectBest, seFactor
arguments specifying a criterion for selecting the best model (see perrySelect). The default is to use a one-standard-error rule.
ncores
a positive integer giving the number of processor cores to be used for parallel computing (the default is 1 for no parallelization). If this is set to NA, all available processor cores are used. For obtaining the data cleaning weights, for fitting models along the sequence and for prediction error estimation, parallel computing is implemented on the R level using package parallel. Otherwise parallel computing for some of the more computer-intensive computations in the sequencing step is implemented on the C++ level via OpenMP (https://www.openmp.org/).
cl
a parallel cluster for parallel computing as generated by makeCluster. This is preferred over ncores for tasks that are parallelized on the R level, in which case ncores is only used for tasks that are parallelized on the C++ level.
seed
optional initial seed for the random number generator (see .Random.seed). This is useful because many robust regression functions (including lmrob) involve randomness, as does prediction error estimation. On parallel R worker processes, random number streams are used and the seed is set via clusterSetRNGStream.
model
a logical indicating whether the model data should be included in the returned object.
centerFun
a function to compute a robust estimate for the center (defaults to median).
scaleFun
a function to compute a robust estimate for the scale (defaults to mad).
regFun
a function to compute robust linear regressions that can be interpreted as weighted least squares (defaults to lmrob).
regArgs
a list of arguments to be passed to regFun.
combine
a character string specifying how to combine the data cleaning weights from the robust regressions with each predictor group. Possible values are "min" for taking the minimum weight for each observation, "euclidean" for weights based on Euclidean distances of the multivariate set of standardized residuals (i.e., multivariate winsorization of the standardized residuals assuming independence), or "mahalanobis" for weights based on Mahalanobis distances of the multivariate set of standardized residuals (i.e., multivariate winsorization of the standardized residuals).
const
numeric; tuning constant for multivariate winsorization to be used in the initial correlation estimates based on adjusted univariate winsorization (defaults to 2).
prob
numeric; probability for the quantile of the \(\chi^{2}\) distribution to be used in multivariate winsorization (defaults to 0.95).
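To make the "min" and "euclidean" options for combining data cleaning weights more concrete, here is a minimal base R sketch. It is an illustration under stated assumptions, not the package's code: the residual matrix r and the per-group weights w_group are hypothetical stand-ins for what the robust regressions would return, and the shrinkage formula is the usual multivariate winsorization weight with a \(\chi^{2}\) quantile as cutoff.

```r
# Sketch only: combining per-group weights for each observation.
set.seed(42)
n <- 8   # observations
m <- 3   # predictor groups
prob <- 0.95

# hypothetical standardized residuals, one column per predictor group
r <- matrix(rnorm(n * m), n, m)
r[1, ] <- c(6, -5, 7)                  # plant one clear outlier

# hypothetical per-group weights in [0, 1] (stand-in for regression weights)
w_group <- pmin(1, 2 / abs(r))

# "min": smallest weight across groups for each observation
w_min <- apply(w_group, 1, min)

# "euclidean": shrink observations whose Euclidean distance exceeds the
# square root of the chi-squared quantile (independence assumed)
d <- sqrt(rowSums(r^2))
cutoff <- sqrt(qchisq(prob, df = m))
w_euclidean <- pmin(1, cutoff / d)
```

The "mahalanobis" option would replace the Euclidean distance d with a Mahalanobis distance based on a correlation estimate of the residuals, which is omitted here.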
Andreas Alfons
Alfons, A., Croux, C. and Gelper, S. (2016) Robust groupwise least angle regression. Computational Statistics & Data Analysis, 93, 421-435. doi:10.1016/j.csda.2015.02.007
coef, fitted, plot, predict, residuals, rstandard, lmrob
# load the package providing grplars(), rgrplars() and the TopGear data
library("robustHD")
data("TopGear")
# keep complete observations
keep <- complete.cases(TopGear)
TopGear <- TopGear[keep, ]
# remove information on car model
info <- TopGear[, 1:3]
TopGear <- TopGear[, -(1:3)]
# log-transform price
TopGear$Price <- log(TopGear$Price)
# robust groupwise LARS
rgrplars(MPG ~ ., data = TopGear, sMax = 15)
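When predictors include factors, each factor expands into several dummy variables that belong together as one group. The following base R sketch shows one way such a grouping vector could be derived for the default method's assign argument, together with a literal reading of the default rule for sMax. The toy data frame is made up, and whether the formula method builds its groups exactly this way is an assumption.

```r
# Sketch only: derive a grouping vector from a data frame with a factor.
dat <- data.frame(
  y    = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8),
  size = c(10, 12, 9, 15, 11, 14),
  fuel = factor(c("petrol", "diesel", "hybrid", "petrol", "diesel", "hybrid"))
)

# model.matrix() records which term each column comes from
mm <- model.matrix(y ~ ., data = dat)
x <- mm[, -1, drop = FALSE]          # drop the intercept column
assign <- attr(mm, "assign")[-1]     # group index for each remaining column

# literal reading of the default sMax rule: keep twice as many
# observations as expected predictor variables
n_groups <- length(unique(assign))
avg_size <- length(assign) / n_groups
sMax_default <- min(n_groups, floor(nrow(x) / (2 * avg_size)))
```

Here size forms a group of its own, while the two dummy variables for fuel share a group index.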