ltsReg: Least Trimmed Squares Robust (High Breakdown) Regression

Description

Carries out least trimmed squares (LTS) robust (high breakdown point) regression.

Usage

ltsReg(x, ...)

## S3 method for class 'formula'
ltsReg(formula, data, subset, weights, na.action, model = TRUE, x.ret = FALSE, y.ret = FALSE, contrasts = NULL, offset, ...)

## Default S3 method:
ltsReg(x, y, intercept = TRUE, alpha = , nsamp = , adjust = , mcd = TRUE, qr.out = FALSE, yname = NULL, seed = , trace = , use.correction = , wgtFUN = , control = rrcov.control(), ...)

Arguments

formula

a formula of the form y ~ x1 + x2 + ....

data

data frame from which variables specified in formula are to be taken.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

weights

an optional vector of weights to be used in the fitting process. NOT USED YET.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The “factory-fresh” default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.

model, x.ret, y.ret

logicals indicating if the model frame, the model matrix and the response are to be returned, respectively.

contrasts

an optional list. See the contrasts.arg of model.matrix.default.

offset

this can be used to specify an a priori known component to be included in the linear predictor during fitting. An offset term can be included in the formula instead or as well, and if both are specified their sum is used.

x

a matrix or data frame containing the explanatory variables.

y

the response: a vector whose length equals the number of rows of x.

intercept

if TRUE, a model with a constant term will be estimated; otherwise no constant term will be included. Default is intercept = TRUE.

alpha

the percentage (roughly) of squared residuals whose sum will be minimized, by default 0.5. In general, alpha must be between 0.5 and 1.

nsamp

number of subsets used for initial estimates or "best" or "exact". Default is nsamp = 500. For nsamp="best" exhaustive enumeration is done, as long as the number of trials does not exceed 5000. For "exact", exhaustive enumeration will be attempted however many samples are needed. In this case a warning message will be displayed saying that the computation can take a very long time.

adjust

whether to perform intercept adjustment at each step. Since this can be time consuming, the default is adjust = FALSE.

mcd

whether to compute robust distances using Fast-MCD.

qr.out

whether to return the QR decomposition (see qr); defaults to FALSE.

yname

the name of the dependent variable. Default is yname = NULL

seed

initial seed for random generator, like .Random.seed, see rrcov.control.

trace

logical (or integer) indicating if intermediate results should be printed; defaults to FALSE; values $\ge 2$ also produce output from the internal (Fortran) code.

use.correction

whether to use finite sample correction factors. Default is use.correction=TRUE

wgtFUN

a character string or function, specifying how the weights for the reweighting step should be computed. Up to April 2013, the only option has been the original proposal in Rousseeuw and Van Driessen (1999), now specified by wgtFUN = "01.original" (or via control).

control

a list of estimation options, the same as those available as arguments in the function specification. If a control object is supplied, its parameters are used. Parameters that are also passed directly in the call override the corresponding elements of the control object.

...

arguments passed to or from other methods.
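
The interaction between control and directly supplied arguments can be sketched as follows. This assumes the robustbase package (which provides ltsReg and rrcov.control) is installed; the alpha values 0.75 and 0.9 are arbitrary choices for illustration.

```r
library(robustbase)

# Options collected in a control object (alpha = 0.75 chosen for illustration).
ctrl <- rrcov.control(alpha = 0.75)

# Here the control object supplies alpha.
fit_ctrl <- ltsReg(stack.loss ~ ., data = stackloss, control = ctrl)

# An argument given directly in the call takes precedence over the control object.
fit_over <- ltsReg(stack.loss ~ ., data = stackloss, alpha = 0.9, control = ctrl)

# The alpha actually used is stored in the fitted object.
c(control = fit_ctrl$alpha, override = fit_over$alpha)
```

Collecting options in a control object is convenient when the same settings are reused across several fits, while a one-off change can still be made in the individual call.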

Value

crit: the value of the objective function of the LTS regression method, i.e., the sum of the $h$ smallest squared raw residuals.
coefficients: vector of coefficient estimates (including the intercept by default when intercept=TRUE), obtained after reweighting.
best: the best subset found and used for computing the raw estimates, with length(best) == quan = h.alpha.n(alpha,n,p).
fitted.values: vector like y containing the fitted values of the response after reweighting.
residuals: vector like y containing the residuals from the weighted least squares regression.
scale: scale estimate of the reweighted residuals.
alpha: same as the input parameter alpha.
quan: the number $h$ of observations which have determined the least trimmed squares estimator.
intercept: same as the input parameter intercept.
cnp2: a vector of length two containing the consistency correction factor and the finite sample correction factor of the final estimate of the error scale.
raw.coefficients: vector of raw coefficient estimates (including the intercept, when intercept=TRUE).
raw.scale: scale estimate of the raw residuals.
raw.resid: vector like y containing the raw residuals from the regression.
raw.cnp2: a vector of length two containing the consistency correction factor and the finite sample correction factor of the raw estimate of the error scale.
lts.wt: vector like y containing weights that can be used in a weighted least squares. These weights are 1 for points with reasonably small residuals, and 0 for points with large residuals.
raw.weights: vector containing the raw weights based on the raw residuals and raw scale.
method: character string naming the method (Least Trimmed Squares).
X: the input data as a matrix (including intercept column if applicable).
Y: the response variable as a vector.
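
The 0/1 weights described for lts.wt and raw.weights can be illustrated with a hard-rejection rule. The following is a sketch in base R only: hard_weights() is a hypothetical helper, not part of the package, and the cutoff qnorm(0.9875) (about 2.24) is an assumption about the "01.original" rule rather than something stated on this page.

```r
# Hypothetical helper: weight 1 if the standardized residual is small, else 0.
# The default cutoff is an assumption, not taken from this help page.
hard_weights <- function(resid, scale, cutoff = qnorm(0.9875)) {
  as.numeric(abs(resid / scale) <= cutoff)
}

# Made-up residuals; the last one is clearly outlying.
resid <- c(-0.3, 0.8, -1.1, 0.2, 6.5)
w <- hard_weights(resid, scale = 1)
w
```

Observations with weight 0 are effectively dropped from a subsequent weighted least squares fit, which is what the reweighting step relies on.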

Details

The LTS regression method minimizes the sum of the $h$ smallest squared residuals, where $h > n/2$, i.e., at least half the number of observations must be used. With $n$ the total number of observations and $p$ the number of regression coefficients, the default value of $h$ (when alpha = 1/2) is roughly $n/2$, more precisely (n+p+1) %/% 2. By setting alpha, the user may choose higher values up to $n$; in general $h = h(\alpha,n,p) =$ h.alpha.n(alpha,n,p).

The LTS estimate of the error scale is given by the minimum of the objective function multiplied by a consistency factor and a finite sample correction factor; see Pison et al. (2002) for details. The rescaling factors for the raw and final estimates are also returned in the length-2 vectors raw.cnp2 and cnp2, respectively. The finite sample corrections can be suppressed by setting use.correction=FALSE.

The computations are performed using the Fast LTS algorithm proposed by Rousseeuw and Van Driessen (1999).
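
As a minimal sketch of the objective described above (base R only, and not the Fast LTS algorithm), the hypothetical helper lts_objective() evaluates the sum of the h smallest squared residuals for a given coefficient vector. The helper h_interp() is an assumed linear interpolation between the two endpoints stated in the text, (n+p+1) %/% 2 at alpha = 1/2 and n at alpha = 1; the package's own h.alpha.n() should be used in practice and may differ in detail.

```r
# Hypothetical helper: LTS objective for a candidate coefficient vector.
lts_objective <- function(beta, X, y, h) {
  r2 <- as.vector(y - X %*% beta)^2   # squared residuals
  sum(sort(r2)[seq_len(h)])           # sum of the h smallest
}

# Assumed interpolation between the endpoints given in the text
# (alpha = 1/2 -> (n+p+1) %/% 2, alpha = 1 -> n); not the package code.
h_interp <- function(alpha, n, p) {
  h_min <- (n + p + 1) %/% 2
  floor(h_min + (n - h_min) * (2 * alpha - 1))
}

X <- cbind(1, as.matrix(stackloss[, 1:3]))  # design matrix with intercept column
y <- stackloss$stack.loss
n <- nrow(X)
p <- ncol(X)

h <- h_interp(0.5, n, p)                    # default subset size
beta_ls <- qr.coef(qr(X), y)                # ordinary least squares, for comparison only
obj_ls <- lts_objective(beta_ls, X, y, h)   # trimmed objective at the LS fit
```

The LTS estimator is the coefficient vector that minimizes this trimmed objective, so the value at the least squares fit is only an upper bound on the minimum.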

As always, the formula interface has an implied intercept term which can be removed either by y ~ x - 1 or y ~ 0 + x. See formula for more details.
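
The effect of the two intercept-removing forms can be checked on the design matrix with base R's model.matrix(). The small data frame d is made up for illustration.

```r
# Made-up data for illustration.
d <- data.frame(y = c(1.2, 2.1, 2.9, 4.2), x = c(1, 2, 3, 4))

with_int <- model.matrix(y ~ x, data = d)      # implied intercept column
no_int_a <- model.matrix(y ~ x - 1, data = d)  # intercept removed
no_int_b <- model.matrix(y ~ 0 + x, data = d)  # equivalent form

colnames(with_int)
colnames(no_int_a)
colnames(no_int_b)
```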

References

Peter J. Rousseeuw (1984), Least Median of Squares Regression. Journal of the American Statistical Association 79, 871--881.

P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley.

P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212--223.

Pison, G., Van Aelst, S., and Willems, G. (2002) Small Sample Corrections for LTS and MCD. Metrika 55, 111--123.

Examples

library(robustbase) # provides ltsReg() and the 'heart' data
data(heart)
## Default method works with 'x'-matrix and y-var:
heart.x <- data.matrix(heart[, 1:2]) # the X-variables
heart.y <- heart[,"clength"]
ltsReg(heart.x, heart.y)

data(stackloss)
ltsReg(stack.loss ~ ., data = stackloss)
