Learn R Programming

robustX (version 1.2-7)

BACON: BACON for Regression or Multivariate Covariance Estimation

Description

BACON, short for ‘Blocked Adaptive Computationally-Efficient Outlier Nominators’, is a somewhat robust algorithm (set), with an implementation for regression or multivariate covariance estimation.

BACON() applies the multivariate (covariance estimation) algorithm, using mvBACON(x) in any case, and when y is not NULL adds a regression iteration phase, using the auxiliary .lmBACON() function.

Usage

BACON(x, y = NULL, intercept = TRUE,
      m = min(collect * p, n * 0.5),
      init.sel = c("Mahalanobis", "dUniMedian", "random", "manual", "V2"),
      man.sel, init.fraction = 0, collect = 4,
      alpha = 0.05, alphaLM = alpha, maxsteps = 100, verbose = TRUE)

## *Auxiliary* function: .lmBACON(x, y, intercept = TRUE, init.dis, init.fraction = 0, collect = 4, alpha = 0.05, maxsteps = 100, verbose = TRUE)

Value

BACON(x,y,..) (for regression) returns a list with components

subset

the observation indices (in 1:n) denoting a subset of “good” supposedly outlier-free observations.

tis

the \(t_i(y_m, X_m)\) of eq (6) in the reference; the clean “basic subset” in the algorithm is defined the observations \(i\) with the smallest \(|t_i|\), and the \(t_i\) can be regarded as scaled predicted errors.

mv.dis

the (final) discrepancies or distances of mvBACON().

mv.subset

the “good” subset from mvBACON(), used to start the regression iterations.

Arguments

x

a multivariate matrix of dimension [n x p] considered as containing no missing values.

y

the response (n vector) in the case of regression, or NULL for the multivariate case, where just mvBACON() is returned.

intercept

logical indicating if an intercept has to be used for the regression.

m

integer in 1:n specifying the size of the initial basic subset; used only when init.sel is not "manual"; see mvBACON.

init.sel

character string, specifying the initial selection mode; see mvBACON.

man.sel

only when init.sel == "manual", the indices of observations determining the initial basic subset (and m <- length(man.sel)).

init.dis

the distances of the x matrix used for the initial subset determined by mvBACON.

init.fraction

if this parameter is > 0 then the tedious steps of selecting the initial subset are skipped and an initial subset of size n * init.fraction is chosen (with smallest dis)

collect

numeric factor chosen by the user to define the size of the initial subset (p * collect)

alpha

number in \((0, 1)\) determining the cutoff value for the Mahalanobis distances (multivariate outlier nomination in mvBACON()), or the discrepancies for regression, see alphaLM.

alphaLM

number in \((0, 1)\) where a 1-alphaM t-quantile is the cutoff for the discrepancies (for regression, .lmBACON()); see details.

maxsteps

the maximal number of iteration steps (to prevent infinite loops)

verbose

logical indicating if messages are printed which trace progress of the algorithm.

Author

Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1; 25.05.2001; modified six times till 17.6.2001.

Port to R, testing etc, by Martin Maechler. Daniel Weeks (at pitt.edu) proposed a fix to a long standing buglet in GiveTis() computing the \(t_i\), which was further improved Maechler, for robustX version 1.2-3 (Feb. 2019).

Correction of alpha default, from 0.95 to 0.05, by Tobias Schoch, see mvBACON.

Details

Notably about the initial selection mode, init.sel, see its description in the mvBACON arguments list.

The choice of alpha and alphaLM:

  • Multivariate outlier nomination: see the Details section of mvBACON.

  • Regression: Let \(t_r(\alpha)\) denote the \(1-\alpha\) quantile of the Student \(t\)-distribution with \(r\) degrees of freedom, where \(r\) is the number of elements in the current subset; e.g., \(t_r(0.05)\) is the 0.95 quantile. Following Billor et al. (2000), the cutoff value for the discrepancies is defined as \(t_r(\alpha/(2r + 2))\), and they use \(\alpha=0.05\). Note that this is argument alphaLM (defualting to alpha) for BACON().

References

Billor, N., Hadi, A. S., and Velleman , P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators; Computational Statistics and Data Analysis 34, 279--298. tools:::Rd_expr_doi("10.1016/S0167-9473(99)00101-2")

See Also

mvBACON, the multivariate version of the BACON algorithm.

Examples

Run this code
data(starsCYG, package = "robustbase")
## Plot simple data and fitted lines
plot(starsCYG)
lmST <- lm(log.light ~ log.Te, data = starsCYG)
abline(lmST, col = "gray") # least squares line
str(B.ST <- with(starsCYG,  BACON(x = log.Te, y = log.light)))
## 'subset': A good set of of points (to determine regression):
colB <- adjustcolor(2, 1/2)
points(log.light ~ log.Te, data = starsCYG, subset = B.ST$subset,
       pch = 19, cex = 1.5, col = colB)
## A BACON-derived line:
lmB <- lm(log.light ~ log.Te, data = starsCYG, subset = B.ST$subset)
abline(lmB, col = colB, lwd = 2)

require(robustbase)
(RlmST <- lmrob(log.light ~ log.Te, data = starsCYG))
abline(RlmST, col = "blue")

Run the code above in your browser using DataLab