Clean data by means of winsorization, i.e., by shrinking outlying observations to the border of the main part of the data.
winsorize(x, ...)
# S3 method for default
winsorize(
x,
standardized = FALSE,
centerFun = median,
scaleFun = mad,
const = 2,
return = c("data", "weights"),
...
)
# S3 method for matrix
winsorize(
x,
standardized = FALSE,
centerFun = median,
scaleFun = mad,
const = 2,
prob = 0.95,
tol = .Machine$double.eps^0.5,
return = c("data", "weights"),
...
)
# S3 method for data.frame
winsorize(x, ...)
If standardized is TRUE and return is "weights", a set of data cleaning weights is returned. Multiplying each observation of the standardized data by the corresponding weight yields the cleaned standardized data.
Otherwise an object of the same type as the original data x containing the cleaned data is returned.
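The relation between weights and cleaned data can be illustrated with a minimal base R sketch. It assumes the univariate case with borders at +/- const on the standardized scale; winsor_weights is a hypothetical helper written for this illustration, not a function of the package, and the package's internal computation may differ.

```r
# Hypothetical helper (not part of the package): univariate cleaning
# weights for data that are already robustly standardized.
winsor_weights <- function(z, const = 2) {
  # Observations inside [-const, const] keep weight 1; outlying ones get
  # weight const / |z|, so that z * w lands exactly on the border.
  pmin(1, const / abs(z))
}

z <- c(-3.5, -0.4, 0.1, 1.2, 5)  # standardized data with two outliers
w <- winsor_weights(z)           # data cleaning weights
z_clean <- z * w                 # cleaned standardized data
z_clean
```

Under these assumptions, only the two outlying values are moved, and they end up on the borders -const and const.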
x
a numeric vector, matrix or data frame to be cleaned.
...
for the generic function, additional arguments to be passed down to methods. For the "data.frame" method, additional arguments to be passed down to the "matrix" method. For the other methods, additional arguments to be passed down to robStandardize.
standardized
a logical indicating whether the data are already robustly standardized.
centerFun
a function to compute a robust estimate for the center to be used for robust standardization (defaults to median). Ignored if standardized is TRUE.
scaleFun
a function to compute a robust estimate for the scale to be used for robust standardization (defaults to mad). Ignored if standardized is TRUE.
const
numeric; tuning constant to be used in univariate winsorization (defaults to 2).
return
character string; if standardized is TRUE, this specifies the type of return value. Possible values are "data" for returning the cleaned data, or "weights" for returning data cleaning weights.
prob
numeric; probability for the quantile of the \(\chi^{2}\) distribution to be used in multivariate winsorization (defaults to 0.95).
tol
a small positive numeric value used to determine singularity issues in the computation of correlation estimates based on bivariate winsorization (see corHuber).
Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar
The borders of the main part of the data are defined on the scale of the robustly standardized data. In the univariate case, the borders are given by \(\pm\)const, thus a symmetric distribution is assumed. In the multivariate case, a normal distribution is assumed and the data are shrunken towards the boundary of a tolerance ellipse with coverage probability prob. The boundary of this ellipse is thereby given by all points that have a squared Mahalanobis distance equal to the quantile of the \(\chi^{2}\) distribution given by prob.
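The two cases described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: winsorize_uni and winsorize_multi are hypothetical names, the multivariate sketch expects data that are already standardized, and the correlation matrix R is supplied by the caller rather than estimated robustly.

```r
# Hypothetical sketch, univariate case: standardize with median and MAD,
# clip at +/- const, then transform back to the original scale.
winsorize_uni <- function(x, const = 2) {
  center <- median(x)
  scale <- mad(x)
  z <- (x - center) / scale
  z <- pmax(pmin(z, const), -const)
  center + scale * z
}

# Hypothetical sketch, multivariate case: shrink rows of z that lie outside
# the tolerance ellipse onto its boundary. Assumes z is standardized and
# R is a correlation matrix supplied by the caller.
winsorize_multi <- function(z, R, prob = 0.95) {
  d <- qchisq(prob, df = ncol(z))  # squared Mahalanobis distance of boundary
  md <- mahalanobis(z, center = rep(0, ncol(z)), cov = R)
  w <- pmin(sqrt(d / md), 1)       # weight 1 for rows inside the ellipse
  z * w                            # w has one entry per row of z
}

x <- c(0.5, -1, 0.2, 1.1, -0.3, 25)  # last observation is an outlier
x_clean <- winsorize_uni(x)

z <- rbind(c(0.1, 0.2), c(5, 5))     # second row lies outside the ellipse
z_clean <- winsorize_multi(z, R = diag(2))
```

In this sketch the outlier in x is moved to median(x) + 2 * mad(x), and the second row of z is rescaled so that its squared Mahalanobis distance equals qchisq(0.95, df = 2).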
Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289-1299. doi:10.1198/016214507000000950
corHuber
## generate data
set.seed(1234) # for reproducibility
x <- rnorm(10) # standard normal
x[1] <- x[1] * 10 # introduce outlier
## winsorize data
x
winsorize(x)