A wrapper for the roll_hampel() function that identifies outliers using either a user-specified threshold value or a threshold value based on the statistics of the incoming data.
findOutliers(
x,
n = 41,
thresholdMin = 10,
selectivity = NA,
increment = 1,
fixedThreshold = TRUE
)
A vector of indices associated with outliers in the incoming data x.
x: an R numeric vector
n: integer window size
thresholdMin: initial value for outlier detection
selectivity: value in the range [0, 1] used in determining outliers, or NA if fixedThreshold=TRUE
increment: integer shift to use when sliding the window to the next location
fixedThreshold: logical; if TRUE, thresholdMin is used as a fixed threshold and selectivity is ignored (see below)
The thresholdMin level is similar to a sigma value for normally distributed data.
Hampel filter values above 6 indicate a data value that is extremely unlikely
to be part of a normal distribution (about 1 in 500 million) and therefore very likely to be an
outlier. Choosing a relatively large value for thresholdMin makes it less likely that we
will generate false positives. False positives can include high-frequency environmental noise.
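The one-in-500-million figure can be checked with a line of base R. This is only an arithmetic illustration that assumes exactly normal data; Hampel values are built from the median and MAD rather than the mean and standard deviation, so the correspondence with sigma is approximate.

```r
# Two-sided probability that a standard normal value lies more than
# 6 standard deviations from the mean
p <- 2 * pnorm(-6)
p        # roughly 2e-9
1 / p    # roughly 5e8, i.e. about 1 in 500 million
```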
With the default setting of fixedThreshold=TRUE, any value above the threshold is considered
an outlier and selectivity is ignored.
The selectivity is a value between 0 and 1 and is used to generate an appropriate
threshold for outlier detection based on the statistics of the incoming data. A lower value
for selectivity will result in more outliers while a value closer to 1.0 will result in
fewer. If fixedThreshold=TRUE, selectivity may have a value of NA.
When the user specifies fixedThreshold=FALSE, the thresholdMin and selectivity
parameters work like squelch and volume on a CB radio: thresholdMin sets a noise threshold
below which you don't want anything returned, while selectivity adjusts the number of points
defined as outliers by setting a new threshold equal to the maximum value returned by
roll_hampel() multiplied by selectivity.
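The interplay of the two modes can be sketched in plain R. The helper below, pickOutliers(), is hypothetical and is not part of the package; it encodes one reading of the description above, applied to a made-up vector h of Hampel filter values. The package's actual logic may differ in its details.

```r
# Hypothetical helper, not part of the package: choose outlier indices
# from a vector `h` of Hampel filter values.
pickOutliers <- function(h, thresholdMin = 10, selectivity = NA,
                         fixedThreshold = TRUE) {
  maxH <- max(h, na.rm = TRUE)
  # "Squelch": return nothing if the largest value is below thresholdMin
  if (maxH < thresholdMin) return(integer(0))
  if (fixedThreshold) {
    # Fixed mode: selectivity is ignored
    which(h > thresholdMin)
  } else {
    # "Volume": the threshold scales with the largest Hampel value
    which(h > maxH * selectivity)
  }
}

h <- c(1, 2, 12, 3, 40, 25, 2)   # made-up Hampel values

pickOutliers(h)                                              # 3 5 6
pickOutliers(h, selectivity = 0.5, fixedThreshold = FALSE)   # 5 6
pickOutliers(h, selectivity = 0.9, fixedThreshold = FALSE)   # 5
pickOutliers(c(1, 2, 3))                                     # integer(0)
```

In this sketch, raising selectivity from 0.5 to 0.9 moves the threshold from 20 to 36, so fewer points qualify.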
n, the window size, is passed to roll_hampel().
The default value of increment=1 should not be changed. Outliers are defined
as individual points that stand apart from their neighbors. Applying the Hampel filter to
every other point by using increment > 1 will invariably miss some of the outliers.
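To illustrate how the window size and increment interact, here is a minimal base R sketch of a rolling Hampel statistic. The function hampelSketch() is hypothetical and is not the package's roll_hampel(); it assumes a centered window of odd length n, scores each evaluated point as its absolute deviation from the window median divided by the window MAD, and leaves NA wherever no window was evaluated.

```r
# Hypothetical helper, not the package's roll_hampel(): a plain-R rolling
# Hampel statistic, |x[i] - median(window)| / mad(window), centered on i.
hampelSketch <- function(x, n = 41, increment = 1) {
  stopifnot(n %% 2 == 1, increment >= 1)
  half <- (n - 1) / 2
  h <- rep(NA_real_, length(x))
  # Only every `increment`-th window center is evaluated
  centers <- seq(half + 1, length(x) - half, by = increment)
  for (i in centers) {
    w <- x[(i - half):(i + half)]
    h[i] <- abs(x[i] - median(w)) / mad(w)  # mad() applies the 1.4826 scaling
  }
  h
}

set.seed(42)
x <- rnorm(200)
x[100] <- 50   # a single planted outlier

h1 <- hampelSketch(x, n = 41, increment = 1)
h2 <- hampelSketch(x, n = 41, increment = 2)

100 %in% which(h1 > 10)   # the planted point is evaluated and flagged
is.na(h2[100])            # with increment = 2 it is never evaluated
```

With n = 41 the first window center is index 21, so increment = 2 visits only odd indices and the outlier planted at index 100 is skipped entirely.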
roll_hampel
# Noisy sinusoid with outliers
a <- jitter(sin(0.1*seq(1e4)),amount=0.2)
indices <- sample(seq(1e4),20)
a[indices] <- a[indices]*10
# Outlier detection should identify many of these altered indices
sort(indices)
findOutliers(a)