despike: Remove spikes from a time series

Description

The method identifies spikes with respect to a "reference" time series, and replaces these spikes with the reference value, or with NA, according to the value of replace; see "Details".

Usage

despike(
  x,
  reference = c("median", "smooth", "trim"),
  n = 4,
  k = 7,
  min = NA,
  max = NA,
  replace = c("reference", "NA"),
  skip
)

Value

A copy of x (a new vector, if x is a vector) in which spikes are replaced as described above.

Arguments

x: a vector of (time-series) values, a list of vectors, a data frame, or an oce object.
reference: indication of the type of reference time series to be used in the detection of spikes; see ‘Details’.
n: an indication of the limit to differences between x and the reference time series, used for reference="median" or reference="smooth"; see ‘Details’.
k: length of running median used with reference="median", and ignored for other values of reference.
min: minimum non-spike value of x, used with reference="trim".
max: maximum non-spike value of x, used with reference="trim".
replace: an indication of what to do with spike values, with "reference" indicating to replace them with the reference time series, and "NA" indicating to replace them with NA.
skip: optional vector naming columns to be skipped. This is ignored if x is a simple vector. Any items named in skip will be passed through to the return value without modification. In some cases, despike will set up reasonable defaults for skip, e.g. for a ctd object, skip will be set to c("time", "scan", "pressure") if it is not supplied as an argument.

Author

Dan Kelley

Details

Three modes of operation are permitted, depending on the value of reference.

For reference="median", the first step is to linearly interpolate across any gaps (spots where x is NA, i.e. where is.na(x) is TRUE), using approx() with rule=2. The second step is to pass this through runmed() to get a running median spanning k elements. The result of these two steps is the "reference" time series. Then, the standard deviation of the difference between x and the reference is calculated. Any x values that differ from the reference by more than n times this standard deviation are considered to be spikes. If replace="reference", the spike values are replaced with the reference, and the resultant time series is returned. If replace="NA", the spikes are replaced with NA, and that result is returned.
For reference="smooth", the processing is the same as for "median", except that smooth() is used to calculate the reference time series.
For reference="trim", the reference time series is constructed by linear interpolation across any regions in which x<min or x>max. (Again, this is done with approx() with rule=2.) In this case, the value of n is ignored, and the return value is the same as x, except that spikes are replaced with the reference series (if replace="reference") or with NA (if replace="NA").
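
The "median" and "trim" modes described above can be sketched in base R as follows. This is only an illustration of the steps as stated in this section, not the code used inside despike(): the function names despike_median_sketch and despike_trim_sketch are hypothetical, the sketch handles plain numeric vectors only, and details such as taking the standard deviation with na.rm=TRUE are assumptions.

```r
# Hypothetical helper (not part of oce): the reference="median" steps.
despike_median_sketch <- function(x, n = 4, k = 7,
                                  replace = c("reference", "NA")) {
    replace <- match.arg(replace)
    i <- seq_along(x)
    ok <- !is.na(x)
    # Step 1: interpolate linearly across gaps; rule=2 extends the end values.
    filled <- approx(i[ok], x[ok], xout = i, rule = 2)$y
    # Step 2: running median spanning k elements (k must be odd).
    # For a "smooth"-style reference, smooth(filled) could be used here instead.
    ref <- runmed(filled, k = k)
    # Step 3: flag points deviating by more than n standard deviations.
    # Assumption: NA values in x are ignored when computing the deviation.
    dev <- x - ref
    spike <- !is.na(dev) & abs(dev) > n * sd(dev, na.rm = TRUE)
    x[spike] <- if (replace == "reference") ref[spike] else NA
    x
}

# Hypothetical helper (not part of oce): the reference="trim" steps.
despike_trim_sketch <- function(x, min = NA, max = NA,
                                replace = c("reference", "NA")) {
    replace <- match.arg(replace)
    i <- seq_along(x)
    # A point is a spike if it lies outside [min, max]; an NA limit is ignored.
    bad <- !is.na(x) & ((!is.na(min) & x < min) | (!is.na(max) & x > max))
    keep <- !bad & !is.na(x)
    # Reference series: linear interpolation across the trimmed regions.
    ref <- approx(i[keep], x[keep], xout = i, rule = 2)$y
    x[bad] <- if (replace == "reference") ref[bad] else NA
    x
}

# Usage on a smooth signal with one artificial spike.
y <- sin(seq(0, 2 * pi, length.out = 50))
y[25] <- 10
clean_median <- despike_median_sketch(y)
clean_trim <- despike_trim_sketch(y, min = -3, max = 3)
```

Because a single large spike inflates the standard deviation used as the threshold, the "median" sketch can miss smaller spikes in the same series; the "trim" variant avoids this by relying on fixed limits instead.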

Examples

library(oce)
n <- 50
x <- 1:n
y <- rnorm(n=n)
y[n/2] <- 10                    # 10 standard deviations
plot(x, y, type='l')
lines(x, despike(y), col='red')
lines(x, despike(y, reference="smooth"), col='darkgreen')
lines(x, despike(y, reference="trim", min=-3, max=3), col='blue')
legend("topright", lwd=1, col=c("black", "red", "darkgreen", "blue"),
       legend=c("raw", "median", "smooth", "trim"))

# add a spike to a CTD object
data(ctd)
plot(ctd)
T <- ctd[["temperature"]]
T[10] <- T[10] + 10
ctd[["temperature"]] <- T
CTD <- despike(ctd)
plot(CTD)
