Automatic homogenization of climatological series, including missing data filling and detection and correction of outliers and shifts in the mean of the series.
homogen(varcli, anyi, anyf, suf=NA, nm=NA, nref=c(10,10,4), std=3, swa=NA,
ndec=1, dz.max=5, dz.min=-dz.max, wd=c(0,0,100), snht1=25, snht2=snht1,
tol=.02, maxdif=NA, mxdif=maxdif, maxite=999, force=FALSE, wz=.001, trf=0,
mndat=NA, gp=3, ini=NA, na.strings="NA", vmin=NA, vmax=NA, nclust=100,
cutlev=NA, grdcol=grey(.4), mapcol=grey(.4), hires=TRUE, expl=FALSE,
metad=FALSE, sufbrk='m', tinc=NA, tz='UTC', cex=1.2, verb=TRUE)
varcli: Acronym of the name of the studied climatic variable, as in the input data files.
anyi: Initial year of the data present in the file.
anyf: Final year of the data present in the file.
suf: Optional suffix appended with a '-' to the name of the variable in the input files.
nm: Number of data per year in each station. (Defaults to NA, and then it will be computed from the total number of data.)
nref: Maximum number of references for data estimation. (Defaults to 10 in the detection stages, and to 4 in the final series adjustments.)
std: Type of normalization:
1 = deviations from the mean,
2 = rates to the mean (only for means greater than 1),
3 = standardization (subtract the mean and divide by the sample standard deviation).
swa: Size of the step forward to be applied to the staggered window application of SNHT. If not set (the default), 365 terms (one year) will be used for daily data, and 60 otherwise.
ndec: Number of decimal digits to which the homogenized data must be rounded.
dz.max: Threshold of outlier tolerance, in standard deviations. (5 by default.)
dz.min: Lower threshold of outlier tolerance, if different from the upper one. (By default they will be the same, with opposite signs.) If a positive value is given, its sign will be changed.
wd: Distance (in km) at which reference data will weigh half of those of another station located at zero distance. (Defaults to c(0,0,100), meaning that no weighting will be applied in the first two stages, and 100 km will be used in the third.)
snht1: Threshold value for the stepped SNHT window test applied in stage 1. (25 by default. No SNHT analysis will be performed if snht1=0.)
snht2: Threshold value for the SNHT test when applied to the complete series in stage 2. (Same value as snht1 by default.)
tol: Tolerance factor for splitting several series at the same iteration. The default is 0.02, meaning that a 2% margin will be allowed for every reference used. (E.g.: if the maximum SNHT test value in a series is 30 and 10 references were used to compute its anomalies, the series will be split if the maximum test of the reference series is lower than 30*(1+0.02*10)=36. Set tol=0 to disable further splits when any reference series has already been split at the same iteration.)
maxdif: Maximum difference of any data item in consecutive iterations. If not set, it defaults to half of the data precision (defined by the number of decimals).
mxdif: Old name of the maxdif parameter (maintained for compatibility).
maxite: Maximum number of iterations when computing the means of the series. (999 by default.)
force: Force the break even when only one reference is available. (FALSE by default.)
wz: Scale parameter of the vertical coordinate Z. (0.001 by default, which gives the vertical coordinate, in m, the same weight as the horizontal coordinates, internally managed in km.)
trf: By default, data are not transformed (trf=0), but if the data frequency distribution is very skewed, the user can choose to apply a log(x+1) transformation (trf=1) or any root of index trf>1 (2 for square root, 3 for cubic root, etc.; fractional numbers are allowed).
mndat: Minimum number of data for a split fragment to become a new series. It defaults to half of the swa value for daily data, or to nm otherwise, with a minimum of 5 terms.
gp: Graphic parameter:
0 = no graphic output,
1 = only descriptive graphics of the input data,
2 = as with 1, plus diagnostic graphics of anomalies,
3 = as with 2, plus graphics of running annual means and applied corrections,
4 = as with 3, but running annual totals (instead of means) will be plotted in the last set of graphics (better when working with precipitation data).
ini: Initial date, with format 'YYYY-MM-DD'. If not set, it will be assumed that the series begin on the first of January of the initial year anyi.
na.strings: Character string to be treated as a missing value. (It can be a vector of strings if more than one code is needed.) Defaults to 'NA', the standard R missing data code.
vmin: Minimum possible value (lower limit) of the studied variable. Unset by default, but note that vmin=0 will be applied if std is set to 2.
vmax: Maximum possible value (upper limit) of the studied variable. (E.g., for relative humidity or relative sunshine hours it is advisable to set vmax=100.)
nclust: Maximum number of stations for the cluster analysis. (If much greater than 100, the default value, the process may be too long and the graphic too dense.)
cutlev: Level at which to cut the dendrogram to define clusters. (Automatic by default.)
grdcol: Color of the graphic background grids. (Gray by default.)
mapcol: Color of the background map. (Gray by default.)
hires: By default, the background map will be drawn in high resolution. Set this parameter to FALSE if you are studying a big geographical area (>1000 km).
expl: Set this to TRUE to perform an exploratory analysis.
metad: Set this to TRUE if a metadata file is provided (see the details).
sufbrk: Suffix to add to varcli to form the name of the provided metadata file. This parameter is only relevant when metad=TRUE. Its default value 'm' is meant to read the file of break-points detected at the monthly scale. (Do not set a sufbrk longer than 3 characters, to avoid interference when homogen is called from the homogsplit function.)
tinc: Time increment between data. Not set by default, but it can be defined for subdaily data, e.g. tinc='3 hour'.
tz: Time zone. Only relevant for subdaily data. ('UTC' by default.)
cex: Character expansion factor for graphic labels and titles. (Defaults to 1.2. Note that if station names are long, they will not fit in titles when this parameter is increased too much.)
verb: Verbosity. Set to FALSE to avoid messages being output to the console. (They will be written to the output log file anyway.)
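As an illustration of how some of these arguments combine, here are a few hedged example calls (variable acronyms and years are invented; only parameters documented above are used):
homogen('Tmin', 1981, 2020)                      # monthly series, default settings
homogen('HR', 1981, 2020, vmin=0, vmax=100)      # bounded variable (relative humidity)
homogen('Temp', 2019, 2020, tinc='3 hour')       # subdaily data at 3-hourly time steps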
This function does not return any value; its results are saved to a set of output files sharing the base name of the input files:
A text file that logs all the processing output,
List of corrected outliers,
List of corrected breaks,
PDF file with a collection of diagnostic graphics,
Homogenization results in R binary format, used by the dahstat and dahgrid post-processing functions, but which can also be loaded by the user for further analysis with the function load (a loading sketch follows the list of objects below). This file contains the following objects:
matrix or array of original data,
matrix or array of homogenized data,
data frame with columns: X (X coordinate), Y (Y coordinate), Z (elevation), Code (code of the station), Name (name of the station), pod (percentage of original data), ios (index of original station), ope (operating at the end of the period? 0=no, 1=yes), snht (relative SNHT of the homogenized series) and rmse (estimated root mean squared errors of the homogenized series),
number of time steps in every series,
number of input series,
number of series after the homogenization,
number of "months" in a year (0=daily data),
vector of the time dimension,
number of decimals in the data,
type of standardization used (as explained in the details),
initial date of the period under study.
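A minimal sketch of inspecting these objects (the file name is an assumption for illustration; the real binary results file shares the base name of the input files):
loaded <- load('pcp_1991-2010.rda')   # assumed name of the R binary results file
print(loaded)                         # names of the objects listed above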
Input data must be provided in two text files, one with the data (with extension dat) and another with the station coordinates (with extension est). Both share the base name VAR_YEAR-YEAR, composed of the acronym of the climatological variable and the initial and final years of the data, as set in the first three parameters of the call (varcli, anyi and anyf).
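A small sketch of this naming convention (values are illustrative):
varcli <- 'pcp'; anyi <- 1991; anyf <- 2010
base <- paste0(varcli, '_', anyi, '-', anyf)   # 'pcp_1991-2010'
datfile <- paste0(base, '.dat')                # data file
estfile <- paste0(base, '.est')                # stations file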
Data are stored in a free blank-separated format (any number of data items per line is allowed), in chronological order, station by station (all data from station 1 go first, then all data from station 2, and so on). As dates are not stored in this file, all data must be present in the file, using a code for any missing data in the records (NA by default, but any other code can be used, provided that it is specified in the parameter na.strings).
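A minimal sketch of writing such a data file, assuming dat is a numeric matrix with one row per time step and one column per station (values are invented; note that write() outputs an R matrix column by column, i.e. station by station, as required):
dat <- matrix(round(abs(rnorm(240*5, 50, 30)), 1), 240, 5)   # 20 years of monthly data, 5 stations
write(dat, 'pcp_1991-2010.dat')                              # free blank-separated format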
The stations file, with extension est, is also a blank-separated text file where each line identifies a single station, with structure 'X Y Z CODE NAME'. Coordinates X and Y may be in geographical degrees (longitude and latitude, in decimal form); otherwise they will be assumed to be in km, or in m if the mean of either X or Y is greater than 10000. Elevation Z must be supplied in m, and the identification CODE and the full NAME of the station must be quoted if they contain blanks. Fully reliable series may be marked by putting an asterisk (*) at the beginning of their CODE to skip their outlier and break analysis. This is not recommended with observed series, but can be useful when using reanalysis series as references in data-sparse regions.
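A minimal sketch of the corresponding stations file (coordinates in geographical degrees, elevation in m; all values are invented). Note that write.table() quotes character fields by default, as required when codes or names contain blanks:
est <- data.frame(X=c(2.07,2.62), Y=c(39.55,39.77), Z=c(6,52),
  Code=c('st001','st002'), Name=c('Airport','Puerto de Soller'))
write.table(est, 'pcp_1991-2010.est', row.names=FALSE, col.names=FALSE)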
The transformation of the input data may be very useful to normalize highly skewed (L-shaped) distributed variables (as is often the case with precipitation, wind speed, etc.), but use it with caution. Alternatively, you can use the rate normalization (std=2) on the raw data if the variable has a natural zero lower limit. (This alternative has yielded better results than transformation in some applications, provided that no homogeneous sub-period has means much lower than 1. If this is the case, a workaround may be to multiply all data values by a constant prior to their homogenization.)
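Hedged sketches of the two alternatives discussed above (using the file base name of the example at the end of this page):
homogen('pcp', 1991, 2010, std=2)   # rate-to-the-mean normalization of the raw data
homogen('pcp', 1991, 2010, trf=2)   # square-root transformation of a skewed variable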
The default values of dz.max, snht1 and snht2 can be appropriate for monthly values of temperature, but not so much for precipitation or for daily series. Therefore it is advisable to adjust them empirically with the help of the histograms in the graphic output of a first exploratory application, using expl=TRUE in the call to the function.
This graphic output includes:
a) a summary of the data availability and frequency distribution;
b) a correlogram of the first differences of the series;
c) a dendrogram based on these correlations and a map with the station locations (marked with numbers if there are less than 100, and with symbols otherwise);
d) graphics of anomalies showing the detected breaks, the minimum distance to a reference datum and the number of references used;
e) a histogram of maximum SNHT values found in the overlapping window analysis (d and e are repeated for the analysis on the whole series);
f) histograms of the number of splits per station and per year;
g) graphics of the final anomalies of the series;
h) graphics of the reconstructed series and applied corrections;
i) a histogram of the normalized anomalies of all data (which may help to set rejection thresholds for the outliers), plus final histograms of SNHT values;
j) a plot of quality/singularity of the stations (a bad score may be due to a bad quality of the series, but also to a singular siting with a peculiar micro-climate).
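A possible workflow for this empirical adjustment (the threshold values are purely illustrative):
homogen('pcp', 1991, 2010, expl=TRUE)                      # first exploratory application
# inspect the histograms in the PDF output, then re-run with adjusted thresholds, e.g.:
homogen('pcp', 1991, 2010, dz.max=7, snht1=36, snht2=50)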
Note that every time a significant shift in the mean of the series is detected, it will be split into two (potentially) homogeneous sub-periods, and hence the final number of homogenized series will be increased, as complete homogeneous series will be reconstructed from them. When several homogeneous series have been yielded for the same location, the user can choose to use the one reconstructed from the last sub-period (for climate monitoring), from the sub-period with the highest percentage of original data (the most robust reconstruction), from the one with the lowest final SNHT (the most homogeneous sub-period, although short sub-periods tend to appear more homogeneous and this can be misleading), or all of them (e.g., for climatic mapping, when no a priori knowledge can indicate which of the sub-periods will be more representative at the studied spatial scale). Additional columns pod, ios, ope, snht and rmse in the stations data frame can help the user to choose the most appropriate series for each location when using the post-processing functions dahstat and dahgrid (see below).
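A hedged sketch of using these columns to choose a series per location (continuing the load() sketch given after the list of output objects; est.c is an assumed name for the stations data frame):
sel <- est.c[est.c$ope == 1, ]    # keep series still operating at the end of the period
sel <- sel[order(-sel$pod), ]     # favour the highest percentage of original data
head(sel[, c('Code','Name','pod','snht','rmse')])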
This function will stop with an error condition if any time step becomes void of data in all stations at the same time. Ideally the user should repeat the process after adding one or more series with data in the void time steps. Alternatively, setting fillall=FALSE will let the process continue, but the final homogenized series will contain missing data, and the post-processing function dahstat will give errors or results with missing data.
Guijarro JA (2014): Quality Control and Homogenization of Climatological Series. In Eslamian S (Ed.), Handbook of Engineering Hydrology, Vol. 1: Fundamentals and Applications. Francis and Taylor, CRC Group, USA, ISBN 9781466552357, 636 pp.
Guijarro JA (2015): Homogenization of Spanish mean wind speed monthly series. In 8th Seminar for Homogenization and Quality Control in Climatological Databases and Third Conference on Spatial Interpolation Techniques in Climatology and Meteorology, Budapest, 12-16 May 2014 (Lakatos et al., Eds.), WMO Climate Data and Monitoring WCDMP-No. 84, pp. 98-106.
Azorin-Molina C, Guijarro JA, McVicar TR, Vicente-Serrano SM, Chen D, Jerez S, Espírito-Santo F (2016): Trends of daily peak wind gusts in Spain and Portugal, 1961-2014. J. Geophys. Res. Atmos., 121, doi:10.1002/2015JD024485, 20 pp.
## Not run:
library(climatol)  # package providing homogen() and the Ptest example data
#Set a temporary working directory and write the input files:
wd <- tempdir()
wd0 <- setwd(wd)
data(Ptest)
dim(dat) <- c(720,20)  #arrange the test data as 720 monthly values x 20 stations
dat[601:720,5] <- dat[601:720,5]*1.8  #introduce an artificial shift in station 5
write(dat[481:720,1:5],'pcp_1991-2010.dat')  #data file: last 20 years (1991-2010), 5 stations
write.table(est.c[1:5,1:5],'pcp_1991-2010.est',row.names=FALSE,col.names=FALSE)  #stations file
#Now run the example:
homogen('pcp',1991,2010,std=2) #homogenization
#Return to user's working directory:
setwd(wd0)
#Input and output files can be found in directory:
print(wd)
## End(Not run)