homogsplit: Apply homogen() on overlapping rectangular areas.

Description

If the number of series is too big to be homogenized all at a time (normally several thousands, depending on the computer resources), this function can homogenize them by splitting the geographical domain in overlapping rectangular areas.

Usage

homogsplit(varcli, anyi, anyf, xc=NULL, yc=NULL, xo=.5, yo=.38, maponly=FALSE,
suf=NA, nm=NA, nref=c(10,10,4), swa=NA, std=3, ndec=1, dz.max=5,
dz.min=-dz.max, wd=c(0,0,100), snht1=25, snht2=snht1, tol=.02, maxdif=NA,
mxdif=maxdif, force=FALSE, wz=.001, trf=0, mndat=NA, gp=3, ini=NA,
na.strings="NA", maxite=999, vmin=NA, vmax=NA, nclust=100, grdcol=grey(.4),
mapcol=grey(.4), hires=TRUE, expl=FALSE, metad=FALSE, sufbrk='m', tinc=NA,
tz='UTC', cex=1.2, verb=TRUE, x=NA)

Arguments

varcli

Acronym of the name of the studied climatic variable, as in the data file name.

anyi

Initial year of the data present in the file.

anyf

Final year of the data present in the file.

Vector of X axis coordinates setting the domain splitting meridians.

Vector of Y axis coordinates setting the domain splitting parallels.

Overlapping width in the East-West direction.

Overlapping width in the North-South direction.

maponly

Do not homogenize. Only draw a map with stations locations and domain partitioning.

suf

Optional suffix appended with a '-' to the name of the variable in the input files.

Number of data per year in each station. (Defaults to NA, and then it will be computed from the total number of data).

nref

Maximum number of references for data estimation. (Defaults to 10 in the detection stages, and to 4 in the final series adjustments).

swa

Size of the step forward to be applied to the staggered window application of SNHT. If not set (the default), 365 terms (one year) will be used for daily data, and 60 otherwise.

std

Type of normalization:

1:: deviations from the mean,
2:: rates to the mean (only for means greater than 1),
3:: standardization (subtract the mean and divide by the sample standard deviation).

ndec

Number of decimal digits to which the homogenized data must be rounded.

dz.max

Threshold of outlier tolerance, in standard deviations. (5 by default).

dz.min

Lower threshold of outlier tolerance if different from the higher one. (By default, they will be the same, with opposite signs).

Distance (in km) at which reference data will weigh half of that of another located at zero distance. (Defaults to c(0,0,100), meaning that no weighting will be applied in the first two stages, and 100 km in the third).

snht1

Threshold value for the stepped SNHT window test applied in stage 1. (25 by default. No SNHT analysis will be performed if snht1=0).

snht2

Threshold value for the SNHT test when applied to the complete series in stage 2 (same value as snht1 by default).

tol

Tolerance factor to split several series at a time. The default is 0.02, meaning that a 2% will be allowed for every reference data. (E.g.: if the maximum SNHT test value in a series is 30 and 10 references were used to compute the anomalies, the series will be split if the maximum test of the reference series is lower than 30*(1+0.02*10)=36. Set tol=0 to disable further splits when any reference series has already been split at the same iteration).

maxdif

Maximum difference of any data item in consecutive iterations. If not set, defaults to half of the data precision (defined by the number of decimals).

mxdif

Old maxdif parameter (maintained for compatilibility).

force

Force break even when only one reference is available. (FALSE by default).

Scale parameter of the vertical coordinate Z. 0.001 by default, which gives the vertical coordinate (in m) the same weight as the horizontal coordinates (internally managed in km).

trf

By default, data are not transformed (trf=0), but if the data frequency distribution is very skewed, the user can choose to apply a log(x+1) transformation (trf=1) or any root of index trf>1 (2 for square root, 3 for cubic root, etc. Fractional numbers are allowed).

mndat

Minimum number of data for a split fragment to become a new series. It defaults to half of the swa value for daily data, or to nm otherwise, with a minimum of 5 terms.

Graphic parameter:

0:: no graphic output,
1:: only descriptive graphics of the input data,
2:: as with 1, plus diagnostic graphics of anomalies,
3:: as with 2, plus graphics of running annual means and applied corrections,
4:: as with 3, but running annual totals (instead of means) will be plotted in the last set of graphics. (Better when working with precipitation data).

ini

Initial date, with format 'YYYY-MM-DD'. If not set, it will be assumed that the series begin the first of January of the initial year anyi.

na.strings

Character string to be treated as a missing value. (It can be a vector of strings, if more than one is needed). Defaults to 'NA', the standard missing data code in R.

maxite

Maximum number of iterations when computing the means of the series. (999 by default).

vmin

Minimum possible value (lower limit) of the studied variable. Unset by default, but note that vmin=0 will be applied if std is set to 2.

vmax

Maximum possible value (upper limit) of the studied variable. (E.g., for relative humidity or relative sunshine hours it is advisable to set vmax=100).

nclust

Maximum number of stations for the cluster analysis. (If much greater than 100, the default value, the process may be too long and the graphic too dense).

grdcol

Color of the graphic background grids. (Gray by default.)

mapcol

Color of the background map. (Gray by default).

hires

By default, the background map will be drawn in high resolution. Set this parameter to FALSE if you are studying a big geographical area (>1000 km).

expl

Set this to TRUE to perform an exploratory analysis.

metad

Set this to TRUE if a metadata file is provided (see the details).

sufbrk

Suffix to add to varcli to form the name of the provided metadata file. This parameter is only relevant when metad=TRUE. Its default value 'm' is meant to read the file of break-points detected at the monthly scale.

tinc

Time increment between data. Not set by default, but can be defined for subdaily data, as in e.g.: tinc='3 hour'.

Time zone. Only relevant for subdaily data. ('UTC' by default.)

cex

Character expansion factor for graphic labels and titles. (Defaults to 1.2. Note that if station names are long, they will not fit in titles when increasing this parameter too much.)

verb

Verbosity. Set to FALSE to avoid messages being output to the console. (They will be in the output log file anyway).

Vector of dates. (To be read from the *.rda file.)

Value

This function does not return any value.

Details

First of all take into account that this is an experimental function, and will fail if there are time steps completely void of data in any sub-area.

If you have not decided the splitting meridians and parallels, do not set them, and the function will provide a map to help in selecting the areas.

If you set the xc and yc splitting borders, setting maponly=TRUE will also produce a map with the stations, plus the required overlapping areas, without doing any homogenization. In this way you can review the limits of the areas to choose new ones if you are not happy with the current partitioning.

All parameters except xc, yc, xo, yo and maponly are the same as in the homogen function, and will be passed to it to perform the homogenization.

If a rectangular area include less than 10 stations, these will be added to the next area. Warning! If it is the last area, they will not be processed, and the homogenization results will have inconsistent number of stations. In this case the user should try a new set of cutting limits.

One graphic output file will be produce for every area containing stations, but the rest of the output will be merged into single files.

Examples

Run this code

# NOT RUN {
#Set a temporal working directory and write input files:
wd <- tempdir()
wd0 <- setwd(wd)
data(Ptest)
dim(dat) <- c(720,20)
dat[601:720,5] <- dat[601:720,5]*1.8
write(dat[481:720,1:12],'pcp_1991-2010.dat')
write.table(est.c[1:12,1:5],'pcp_1991-2010.est',row.names=FALSE,col.names=FALSE)
#Now run the example:
homogsplit('pcp',1991,2010,2.9,39.6,0,0,std=2)
#Return to user's working directory:
setwd(wd0)
#Input and output files can be found in directory:
print(wd)
#
# Note that this is just a trivial example; this function is intended to
# homogenize a network of thousands of long series which might overload
# computer resources, and is still experimental.
# }

Run the code above in your browser using DataLab