distLquantile: distribution quantiles

Description

Parametric quantiles of distributions fitted to a sample.

Usage

distLquantile(
  x = NULL,
  probs = c(0.8, 0.9, 0.99),
  truncate = 0,
  threshold = quantileMean(dlf$dat_full[is.finite(dlf$dat_full)], truncate),
  sanerange = NA,
  sanevals = NA,
  selection = NULL,
  order = TRUE,
  dlf = NULL,
  datname = deparse(substitute(x)),
  list = FALSE,
  empirical = TRUE,
  qemp.type = 8,
  weighted = empirical,
  gpd = empirical,
  speed = TRUE,
  quiet = FALSE,
  ssquiet = quiet,
  ttquiet = quiet,
  gpquiet = missing(quiet) | quiet,
  ...
)

Value

if list=FALSE (default): invisible matrix with distribution quantile values . if list=TRUE: invisible dlf object, see printL

Arguments

x: Sample for which parametric quantiles are to be calculated. If it is NULL (the default), dat from dlf is used. DEFAULT: NULL
probs: Numeric vector of probabilities with values in [0,1]. DEFAULT: c(0.8,0.9,0.99)
truncate: Number between 0 and 1 (proportion of sample discarded). Censored quantile: fit to highest values only (truncate lower proportion of x). Probabilities are adjusted accordingly. DEFAULT: 0
threshold: POT cutoff value. If you want correct percentiles, set this only via truncate, see Details of q_gpd. DEFAULT: quantileMean(x, truncate)
sanerange: Range outside of which results should be changed to sanevals. This can capture numerical errors in small samples (notably GPD_MLE_extRemes). If NA, this is ignored. Attention: the RMSE column is also checked and changed. DEFAULT: NA
sanevals: Values to be used below [1] and above [2] sanerange. DEFAULT: NA
selection: Distribution type, eg. "gev" or "wak", see lmomco::dist.list. Can be a vector. If NULL (the default), all types present in dlf$distnames are used. DEFAULT: NULL
order: Logical: sort by RMSE, even if selection is given? See distLweights. DEFAULT: TRUE
dlf: dlf object described in extremeStat. Use this to save computing time for large datasets where you already have dlf. DEFAULT: NULL
datname: Character string: data name, important if list=TRUE. DEFAULT: deparse(substitute(x))
list: Return full dlflist with output attached as element quant? If FALSE (the default), just the matrix with quantile estimates is returned. DEFAULT: FALSE
empirical: Add rows "empirical" and "quantileMean" in the output matrix? Uses quantile with qemp.type (ignoring truncation) and quantileMean. DEFAULT: TRUE
qemp.type: Method passed to quantile for row "empirical". Only used if empirical=TRUE. DEFAULT: 8 (NOT the stats::quantile default)
weighted: Include weighted averages across distribution functions to the output? DEFAULT: empirical, so additional options can all be excluded with emp=F.
gpd: Include GPD quantile estimation via q_gpd? Note that the 'GPD_LMO_lmomco' result differs slightly from 'gpa', especially if truncate=0. This comes from using x>threshold (all 'GPD_*' distributions) or x>=threshold ('gpa' and all other distributions in extremeStat). DEFAULT: empirical
speed: Compute q_gpd only for fast methods? Currently, only the Bayesian method is excluded. DEFAULT: TRUE
quiet: Suppress notes? If it is actually set to FALSE (not missing), gpquiet is set to FALSE to print all the warnings including stacks. DEFAULT: FALSE
ssquiet: Suppress sample size notes? DEFAULT: quiet
ttquiet: Suppress truncation!=threshold note? DEFAULT: quiet
gpquiet: Suppress warnings in q_gpd? DEFAULT: TRUE if quiet is not specified, else quiet
...: Arguments passed to distLfit and distLweights like weightc, ks=TRUE

Author

Berry Boessenkool, berry-b@gmx.de, March + July 2015, Feb 2016

Details

Very high quantiles (99% and higher) need large sample sizes for quantile to yield a robust estimate. Theoretically, at least 1/(1-probs) values must be present, e.g. 10'000 for Q99.99%. With smaller sample sizes (eg n=35), they underestimate the actual (but unknown) quantile. Parametric quantiles need only small sample sizes. They don't have a systematical underestimation bias, but have higher variability.

References

On GPD: https://stats.stackexchange.com/questions/69438

Examples

Run this code


data(annMax) # Annual Discharge Maxima (streamflow)

distLquantile(annMax, emp=FALSE)[,] # several distribution functions in lmomco

if (FALSE) {
## Taken out from CRAN package check because it's slow
distLquantile(annMax, truncate=0.8, probs=0.95)[,] # POT (annMax already block maxima)
dlf <- distLquantile(annMax, probs=0.95, list=TRUE)
plotLquantile(dlf, linargs=list(lwd=3), nbest=5, breaks=10)
dlf$quant
# Parametric 95% quantile estimates range from 92 to 111!
# But the best fitting distributions all lie aroud 103.

# compare General Pareto Fitting methods
# Theoretically, the tails of distributions converge to GPD (General Pareto)
# q_gpd compares several R packages for fitting and quantile estimation:
dlq <- distLquantile(annMax, weighted=FALSE, quiet=TRUE, probs=0.97, list=TRUE)
dlq$quant
plotLquantile(dlq) # per default best fitting distribution functions
plotLquantile(dlq, row=c("wak","GPD*"), nbest=14)
#pdf("dummy.pdf", width=9)
plotLquantile(dlq, row="GPD*", nbest=13, xlim=c(102,110),
          linargs=list(lwd=3), heights=seq(0.02, 0.005, len=14))
#dev.off()


# Sanity checks: important for very small samples:
x1 <- c(2.6, 2.5, 2.9, 3, 5, 2.7, 2.7, 5.7, 2.8, 3.1, 3.6, 2.6, 5.8, 5.6, 5.7, 5.3)
q1 <- distLquantile(x1, sanerange=c(0,500), sanevals=c(NA,500))
x2 <- c(6.1, 2.4, 4.1, 2.4, 6, 6.3, 2.9, 6.8, 3.5)
q2 <- distLquantile(x2, sanerange=c(0,500), sanevals=c(NA,500), quiet=FALSE)
x3 <- c(4.4, 3, 1.8, 7.3, 2.1, 2.1, 1.8, 1.8)
q3 <- distLquantile(x3, sanerange=c(0,500), sanevals=c(NA,500))

# weighted distribution quantiles are calculated by different weighting schemes:
plotLweights(dlf)

# If speed is important and parameters are already available, pass them via dlf:
distLquantile(dlf=dlf, probs=0:5/5, selection=c("wak","gev","kap"))
distLquantile(dlf=dlf, truncate=0.3, list=TRUE)$truncate

# censored (truncated, trimmed) quantile, Peak Over Treshold (POT) method:
qwak <- distLquantile(annMax, sel="wak", prob=0.95, emp=FALSE, list=TRUE)
plotLquantile(qwak, ylim=c(0,0.06) ); qwak$quant
qwak2 <-distLquantile(annMax, sel="wak", prob=0.95, emp=FALSE, list=TRUE, truncate=0.6)
plotLquantile(qwak2, add=TRUE, distcols="blue")


# Simulation of truncation effect
library(lmomco)
#set.seed(42)
rnum <- rlmomco(n=1e3, para=dlf$parameter$gev)
myprobs <- c(0.9, 0.95, 0.99, 0.999)
mytrunc <- seq(0, 0.9, length.out=20)
trunceffect <- sapply(mytrunc, function(mt) distLquantile(rnum, selection="gev",
                             probs=myprobs, truncate=mt, quiet=TRUE,
                             pempirical=FALSE)["gev",])
# If more values are truncated, the function runs faster

op <- par(mfrow=c(2,1), mar=c(2,4.5,2,0.5), cex.main=1)
dlf1 <- distLquantile(rnum, sel="gev", probs=myprobs, emp=FALSE, list=TRUE)
dlf2 <- distLquantile(rnum, sel="gev", probs=myprobs, emp=FALSE, list=TRUE, truncate=0.3)
plotLquantile(dlf1, ylab="", xlab="")
plotLquantile(dlf2, add=TRUE, distcols=4)
legend("right", c("fitted GEV", "fitted with truncate=0.3"), lty=1, col=c(2,4), bg="white")
par(mar=c(3,4.5,3,0.5))
plot(mytrunc, trunceffect[1,], ylim=range(trunceffect), las=1, type="l",
     main=c("High quantiles of 1000 random numbers from gev distribution",
           "Estimation based on proportion of lower values truncated"),
     xlab="", ylab="parametric quantile")
title(xlab="Proportion censored", mgp=c(1.8,1,0))
for(i in 2:4) lines(mytrunc, trunceffect[i,])
library("berryFunctions")
textField(rep(0.5,4), trunceffect[,11], paste0("Q",myprobs*100,"%") )
par(op)

trunc <- seq(0,0.1,len=200)
dd <- pbsapply(trunc, function(t) distLquantile(annMax,
          selection="gpa", weight=FALSE, truncate=t, prob=0.99, quiet=T)[c(1,3),])
 plot(trunc, dd[1,], type="o", las=1)
lines(trunc, dd[2,], type="o", col=2)


set.seed(3); rnum <- rlmomco(n=1e3, para=dlf$parameter$gpa)
qd99 <- evir::quant(rnum, p=0.99, start=15, end=1000, ci=0.5, models=30)
axis(3, at=seq(-1000,0, length=6), labels=0:5/5, pos=par("usr")[3])
title(xlab="Proportion truncated", line=-3)
mytrunc <- seq(0, 0.9, length.out=30)
trunceffect <- sapply(mytrunc, function(mt) distLquantile(rnum, selection="gpa",
                      probs=0.99, truncate=mt, plot=FALSE, quiet=TRUE,
                      empirical=FALSE, gpd=TRUE))
lines(-1000*(1-mytrunc), trunceffect[1,], col=4)
lines(-1000*(1-mytrunc), trunceffect[2,], col=3) # interesting...
for(i in 3:13) lines(-1000*(1-mytrunc), trunceffect[i,], col=3) # interesting...

# If you want the estimates only for one single truncation, use
q_gpd(rnum, probs=myprobs, truncate=0.5)

} # end dontrun

Run the code above in your browser using DataLab