dat2bernqua: Observed Data to Empirical Quantiles through Bernstein or Kantorovich Polynomials

Description

The empirical quantile function can be “smoothed” (Hernández-Maldonado and others, 2012, p. 114) through the Kantorovich polynomial (Muñoz-Pérez and Fernández-Palacín, 1987) for the sample order statistics $x_{k:n}$ for a sample of size $n$ by $$\tilde{X}_n(F) = \frac{1}{2}\sum_{k=0}^n (x_{k:n} + x_{(k+1):n}) {n \choose k} F^k (1-F)^{n-k}\mbox{,}$$ where $F$ is nonexceedance probability, and $(n\:k)$ are the binomial coefficients from the R function choose(), and the special situations for $k=0$ and $k=n$ are described within the Note section. The form for the Bernstein polynomial is $$\tilde{X}_n(F) = \sum_{k=0}^{n+1} (x_{k:n}) {n+1 \choose k} F^k (1-F)^{n+1-k}\mbox{.}$$ There are subtle differences between the two and dat2bernqua function supports each. Readers are also directed to the Special Attention section.

Turnbull and Ghosh (2014) consider through the direction of a referee and recommendation of $p=0.05$ by that referee (and credit to ideas by de Carvalho [2012]) that the support of the probability density function for the Turnbull and Ghosh (2014) study of Bernstein polynomials can be computed letting $\alpha = (1 - p)^{-2} - 1$ by $$ \biggl(x_{1:n} - (x_{2:n} - x_{1:n})/\alpha,\: x_{n:n} + (x_{n:n} - x_{n-1:n})/\alpha\biggr)\mbox{,}$$ for the minimum and maximum, respectively. Evidently, the original support considered by Turnbull and Ghosh (2014) was $$ \biggl(x_{1:n} - \lambda_2\sqrt{\pi/n},\: x_{n:n} + \lambda_2\sqrt{\pi/n}\biggr)\mbox{,}$$ for the minimum and maximum, respectively and where the standard deviation is estimated in the function using the 2nd L-moment as $s = \lambda\sqrt{\pi}$.

The $p$ is referred to by this author as the “p-factor” this value has great influence in the estimated support of the distribution and therefore distal-tail estimation or performance is sensitive to the value for $p$. General exploratory analysis suggests that the $p$ can be optimized based on information external or internal to the data for shape restrained smoothing. For example, an analyst might have external information as to the expected L-skew of the phenomenon being studied or could use the sample L-skew of the data (internal information) for shape restraint (see pfactor.bernstein).

An alternative formula for smoothing is by Cheng (1995) and is $$\tilde{X}^{\mathrm{Cheng}}_n(F) = \sum_{k=1}^n x_{k:n}\:{n - 1 \choose k-1}\: F^{k-1}(1-F)^{n-k}\mbox{.}$$

Usage

dat2bernqua(f, x, bern.control=NULL,
                  poly.type=c("Bernstein", "Kantorovich", "Cheng", "Parzen",
                              "bernstein", "kantorovich", "cheng", "parzen"),
                  bound.type=c("none", "sd", "Carv", "either", "carv"),
                  fix.lower=NULL, fix.upper=NULL, p=0.05, listem=FALSE)

Value

An R

vector is returned unless the Parzen weighting method is used and in that case an R

list is returned with elements f and x, which respectively are the $F$ values as shown in the formula and the $\tilde{X}^{\mathrm{Parzen}}_n(F)$.

Arguments

f: A vector of nonexceedance probabilities $F$.
x: A vector of data values.
bern.control: A list that holds poly.type, bound.type, fix.lower, and fix.upper. And this list will supersede the respective values provided as separate arguments.
poly.type: The Bernstein or Kantorovich polynomial will be used. The two are quite closely related. Or the formula by Cheng (1995) will be used and bound.type, fix.lower, fix.upper, and p are not applicable. Or the formula credited by Nair et al. (2013, p. 17) to Parzen (1979) will be used.
bound.type: Triggers to the not involve alternative supports ("none") then the minimum and maximum are used unless already provided by the fix.lower or fix.upper, the support based "sd" on the standard deviation, the support "Carv" based on the arguments of de Carvalho (2012), or "either" method.
fix.lower: For $k = 0$, either the known lower bounds is used if provided as non NULL or the observed minimum of the data. If the minimum of the data is less than the fix.lower, a warning is triggered and fix.lower is set to the minimum. Following Turnbull and Ghosh (2014) to avoid bounds that are extremely lower than the data, it will use the estimated lower bounds by the method "sd", "Carv", or "either" if these bounds are larger than the provided fix.lower.
fix.upper: For $k = n$, either the known upper bounds is used if provided as non NULL or the observed maximum of the data; If the maximum of the data is less than the fix.upper, a warning is triggered and fix.upper is set to the maximum.
p: A small probability value to serve as the $p$ in the "Carv" support computation. The default is recommended as mentioned above. The program will return NA if $10^{-6} < p \ge (1-10^{-6})$ is not met. The value p is the “p-factor” $p$.
listem: A logical controlling whether (1) a vector of $\tilde{X}_n(F)$ is returned or (2) a list containing $\tilde{X}_n(F)$, the f, original sample size $n$ of the data, the de Carvalho probability p (whether actually used internally or not), and both fix.lower and fix.upper as computed within the function or provided (less likely) by the function arguments.

Special Attention

The limiting properties of the Bernstein and Kantorovich polynomials differ. The Kantorovich polynomial uses the average of the largest (smallest) value and the respective outer order statistics ($x_{n+1:n}$ or $x_{0:n}$) unlike the Bernstein polynomial whose $F = 0$ or $F = 1$ are purely a function of the outer order statistics. Thus, the Bernstein polynomial can attain the fix.lower and(or) fix.upper whereas the Kantorovich fundamentally can not. For a final comment, the function dat2bernquaf is an inverse of dat2bernqua.

Implentation Note

The function makes use of R functions lchoose and exp and logarithmic expressions, such as $(1-F)^{(n-k)} \rightarrow (n-k)\log(1-F)$, for numerical stability for large sample sizes.

Author

W.H. Asquith

Details

Yet another alternative formula for smoothing if by Parzen (1979) and known as the “Parzen weighting method” is $$\tilde{X}^{\mathrm{Parzen}}_n(F) = n\left(\frac{r}{n} - F\right)x_{r-1:n} + n\left(F - \frac{r-1}{n}\right)x_{r:n}\mbox{,}$$ where $(r-1)/n \le F \le (r/n)$ for $r = 1, 2, \ldots, n$ and $x_{0:n}$ is taken as either the minimum of the data ($\mathrm{min}(x)$) or the lower bounds fix.lower as externally set by the user. For protection, the minimum of $(\mathrm{min}(x),$ fix.lower$)$ is formally used. If the Parzen method is used, the only arguments considered are poly.type and fix.lower; all others are ignored including the f (see Value section). The user does not actually have to provide f in the arguments but a place holder such as f=NULL is required; internally the Parzen method takes over full control. The Parzen method in general is not smooth and not recommended like the others that rely on a polynomial basis function. Further the Parzen method has implicit asymmetry in the estimated $F$. The method has $F=0$ and $F < 1$ on output, but if the data are reversed, then the method has $F > 0$ and $F=1$. Data reversal is made in -X as this example illustrates:


X <- sort(rexp(30))
P <- dat2bernqua(f=NULL,  X, poly.type="Parzen")
R <- dat2bernqua(f=NULL, -X, poly.type="Parzen")
plot(pp(X, a=0.5), X, xlim=c(0, 1)) # Hazen plotting position to
lines(  P$f,  P$x, col="red" )      # basically split the horizontal
lines(1-R$f, -R$x, col="blue")      # differences between blue and red.

References

Cheng, C., 1995, The Bernstein polynomial estimator of a smooth quantile function: Statistics and Probability Letters, v. 24, pp. 321--330.

de Carvalho, M., 2012, A generalization of the Solis-Wets method: Journal of Statistical Planning and Inference, v. 142, no. 3, pp. 633--644.

Hernández-Maldonado, V., Díaz-Viera, M., and Erdely, A., 2012, A joint stochastic simulation method using the Bernstein copula as a flexible tool for modeling nonlinear dependence structures between petrophysical properties: Journal of Petroleum Science and Engineering, v. 90--91, pp. 112--123.

Muñoz-Pérez, J., and Fernández-Palacín, A., 1987, Estimating the quantile function by Bernstein polynomials: Computational Statistics and Data Analysis, v. 5, pp. 391--397.

Nair, N.U., Sankaran, P.G., and Balakrishnan, N., 2013, Quantile-based reliability analysis: Springer, New York.

Turnbull, B.C., and Ghosh, S.K., 2014, Unimodal density estimation using Bernstein polynomials: Computational Statistics and Data Analysis, v. 72, pp. 13--29.

Parzen, E., 1979, Nonparametric statistical data modeling: Journal American Statistical Association, v. 75, pp. 105--122.