Let \(x_1, x_2, \ldots, x_n\) denote a random sample of \(n\) observations
from some unknown probability distribution (i.e., the elements of the argument
obs), and let \(x_{(i)}\) denote the \(i^{th}\) order statistic, that is,
the \(i^{th}\) smallest observation, for \(i = 1, 2, \ldots, n\).
Estimating Density
The function demp computes the empirical probability density function. If the
observations are assumed to come from a discrete distribution, the probability
density (mass) function is estimated by:
$$\hat{f}(x) = \widehat{Pr}(X = x) = \frac{\sum^n_{i=1} I_{[x]}(x_i)}{n}$$
where \(I\) is the indicator function:
$$I_{[x]}(y) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{if } y \ne x \end{cases}$$
That is, the estimated probability of observing the value \(x\) is simply the
observed proportion of observations equal to \(x\).
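The proportion formula above can be sketched in a few lines of base R. This is only an illustration of the estimator, not the package's implementation; the helper name emp_pmf is hypothetical.

```r
# Hypothetical helper (not demp itself): empirical pmf evaluated at values x.
emp_pmf <- function(x, obs) {
  # For each value in x, the proportion of observations exactly equal to it
  vapply(x, function(v) mean(obs == v), numeric(1))
}

obs <- c(1, 2, 2, 3, 3, 3)
pmf <- emp_pmf(c(0, 1, 2, 3), obs)
print(pmf)
```

Values that never occur in obs get an estimated probability of 0, and the estimates over the distinct observed values sum to 1.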
If the observations are assumed to come from a continuous distribution, the
function demp calls the R function density to compute the estimated density
based on the values specified in the argument obs, and then uses linear
interpolation to estimate the density at the values specified in the argument
x. See the R help file for density for more information on how the empirical
density is computed in the continuous case.
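The two steps described above (kernel density estimate, then linear interpolation) might look roughly like the following in base R. The helper name emp_density is hypothetical, and the handling of points outside the density grid (NA here, the default of approx) is an assumption rather than documented behavior of demp.

```r
# Hypothetical helper: kernel density estimate of obs, interpolated at x.
emp_density <- function(x, obs, ...) {
  d <- stats::density(obs, ...)        # density estimate on an equally spaced grid
  stats::approx(d$x, d$y, xout = x)$y  # linear interpolation; NA outside the grid
}

set.seed(1)
obs <- rnorm(100)
dens <- emp_density(c(-1, 0, 1), obs)
print(dens)
```

Extra arguments passed through `...` (for example bw or kernel) go directly to density.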
Estimating Probabilities
The function pemp computes the estimated cumulative distribution function
(cdf), also called the empirical cdf (ecdf). If the observations are assumed to
come from a discrete distribution, the value of the cdf evaluated at the \(i^{th}\)
order statistic is usually estimated by:
$$\hat{F}[x_{(i)}] = \widehat{Pr}(X \le x_{(i)}) = \hat{p}_i =
\frac{\sum^n_{j=1} I_{(-\infty, x_{(i)}]}(x_j)}{n}$$
where:
$$I_{(-\infty, x]}(y) = \begin{cases} 1 & \text{if } y \le x, \\ 0 & \text{if } y > x \end{cases}$$
(D'Agostino, 1986a). That is, the estimated value of the cdf at the \(i^{th}\)
order statistic is simply the observed proportion of observations less than or
equal to the \(i^{th}\) order statistic. This estimator is sometimes called the
“empirical probabilities” estimator and is intuitively appealing.
The function pemp uses the above equations to compute the empirical cdf when
prob.method="emp.probs".
For any general value of \(x\), when the observations are assumed to come from a
discrete distribution, the value of the cdf is estimated by:
$$\hat{F}(x) = \begin{cases} 0 & \text{if } x < x_{(1)}, \\ \hat{p}_i & \text{if } x_{(i)} \le x < x_{(i+1)}, \\ 1 & \text{if } x \ge x_{(n)} \end{cases}$$
The function pemp uses the above equation when discrete=TRUE.
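As an illustration of the step-function estimator for the discrete case, the sketch below computes the proportion of observations less than or equal to each requested value. The helper name emp_cdf_discrete is hypothetical; the comparison with the base R function ecdf is included only as a sanity check.

```r
# Hypothetical helper: empirical cdf (step function) evaluated at values x.
emp_cdf_discrete <- function(x, obs) {
  # Proportion of observations less than or equal to each value in x
  vapply(x, function(v) mean(obs <= v), numeric(1))
}

obs <- c(1, 2, 2, 3, 3, 3)
x <- c(0, 1, 2.5, 3, 4)
cdf <- emp_cdf_discrete(x, obs)
print(cdf)

# Base R's ecdf() builds the same step function
print(stats::ecdf(obs)(x))
```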
If the observations are assumed to come from a continuous distribution, the value
of the cdf evaluated at the \(i^{th}\) order statistic is usually estimated by:
$$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{n - 2a + 1}$$
where \(a\) denotes the plotting position constant and \(0 \le a \le 1\)
(Cleveland, 1993, p.18; D'Agostino, 1986a, pp.8,25). The estimators defined by
the above equation are called plotting positions and are used to construct
probability plots. The function pemp uses the above equation when
prob.method="plot.pos".
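The plotting-position formula is straightforward to evaluate directly. In the sketch below the helper name plot_pos is hypothetical, and the default a = 0.375 (Blom's constant) is chosen for illustration only. The base R function ppoints uses the same formula, which gives a convenient cross-check.

```r
# Hypothetical helper: plotting positions (i - a) / (n - 2a + 1) for i = 1..n.
plot_pos <- function(n, a = 0.375) {
  stopifnot(a >= 0, a <= 1)
  i <- seq_len(n)
  (i - a) / (n - 2 * a + 1)
}

pp <- plot_pos(5, a = 0.5)
print(pp)

# Same values from base R
print(stats::ppoints(5, a = 0.5))
```

With a = 0 the positions reduce to \(i / (n + 1)\), and with a = 0.5 to \((i - 0.5) / n\).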
For any general value of \(x\), the value of the cdf is estimated by linear
interpolation:
$$\hat{F}(x) = \begin{cases} \hat{p}_1 & \text{if } x < x_{(1)}, \\ (1 - r)\hat{p}_i + r\hat{p}_{i+1} & \text{if } x_{(i)} \le x < x_{(i+1)}, \\ \hat{p}_n & \text{if } x \ge x_{(n)} \end{cases}$$
where
$$r = \frac{x - x_{(i)}}{x_{(i+1)} - x_{(i)}}$$
(Chambers et al., 1983). The function pemp uses the above two equations when
discrete=FALSE.
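A minimal sketch of the interpolated cdf, assuming the observations contain no ties: compute plotting positions at the sorted observations, then interpolate linearly, holding the estimate constant at \(\hat{p}_1\) and \(\hat{p}_n\) outside the data range. The helper name emp_cdf_cont and its default for a are hypothetical.

```r
# Hypothetical helper: continuous-case empirical cdf by linear interpolation.
# Assumes the values in obs are distinct.
emp_cdf_cont <- function(x, obs, a = 0.375) {
  xs <- sort(obs)                           # order statistics
  n <- length(xs)
  p_hat <- (seq_len(n) - a) / (n - 2 * a + 1)  # plotting positions
  # rule = 2 returns p_hat[1] below x_(1) and p_hat[n] above x_(n)
  stats::approx(xs, p_hat, xout = x, rule = 2)$y
}

obs <- c(2, 5, 3, 9, 7)
cdf <- emp_cdf_cont(c(1, 4, 10), obs, a = 0)
print(cdf)
```

Here x = 4 lies halfway between the order statistics 3 and 5, so \(r = 0.5\) and the estimate is the average of \(2/6\) and \(3/6\).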
Estimating Quantiles
The function qemp computes the estimated quantiles based on the observed
data. If the observations are assumed to come from a discrete distribution, the
\(p^{th}\) quantile is usually estimated by:
$$\hat{x}_p = \begin{cases} x_{(1)} & \text{if } p \le \hat{p}_1, \\ x_{(i)} & \text{if } \hat{p}_{i-1} < p \le \hat{p}_i, \\ x_{(n)} & \text{if } p > \hat{p}_n \end{cases}$$
The function qemp uses the above equation when discrete=TRUE.
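The discrete quantile rule amounts to picking the smallest order statistic whose estimated cdf value is at least \(p\). The sketch below illustrates this under the assumption that the \(\hat{p}_i\) are the empirical probabilities; the helper name emp_quantile_discrete is hypothetical.

```r
# Hypothetical helper: discrete-case empirical quantiles.
emp_quantile_discrete <- function(p, obs) {
  xs <- sort(obs)
  n <- length(xs)
  # Empirical probabilities at each order statistic
  p_hat <- vapply(xs, function(v) mean(obs <= v), numeric(1))
  vapply(p, function(q) {
    idx <- which(p_hat >= q)                 # order statistics with p_hat >= p
    if (length(idx) == 0) xs[n] else xs[idx[1]]
  }, numeric(1))
}

obs <- c(1, 2, 2, 3, 3, 3)
qs <- emp_quantile_discrete(c(0.1, 0.5, 0.9), obs)
print(qs)
```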
If the observations are assumed to come from a continuous distribution, the
\(p^{th}\) quantile is usually estimated by linear interpolation:
$$\hat{x}_p = \begin{cases} x_{(1)} & \text{if } p \le \hat{p}_1, \\ (1 - r)x_{(i-1)} + rx_{(i)} & \text{if } \hat{p}_{i-1} < p \le \hat{p}_i, \\ x_{(n)} & \text{if } p > \hat{p}_n \end{cases}$$
where
$$r = \frac{p - \hat{p}_{i-1}}{\hat{p}_i - \hat{p}_{i-1}}$$
The function qemp uses the above two equations when discrete=FALSE.
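The interpolated quantile is the inverse of the interpolated cdf, so one way to sketch it is to swap the roles of the order statistics and the \(\hat{p}_i\) in a call to approx. The helper name emp_quantile_cont is hypothetical, and plotting positions with an illustrative default for a are assumed for the \(\hat{p}_i\).

```r
# Hypothetical helper: continuous-case empirical quantiles by interpolation.
emp_quantile_cont <- function(p, obs, a = 0.375) {
  xs <- sort(obs)                           # order statistics
  n <- length(xs)
  p_hat <- (seq_len(n) - a) / (n - 2 * a + 1)  # plotting positions
  # rule = 2 returns x_(1) for p below p_hat[1] and x_(n) for p above p_hat[n]
  stats::approx(p_hat, xs, xout = p, rule = 2)$y
}

obs <- c(2, 5, 3, 9, 7)
qs <- emp_quantile_cont(c(0.05, 0.25, 0.5, 0.95), obs, a = 0)
print(qs)
```

With a = 0 the plotting positions are \(1/6, \ldots, 5/6\), so \(p = 0.25\) falls halfway between \(\hat{p}_1\) and \(\hat{p}_2\) and the estimate is halfway between the two smallest observations.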
Generating Random Numbers From the Empirical Distribution
The function remp simply calls the R function sample to sample the elements of
obs with replacement.
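Sampling with replacement from the observations can be reproduced directly with sample; the sketch below shows the idea and is not the body of remp itself.

```r
set.seed(42)
obs <- c(1.2, 3.4, 5.6, 7.8)

# Draw 10 values from the empirical distribution of obs
draws <- sample(obs, size = 10, replace = TRUE)
print(draws)
```

One caveat when calling sample directly: if its first argument is a single number greater than or equal to 1, sample draws from 1 up to that number rather than from the one-element vector.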