enormCensored(x, censored, method = "mle", censoring.side = "left",
ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided",
conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z",
nmc = 1000, seed = NULL, ...)
ARGUMENTS

x
Numeric vector of observations. Missing (NA), undefined (NaN), and
infinite (Inf, -Inf) values are allowed but will be removed.

censored
Numeric or logical vector indicating which values of x are censored.
This must be the same length as x. If the mode of censored is
"logical", TRUE values correspond to elements of x that are censored,
and FALSE values correspond to elements of x that are not censored.

method
Character string specifying the method of estimation. For singly censored data,
the possible values are "mle" (maximum likelihood; the default),
"bcmle" (bias-corrected maximum likelihood),
"qq.reg" (quantile-quantile regression),
"qq.reg.w.cen.level" (quantile-quantile regression including the censoring level),
"impute.w.qq.reg" (imputation using quantile-quantile regression),
"impute.w.qq.reg.w.cen.level" (imputation using quantile-quantile regression
including the censoring level),
"impute.w.mle" (imputation using maximum likelihood),
"iterative.impute.w.qq.reg" (iterative imputation using quantile-quantile regression),
"m.est" (robust M-estimation), and
"half.cen.level" (setting censored observations to half the censoring level).
For multiply censored data, only "mle", "qq.reg", "impute.w.qq.reg", and
"half.cen.level" are available. See the DETAILS section.

censoring.side
Character string indicating on which side the censoring occurs. The possible
values are "left" (the default) and "right".

ci
Logical scalar indicating whether to compute a confidence interval for the mean.
The default value is ci=FALSE.

ci.method
Character string indicating what method to use to construct the confidence
interval for the mean. The possible values are
"profile.likelihood" (profile likelihood; the default),
"normal.approx" (normal approximation),
"normal.approx.w.cov" (normal approximation taking into account the covariance
between the estimated mean and standard deviation; available only for singly
censored data),
"gpq" (generalized pivotal quantity), and
"bootstrap" (based on bootstrapping). See the DETAILS section.
This argument is ignored if ci=FALSE.

ci.type
Character string indicating what kind of confidence interval to compute. The
possible values are "two-sided" (the default), "lower", and "upper".
This argument is ignored if ci=FALSE.

conf.level
A scalar between 0 and 1 indicating the confidence level of the confidence
interval. The default value is conf.level=0.95. This argument is ignored if
ci=FALSE.

n.bootstraps
Numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when ci.method="bootstrap". The default
value is n.bootstraps=1000. This argument is ignored if ci=FALSE and/or
ci.method does not equal "bootstrap".

pivot.statistic
Character string indicating which statistic to use for the confidence interval
for the mean when ci.method="normal.approx" or ci.method="normal.approx.w.cov"
(see the DETAILS section). The possible values are "z" (the default) and "t".
This argument is ignored if ci=FALSE and/or ci.method does not equal
"normal.approx" or "normal.approx.w.cov".

nmc
Numeric scalar indicating the number of Monte Carlo simulations to run when
ci.method="gpq". The default is nmc=1000. This argument is ignored
if ci=FALSE.

seed
Integer supplied to the function set.seed and used when
ci.method="bootstrap" or ci.method="gpq". The default value is
seed=NULL, in which case the current value of .Random.seed is used.

...
Additional arguments passed to other estimation functions. These include
prob.method (character string indicating what method to use to
compute the plotting positions (empirical probabilities) when method
is one of the quantile-quantile regression methods, e.g., "qq.reg" or
"impute.w.qq.reg"), plot.pos.con, lb.impute, ub.impute, tol,
convergence, t.df, and ci.sample.size. See the DETAILS section.

VALUE
A list of class "estimateCensored" containing the estimated parameters
and other information. See estimateCensored.object for details.

DETAILS
If x or censored contain any missing (NA), undefined (NaN) or
infinite (Inf, -Inf) values, they will be removed prior to
performing the estimation.
Let $\underline{x}$ denote a vector of $N$ observations from a
normal distribution with mean $\mu$ and
standard deviation $\sigma$. Assume $n$ ($0 < n < N$) of these
observations are known and $c$ ($c=N-n$) of these observations are
all censored below (left-censored) or all censored above (right-censored) at
$k$ fixed censoring levels
$$T_1, T_2, \ldots, T_k; \; k \ge 1 \;\;\;\;\;\; (1)$$
For the case when $k \ge 2$, the data are said to be Type I
multiply censored. For the case when $k=1$,
set $T = T_1$. If the data are left-censored
and all $n$ known observations are greater
than or equal to $T$, or if the data are right-censored and all $n$
known observations are less than or equal to $T$, then the data are
said to be Type I singly censored (Nelson, 1982, p.7), otherwise
they are considered to be Type I multiply censored.
Let $c_j$ denote the number of observations censored below or above censoring
level $T_j$ for $j = 1, 2, \ldots, k$, so that
$$\sum_{j=1}^k c_j = c \;\;\;\;\;\; (2)$$
Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the "ordered" observations,
where for a censored observation the value of the censoring level is used,
and let $\Omega$ denote the set of $n$ subscripts in the ordered sample that
correspond to uncensored observations.
Maximum Likelihood Estimation (method="mle")
For Type I left censored data, the likelihood function is given by:
$$L(\mu, \sigma | \underline{x}) = {N \choose c_1 c_2 \ldots c_k n} \prod_{j=1}^k [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\;\;\; (3)$$
where $f$ and $F$ denote the probability density function (pdf) and
cumulative distribution function (cdf) of the population. That is,
$$f(t) = \phi(\frac{t-\mu}{\sigma}) \;\;\;\;\;\; (4)$$
$$F(t) = \Phi(\frac{t-\mu}{\sigma}) \;\;\;\;\;\; (5)$$
where $\phi$ and $\Phi$ denote the pdf and cdf of the standard normal
distribution, respectively (Cohen, 1963; 1991, pp.6, 50). For left singly
censored data, Equation (3) simplifies to:
$$L(\mu, \sigma | \underline{x}) = {N \choose c} [F(T)]^{c} \prod_{i = c+1}^N f[x_{(i)}] \;\;\;\;\;\; (6)$$
Similarly, for Type I right censored data, the likelihood function is given by:
$$L(\mu, \sigma | \underline{x}) = {N \choose c_1 c_2 \ldots c_k n} \prod_{j=1}^k [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\;\;\; (7)$$
and for right singly censored data this simplifies to:
$$L(\mu, \sigma | \underline{x}) = {N \choose c} [1 - F(T)]^{c} \prod_{i = 1}^n f[x_{(i)}] \;\;\;\;\;\; (8)$$
The maximum likelihood estimators are computed by maximizing the likelihood
function. For right-censored data, Cohen (1963; 1991, pp.50-51) shows that
taking partial derivatives of the log-likelihood function with respect to
$\mu$ and $\sigma$ and setting these to 0 produces the following two
simultaneous equations:
$$\bar{x} - \mu = -\sigma \sum_{j=1}^k (\frac{c_j}{n}) Q_j \;\;\;\;\;\; (9)$$
$$s^2 + (\bar{x} - \mu)^2 = \sigma^2[1 - \sum_{j=1}^k \zeta_j (\frac{c_j}{n}) Q_j] \;\;\;\;\;\; (10)$$
where
$$\bar{x} = \frac{1}{n} \sum_{i \in \Omega} x_{(i)} \;\;\;\;\;\; (11)$$
$$s^2 = \frac{1}{n} \sum_{i \in \Omega} (x_{(i)} - \bar{x})^2 \;\;\;\;\;\; (12)$$
$$Q_j = Q(\zeta_j) \;\;\;\;\;\; (13)$$
$$\zeta_j = \frac{T_j - \mu}{\sigma} \;\;\;\;\;\; (14)$$
$$Q(t) = \frac{\phi(t)}{1 - \Phi(t)} \;\;\;\;\;\; (15)$$
Note that the quantity defined in Equation (11) is simply the mean of the uncensored
observations, the quantity defined in Equation (12) is simply the method of
moments estimator of variance based on the uncensored observations, and the
function $Q()$ defined in Equation (15) is the hazard function for the
standard normal distribution.
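For reference, the hazard function in Equation (15) is simple to evaluate directly in R; this is a minimal illustration (the name Q.hazard is hypothetical, not part of EnvStats):

Q.hazard <- function(t) dnorm(t) / (1 - pnorm(t))   # Equation (15)
Q.hazard(c(-1, 0, 1))   # increases with t; used in Equations (9), (10), and (13)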
For left-censored data, Equations (9) and (10) stay the same, except $\zeta$ is
replaced with $-\zeta$.
The function enormCensored
computes the maximum likelihood estimators by
solving Equations (9) and (10) and uses the quantile-quantile regression
estimators (see below) as initial values.
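To make the estimation step concrete, here is a minimal sketch that computes the MLEs for simulated, singly left-censored data by maximizing the log of the likelihood in Equation (6) directly with optim, rather than solving Equations (9) and (10) as enormCensored does; all object names are hypothetical:

negloglik <- function(theta, x.obs, T.cen, n.cen) {
  mu <- theta[1]; sigma <- theta[2]
  if (sigma <= 0) return(Inf)
  # n.cen censored values contribute log F(T); uncensored values contribute log f(x)
  -(n.cen * pnorm(T.cen, mu, sigma, log.p = TRUE) +
      sum(dnorm(x.obs, mu, sigma, log = TRUE)))
}
set.seed(42)
x <- rnorm(25, mean = 10, sd = 2)
T.cen <- 8                               # single fixed censoring level
cens <- x < T.cen                        # left-censor values below T.cen
fit <- optim(c(mean(x[!cens]), sd(x[!cens])), negloglik,
  x.obs = x[!cens], T.cen = T.cen, n.cen = sum(cens))
fit$par                                  # approximate MLEs of mu and sigma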
Quantile-Quantile Regression (method="qq.reg"
)
This method is sometimes called the probability plot method
(Nelson, 1982, Chapter 3; Gilbert, 1987, pp.134-136;
Helsel and Hirsch, 1992, p. 361), and more recently also called
parametric regression on order statistics or ROS
(USEPA, 2009; Helsel, 2012). In the case of no censoring, it is well known
(e.g., Nelson, 1982, p.113; Cleveland, 1993, p.31) that for the standard
normal (Gaussian) quantile-quantile plot (i.e., the plot of the sorted observations
(empirical quantiles) versus standard normal quantiles; see qqPlot),
the intercept and slope of the fitted least-squares line estimate the mean and
standard deviation, respectively. Specifically, the estimates of $\mu$ and
$\sigma$ are found by computing the least-squares estimates in the following
model:
$$x_{(i)} = \mu + \sigma \Phi^{-1}(p_i) + \epsilon_i, \; i = 1, 2, \ldots, N \;\;\;\;\;\; (16)$$
where
$$p_i = \frac{i-a}{N - 2a + 1} \;\;\;\;\;\; (17)$$
denotes the plotting position associated with the $i$'th order statistic,
$a$ is a constant such that $0 \le a \le 1$ (the plotting position constant),
and $\Phi$ denotes the cumulative distribution function (cdf) of the standard
normal distribution. The default value of $a$ is 0.375 (see below).
This method can be adapted to the case of left (right) singly censored data
as follows. Plot the $n$ uncensored observations against the $n$
largest (smallest) normal quantiles, where the normal quantiles are computed
based on a sample size of $N$, fit the least-squares line to this plot, and
estimate the mean and standard deviation from the intercept and slope, respectively.
That is, use Equations (16) and (17), but for right singly censored data use
$i = 1, 2, \ldots, n$, and for left singly censored data use
$i = (c+1), (c+2), \ldots, N$.
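A minimal sketch of this adaptation for simulated, left singly censored data follows (hypothetical object names; the real function also handles multiply censored data and other plotting-position methods):

set.seed(42)
x <- rnorm(25, 10, 2)
T.cen <- 8
cens <- x < T.cen                        # left singly censored
N <- length(x); c.n <- sum(cens); a <- 0.375
i <- (c.n + 1):N                         # ranks of the uncensored observations
p <- (i - a) / (N - 2 * a + 1)           # plotting positions, Equation (17)
fit <- lm(sort(x[!cens]) ~ qnorm(p))     # Equation (16), uncensored ranks only
coef(fit)                                # intercept estimates mu, slope estimates sigma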
The argument plot.pos.con (see the entry for ... in the ARGUMENTS
section above) determines the value of the plotting positions computed in
Equation (17). The default value is plot.pos.con=0.375.
See the help file for qqPlot
for more information.
This method is discussed by Haas and Scheff (1990). In the context of
lognormal data, Travis and Land (1990) suggest exponentiating the
predicted 50'th percentile from this fit to estimate the geometric mean
(i.e., the median of the lognormal distribution).
This method is easily extended to multiply censored data. Equation (16) becomes
$$x_{(i)} = \mu + \sigma \Phi^{-1}(p_i) + \epsilon_i, \; i \in \Omega \;\;\;\;\;\; (18)$$
where $\Omega$ denotes the set of $n$ subscripts associated with the
uncensored observations in the ordered sample. The plotting positions are
computed by calling the function ppointsCensored.
The argument prob.method
determines the method of computing the plotting
positions (default is prob.method="hirsch-stedinger"
), and the argument
plot.pos.con
determines the plotting position constant (default is
plot.pos.con=0.375
). (See the entry for ... in the ARGUMENTS section above.)
Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger probability
method but set the plotting position constant to 0.
Imputation Using Quantile-Quantile Regression (method="impute.w.qq.reg"
)
This method is also called robust ROS (USEPA, 2009; Helsel, 2012). It
involves using the quantile-quantile regression method (method="qq.reg"
)
to fit a regression line (and thus initially estimate the mean and standard
deviation), and then imputing the values of the
censored observations by predicting them from the regression equation.
The final estimates of the mean and standard deviation are then computed using
the usual formulas (see enorm
) based on the observed and imputed
values.
The imputed values are computed as:
$$\hat{x}_{(i)} = \hat{\mu}_{qqreg} + \hat{\sigma}_{qqreg} \Phi^{-1}(p_i), \; i \not \in \Omega \;\;\;\;\;\; (19)$$
See the help file for ppointsCensored
for information on how the
plotting positions for the censored observations are computed.
The argument prob.method
determines the method of computing the plotting
positions (default is prob.method="hirsch-stedinger"
), and the argument
plot.pos.con
determines the plotting position constant (default is
plot.pos.con=0.375
). (See the entry for ... in the ARGUMENTS section above.)
Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger probability
method but set the plotting position constant to 0.
The arguments lb.impute
and ub.impute
determine the lower and upper
bounds for the imputed values. Imputed values smaller than lb.impute
are
set to this value. Imputed values larger than ub.impute
are set to this
value. The default values are lb.impute=-Inf
and ub.impute=Inf.
See the entry for ... in the ARGUMENTS section above.
For singly censored data, this is the NR method of Gilliom and Helsel
(1986, p. 137). In the context of lognormal data, this method is discussed by
Hashimoto and Trussell (1983), Gilliom and Helsel (1986), and El-Shaarawi (1989),
and is referred to as the LR or Log-Probability Method.
For multiply censored data, this method was developed in the context of
lognormal data by Helsel and Cohn (1988) using the formulas for plotting
positions given in Hirsch and Stedinger (1987) and Weibull plotting positions
(i.e., prob.method="hirsch-stedinger"
and plot.pos.con=0
).
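The imputation step of Equation (19) can be sketched for left singly censored data as follows, using simple Blom-type plotting positions in place of the default Hirsch-Stedinger method (hypothetical simulated data):

set.seed(42)
x <- rnorm(25, 10, 2)
T.cen <- 8
cens <- x < T.cen
N <- length(x); c.n <- sum(cens); a <- 0.375
p <- ((1:N) - a) / (N - 2 * a + 1)                 # plotting positions
fit <- lm(sort(x[!cens]) ~ qnorm(p[(c.n + 1):N]))  # Q-Q regression line
x.imp <- x
x.imp[cens] <- coef(fit)[1] + coef(fit)[2] * qnorm(p[1:c.n])  # Equation (19)
c(mean = mean(x.imp), sd = sd(x.imp))              # usual formulas (see enorm)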
Setting Censored Observations to Half the Censoring Level (method="half.cen.level"
)
This method is applicable only to left censored data that is bounded below by 0.
This method involves simply replacing all the censored observations with half their
detection limit, and then computing the mean and standard deviation with the usual
formulas (see enorm).
This method is included only to allow comparison of this method to other methods.
Setting left-censored observations to half the censoring level is not
recommended.
For singly censored data, this method is discussed by Gleit (1985),
Haas and Scheff (1990), and El-Shaarawi and Esterby (1992).
El-Shaarawi and Esterby (1992) show that these estimators are biased and
inconsistent (i.e., the bias remains even as the sample size increases).
For multiply censored data, this method was studied by Helsel and Cohn
(1988).
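For completeness, the substitution itself is a one-liner (for comparison only, per the caution above; hypothetical simulated data):

set.seed(42)
x <- rnorm(25, 10, 2)
T.cen <- 8
cens <- x < T.cen
x.sub <- ifelse(cens, T.cen / 2, x)      # replace censored values with T/2
c(mean = mean(x.sub), sd = sd(x.sub))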
Estimation Methods for Singly Censored Data
The following methods are available only for singly censored data.
Bias-Corrected Maximum Likelihood Estimation (method="bcmle"
)
The maximum likelihood estimates of $\mu$ and $\sigma$ are biased.
The bias tends to 0 as the sample size increases, but it can be considerable
for small sample sizes, especially in the case of a large percentage of
censored observations (Saw, 1961b). Schmee et al. (1985) note that bias and
variances of the mle's are of the order $1/N$ (see for example,
Bain and Engelhardt, 1991), and that for 90% censoring the bias is negligible
if $N$ is at least 100. (For less intense censoring, even fewer
observations are needed.)
The exact bias of each estimator is extremely difficult to compute.
Saw (1961b), however, derived the first-order term (i.e., the term of order
$1/N$) in the bias of the mle's of $\mu$ and $\sigma$ and
proposed bias-corrected mle's. His bias-corrected estimators were derived for
the case of Type II singly censored data. Schneider (1986, p.110) and
Haas and Scheff (1990), however, state that this bias correction should
reduce the bias of the estimators in the case of Type I censoring as well.
Based on the tables of bias-correction terms given in Saw (1961b),
Schneider (1986, pp.107-110) performed a least-squares fit to produce the
following computational formulas for right-censored data:
$$B_{\mu} = -exp[2.692 - 5.493 \frac{n}{N+1}] \;\;\;\;\;\; (20)$$
$$B_{\sigma} = -[0.312 + 0.859 \frac{n}{N+1}]^{-2} \;\;\;\;\;\; (21)$$
$$\hat{\mu}_{bcmle} = \hat{\mu}_{mle} - \frac{\hat{\sigma}_{mle}}{N+1} B_{\mu} \;\;\;\;\;\; (22)$$
$$\hat{\sigma}_{bcmle} = \hat{\sigma}_{mle} - \frac{\hat{\sigma}_{mle}}{N+1} B_{\sigma} \;\;\;\;\;\; (23)$$
For left-censored data, Equation (22) becomes:
$$\hat{\mu}_{bcmle} = \hat{\mu}_{mle} + \frac{\hat{\sigma}_{mle}}{N+1} B_{\mu} \;\;\;\;\;\; (24)$$
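A minimal numerical sketch of Equations (20)-(24), using illustrative (hypothetical) values for the MLEs and sample sizes:

N <- 25; n <- 19                                # total and uncensored sample sizes
mu.mle <- 2.22; sd.mle <- 1.36                  # hypothetical MLEs
B.mu    <- -exp(2.692 - 5.493 * n / (N + 1))    # Equation (20)
B.sigma <- -(0.312 + 0.859 * n / (N + 1))^(-2)  # Equation (21)
mu.bc <- mu.mle + sd.mle * B.mu / (N + 1)       # Equation (24), left-censored
sd.bc <- sd.mle - sd.mle * B.sigma / (N + 1)    # Equation (23)
c(mean = mu.bc, sd = sd.bc)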
Quantile-Quantile Regression Including the Censoring Level (method="qq.reg.w.cen.level"
)
This is a modification of the quantile-quantile regression method and was proposed
by El-Shaarawi (1989) in the context of lognormal data. El-Shaarawi's idea is to
include the censoring level and an associated plotting position, along with the
uncensored observations and their associated plotting positions, in order to include
information about the value of the censoring level $T$.
For left singly censored data, the modification involves adding the point
$[\Phi^{-1}(p_c), T]$ to the plot before fitting the least-squares line.
For right singly censored data, the point $[\Phi^{-1}(p_{n+1}), T]$
is added to the plot before fitting the least-squares line.
El-Shaarawi (1989) also proposed replacing the estimated normal quantiles with the
exact expected values of normal order statistics, and using the values in their
variance-covariance matrix to perform a weighted least-squares regression.
These last two modifications are not incorporated here.
Imputation Using Quantile-Quantile Regression Including the Censoring Level (method="impute.w.qq.reg.w.cen.level")
This is exactly the same method as imputation using quantile-quantile regression
(method="impute.w.qq.reg"
), except that the quantile-quantile regression
including the censoring level method (method="qq.reg.w.cen.level"
) is used
to fit the regression line. In the context of lognormal data, this method is
discussed by El-Shaarawi (1989), which he denotes as the Modified LR Method.
Imputation Using Maximum Likelihood (method ="impute.w.mle"
)
This is exactly the same method as imputation with quantile-quantile regression
(method="impute.w.qq.reg"
), except that the maximum likelihood method
(method="mle"
) is used to compute the initial estimates of the mean and
standard deviation. In the context of lognormal data, this method is discussed
by El-Shaarawi (1989), which he denotes as the Modified Maximum Likelihood Method.
Iterative Imputation Using Quantile-Quantile Regression (method="iterative.impute.w.qq.reg"
)
This method is similar to the imputation with quantile-quantile regression method
(method="impute.w.qq.reg"
), but iterates until the estimates of the mean
and standard deviation converge. The algorithm is:
"impute.w.qq.reg"
method. (Actually, any suitable estimates will do.)enorm
).tol
andconvergence
;
see the entry for...in the ARGUMENTS section above).method="m.est"
)
This method was contributed by Leo R. Korn (Korn and Tyler, 2001).
This method finds location and scale estimates that are consistent at the
normal model and robust to deviations from the normal model, including both
outliers on the right and outliers on the left above and below the limit of
detection. The estimates are found by solving the simultaneous equations:
$$\sum_{i=1}^c h_{\nu} (\frac{T-\mu}{\sigma}) + \sum_{i=c+1}^N \psi_{\nu} (\frac{x_i - \mu}{\sigma}) = 0 \;\;\;\;\;\; (25)$$
$$\sum_{i=1}^c \lambda_{\nu} (\frac{T-\mu}{\sigma}) + \sum_{i=c+1}^N \chi_{\nu} (\frac{x_i - \mu}{\sigma}) = 0 \;\;\;\;\;\; (26)$$
where
$$H_{\nu}(r) = -log[F_{\nu}(r)] \;\;\;\;\;\; (27)$$
$$h_{\nu}(r) = \frac{d}{dr} H_{\nu}(r) = H'_{\nu}(r) \;\;\;\;\;\; (28)$$
$$\rho_{\nu}(r) = -log[f_{\nu}(r)] \;\;\;\;\;\; (29)$$
$$\psi_{\nu}(r) = \frac{d}{dr} \rho_{\nu}(r) = \rho'_{\nu}(r) \;\;\;\;\;\; (30)$$
$$\lambda_{\nu}(r) = r h_{\nu}(r) \;\;\;\;\;\; (31)$$
$$\chi_{\nu}(r) = r \psi_{\nu}(r) - 1 \;\;\;\;\;\; (32)$$
and $f_{\nu}$ and $F_{\nu}$ denote the probability density function
(pdf) and cumulative distribution function (cdf) of
Student's t-distribution with $\nu$ degrees of freedom.
This results in an M-estimating equation based on the t-density function
(Korn and Tyler, 2001). Since the t-density has heavier tails than the normal
density, this M-estimator will tend to down-weight values that are far away from
the center of the data. When censoring is present, neither the location nor the
scale estimates are consistent at the normal model. A computational correction
is performed that converts the above M-estimator to another M-estimator that is
consistent at the normal model, even under censoring.
The degrees of freedom parameter $\nu$ is set by the argument t.df
and
may be viewed as a tuning parameter that will determine the robustness and
efficiency properties. When t.df
is large, the estimator is similar to
the usual mle and the output will then be very close to that when method="mle"
.
As t.df
decreases, the efficiency will decline and the outlier rejection
property will increase in strength. Choosing t.df=3
(the default) provides
a good combination of efficiency and robustness. A reasonable strategy is to
transform the data so that they are approximately symmetric (often the log
transformation for environmental data is appropriate) and then apply the
M-estimator using t.df=3.
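For example, assuming the EnvStats package is attached, a call using the robust M-estimator with the default tuning value might look like this (hypothetical simulated data):

library(EnvStats)
set.seed(42)
x <- rnorm(20, 10, 2)
cens <- x < 8
x[cens] <- 8                             # left-censor at a detection limit of 8
enormCensored(x, cens, method = "m.est", t.df = 3)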
CONFIDENCE INTERVALS
This section explains how confidence intervals for the mean $\mu$ are
computed.
Likelihood Profile (ci.method="profile.likelihood"
)
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean $\mu$ while treating the standard deviation
$\sigma$ as a nuisance parameter. Equation (3) above
shows the form of the likelihood function $L(\mu, \sigma | \underline{x})$ for
multiply left-censored data, and Equation (7) shows the function for
multiply right-censored data.
Following Stryhn and Christensen (2003), denote the maximum likelihood estimates
of the mean and standard deviation by $(\mu^*, \sigma^*)$. The likelihood
ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$
(where $\mu_0$ is a fixed value) equals the drop in $2 log(L)$ between the
ci.method="normal.approx"
)
This method constructs approximate $(1-\alpha)100%$ confidence intervals for
$\mu$ based on the assumption that the estimator of $\mu$ is
approximately normally distributed. That is, a two-sided $(1-\alpha)100%$
confidence interval for $\mu$ is constructed as:
$$[\hat{\mu} - t_{1-\alpha/2, m-1}\hat{\sigma}_{\hat{\mu}}, \; \hat{\mu} + t_{1-\alpha/2, m-1}\hat{\sigma}_{\hat{\mu}}] \;\;\;\; (34)$$
where $\hat{\mu}$ denotes the estimate of $\mu$,
$\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard
deviation of the estimator of $\mu$, $m$ denotes the assumed sample
size for the confidence interval, and $t_{p,\nu}$ denotes the $p$'th
quantile of Student's t-distribution with $\nu$
degrees of freedom. One-sided confidence intervals are computed in a
similar fashion.
The argument ci.sample.size
determines the value of $m$
(see the entry for ... in the ARGUMENTS section above).
When method
equals
"mle"
or "bcmle"
, the default value is the expected number of
uncensored observations, otherwise it is the observed number of
uncensored observations. This is simply an ad-hoc method of constructing
confidence intervals and is not based on any published theoretical results.
When pivot.statistic="z"
, the $p$'th quantile from the
standard normal distribution is used in place of the
$p$'th quantile from Student's t-distribution.
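A minimal sketch of Equation (34), with illustrative (hypothetical) values for the estimate, its standard error, and the assumed sample size $m$:

mu.hat <- 2.22; se.mu <- 0.31; m <- 19; conf <- 0.95
q.t <- qt(1 - (1 - conf)/2, df = m - 1)  # pivot.statistic = "t"
q.z <- qnorm(1 - (1 - conf)/2)           # pivot.statistic = "z" (the default)
c(lcl = mu.hat - q.z * se.mu, ucl = mu.hat + q.z * se.mu)  # z-based interval
c(lcl = mu.hat - q.t * se.mu, ucl = mu.hat + q.t * se.mu)  # t-based interval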
Approximate Confidence Interval Based on Maximum Likelihood Estimators
When method="mle"
, the standard deviation of the mle of $\mu$ is
estimated based on the inverse of the Fisher Information matrix. The estimated
variance-covariance matrix for the estimates of $\mu$ and $\sigma$
are based on the observed information matrix, formulas for which are given in
Cohen (1991).
Approximate Confidence Interval Based on Bias-Corrected Maximum Likelihood Estimators
When method="bcmle"
(available only for singly censored data),
the same procedures are used to construct the
confidence interval as for method="mle"
. The true variance of the
bias-corrected mle of $\mu$ is necessarily larger than the variance of the
mle of $\mu$ (although the difference in the variances goes to 0 as the
sample size gets large). Hence this method of constructing a confidence interval
leads to intervals that are too short for small sample sizes, but these intervals
should be better centered about the true value of $\mu$.
Approximate Confidence Interval Based on Other Estimators
When method is some value other than "mle"
, the standard deviation of the
estimated mean is approximated by
$$\hat{\sigma}_{\hat{\mu}} = \frac{\hat{\sigma}}{\sqrt{m}} \;\;\;\;\;\; (35)$$
where, as already noted, $m$ denotes the assumed sample size. This is simply
an ad-hoc method of constructing confidence intervals and is not based on any
published theoretical results.
Normal Approximation Using Covariance (ci.method="normal.approx.w.cov"
)
This method is only available for singly censored data and only applicable when
method="mle"
or method="bcmle"
. It was proposed by Schneider
(1986, pp. 191-193) for the case of Type II censoring, but is applicable to any
situation where the estimated mean and standard deviation are consistent estimators
and are correlated. In particular, the mle's of $\mu$ and $\sigma$ are
correlated under Type I censoring as well.
Schneider's idea is to determine two positive quantities $z_1, z_2$ such that
$$Pr(\hat{\mu} + z_1\hat{\sigma} < \mu) = \frac{\alpha}{2} \;\;\;\;\;\; (36)$$
$$Pr(\hat{\mu} - z_2\hat{\sigma} > \mu) = \frac{\alpha}{2} \;\;\;\;\;\; (37)$$
so that
$$[\hat{\mu} - z_2\hat{\sigma}, \; \hat{\mu} + z_1\hat{\sigma}] \;\;\;\;\;\; (38)$$
is a $(1-\alpha)100%$ confidence interval for $\mu$.
For cases where the estimators of $\mu$ and $\sigma$ are independent
(e.g., complete samples), it is well known that setting
$$z_1 = z_2 = \frac{t_{1-\alpha/2, N-1}}{\sqrt{N}} \;\;\;\;\;\; (39)$$
yields an exact confidence interval and setting
$$z_1 = z_2 = \frac{z_{1-\alpha/2}}{\sqrt{N}} \;\;\;\;\;\; (40)$$
where $z_p$ denotes the $p$'th quantile of the standard normal distribution
yields an approximate confidence interval that is asymptotically correct.
For the general case, Schneider (1986) considers the random variable
$$W(z) = \hat{\mu} + z\hat{\sigma} \;\;\;\;\;\; (41)$$
and provides formulas for $z_1$ and $z_2$.
Note that the resulting confidence interval for the mean is not symmetric about
the estimated mean. Also note that the quantity $m$ is a random variable for
Type I censoring, while Schneider (1986) assumed it to be fixed since he derived
the result for Type II censoring (in which case $m=n$).
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate $(1-\alpha)100%$ confidence interval
for the population mean $\mu$, the bootstrap can be broken down into the
following steps:
1. Create a bootstrap sample by taking a random sample of size $N$ from the
observations in $\underline{x}$, where sampling is done with replacement.
Note that because sampling is done with replacement, the same element of
$\underline{x}$ can appear more than once in the bootstrap sample. Also note
that the censoring status of each observation is carried along with its value.
2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the
same method that was used to estimate $\mu$ with the original observations.
3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the
function enormCensored, the number of bootstraps $B$ is determined by the
argument n.bootstraps (see the section ARGUMENTS above). The default value of
n.bootstraps is 1000.
4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative
distribution function of the estimator of $\mu$ (see ecdfPlot), and then
create a confidence interval for $\mu$ based on this estimated cdf.
The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is
computed as:
$$[\hat{G}^{-1}(\frac{\alpha}{2}), \; \hat{G}^{-1}(1-\frac{\alpha}{2})] \;\;\;\;\;\; (42)$$
where $\hat{G}(t)$ denotes the empirical cdf of the bootstrap estimates of
$\mu$ evaluated at $t$. Similarly, a one-sided lower confidence interval is
computed as:
$$[\hat{G}^{-1}(\alpha), \; \infty] \;\;\;\;\;\; (43)$$
and a one-sided upper confidence interval is computed as:
$$[-\infty, \; \hat{G}^{-1}(1-\alpha)] \;\;\;\;\;\; (44)$$
The function enormCensored calls the R function quantile
to compute the empirical quantiles used in Equations (42)-(44).
The percentile method bootstrap confidence interval is only first-order
accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability
that the confidence interval will contain the true value of $\mu$ can be
off by $k/\sqrt{N}$, where $k$ is some constant. Efron and Tibshirani
(1993, pp.184-188) proposed a bias-corrected and accelerated interval that is
second-order accurate, meaning that the probability that the confidence interval
will contain the true value of $\mu$ may be off by $k/N$ instead of
$k/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is
computed as:
$$[\hat{G}^{-1}(\alpha_1), \; \hat{G}^{-1}(\alpha_2)] \;\;\;\;\;\; (45)$$
where
$$\alpha_1 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}] \;\;\;\;\;\; (46)$$
$$\alpha_2 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}] \;\;\;\;\;\; (47)$$
$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})] \;\;\;\;\;\; (48)$$
$$\hat{a} = \frac{\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6[\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2]^{3/2}} \;\;\;\;\;\; (49)$$
where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using
all the values in $\underline{x}$ except the $i$'th one, and
$$\hat{\mu}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^N \hat{\mu}_{(i)} \;\;\;\;\;\; (50)$$
A one-sided lower confidence interval is given by:
$$[\hat{G}^{-1}(\alpha_1), \; \infty] \;\;\;\;\;\; (51)$$
and a one-sided upper confidence interval is given by:
$$[-\infty, \; \hat{G}^{-1}(\alpha_2)] \;\;\;\;\;\; (52)$$
where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence
interval, except $\alpha/2$ is replaced with $\alpha$ in Equations (46) and (47).
The constant $\hat{z}_0$ incorporates the bias correction, and the constant
$\hat{a}$ is the acceleration constant. When ci.method="bootstrap",
the function enormCensored computes both
computes both
the percentile method and bias-corrected and accelerated method
bootstrap confidence intervals.
This method of constructing confidence intervals for censored data was studied by
Shumway et al. (1989).
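A minimal sketch of the percentile interval in Equation (42), refitting the MLE on each bootstrap sample with enormCensored (hypothetical simulated data; as a simplification of Steps 1-3, resamples lacking both censored and uncensored values are redrawn):

library(EnvStats)
set.seed(42)
x <- rnorm(25, 10, 2)
T.cen <- 8
cens <- x < T.cen
x[cens] <- T.cen
B <- 200                                   # small for illustration; default is 1000
boot.means <- replicate(B, {
  repeat {
    i <- sample(length(x), replace = TRUE)   # Step 1: resample with replacement
    if (any(cens[i]) && any(!cens[i])) break # need both statuses to refit
  }
  enormCensored(x[i], cens[i])$parameters["mean"]  # Step 2: re-estimate mu
})
quantile(boot.means, c(0.025, 0.975))      # Equation (42) with alpha = 0.05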
Generalized Pivotal Quantity (ci.method="gpq"
)
This method was introduced by Schmee et al. (1985) and is discussed by
Krishnamoorthy and Mathew (2009). The idea is essentially to use a parametric
bootstrap to estimate the correct pivotal quantities $z_1$ and $z_2$ in
Equation (38) above. For singly censored data, these quantities are computed as
follows:
1. Generate a random sample of $N$ observations from a standard normal
distribution and let $z_{(1)}, z_{(2)}, \ldots, z_{(N)}$ denote the ordered
observations.
2. Set the smallest c observations to be censored (for left-censored data)
or the largest c observations to be censored (for right-censored data).
3. Compute the estimates of $\mu$ and $\sigma$ using the method specified by
the method argument, and denote these estimates as
$\hat{\mu}^*, \; \hat{\sigma}^*$.
4. Compute the t-like pivotal quantity $\hat{t} = \hat{\mu}^*/\hat{\sigma}^*$.
5. Repeat Steps 1-4 nmc times to produce an empirical distribution of
the t-like pivotal quantity.
The function enormCensored calls the function
gpqCiNormSinglyCensored to generate the distribution of
pivotal quantities in the case of singly censored data.
A two-sided $(1-\alpha)100%$ confidence interval for $\mu$ is then
computed as:
$$[\hat{\mu} - \hat{t}_{1-(\alpha/2)} \hat{\sigma}, \; \hat{\mu} - \hat{t}_{\alpha/2} \hat{\sigma}] \;\;\;\;\;\; (53)$$
where $\hat{t}_p$ denotes the $p$'th empirical quantile of the
nmc
generated $\hat{t}$ values.
Schmee et al. (1985) derived this method in the context of Type II singly censored
data (for which these limits are exact within Monte Carlo error), but state that
according to Regal (1982) this method produces confidence intervals that are
close approximations to the correct limits for Type I censored data.
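A minimal sketch of these Monte Carlo steps for left singly censored data, using a direct likelihood fit in Step 3 (illustrative only; EnvStats supplies gpqCiNormSinglyCensored for this purpose):

set.seed(42)
N <- 25; c.n <- 6; nmc <- 500            # nmc reduced for illustration
t.hat <- replicate(nmc, {
  z <- sort(rnorm(N))                    # Step 1: ordered standard normal sample
  z.obs <- z[-(1:c.n)]                   # Step 2: smallest c.n values censored
  T.cen <- min(z.obs)                    # censoring level = smallest observed value
  nll <- function(th) {                  # Step 3: MLE via the log of Equation (6)
    if (th[2] <= 0) return(Inf)
    -(c.n * pnorm(T.cen, th[1], th[2], log.p = TRUE) +
        sum(dnorm(z.obs, th[1], th[2], log = TRUE)))
  }
  est <- optim(c(mean(z.obs), sd(z.obs)), nll)$par
  est[1] / est[2]                        # Step 4: t-like pivotal quantity
})
t.q <- quantile(t.hat, c(0.025, 0.975))  # Step 5: empirical quantiles
# Two-sided 95% CI per Equation (53):
#   [mu.hat - t.q[2] * sd.hat, mu.hat - t.q[1] * sd.hat]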
For multiply censored data, this method has been extended as follows. The
algorithm stays the same, except that Step 2 becomes:
2. Set the $i$'th ordered generated observation to be censored or not censored
according to whether the $i$'th observed observation in the original data
is censored or not censored.
The function enormCensored
calls the function
gpqCiNormMultiplyCensored
to generate the distribution of
pivotal quantities in the case of multiply censored data.
SEE ALSO
enorm, estimateCensored.object.
EXAMPLES
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean
# and standard deviation of a lognormal distribution on the log-scale using
# manganese concentrations (ppb) in groundwater at five background wells.
# In EnvStats these data are stored in the data frame
# EPA.09.Ex.15.1.manganese.df.
# Here we will estimate the mean and standard deviation using the MLE,
# Q-Q regression (also called parametric regression on order statistics
# or ROS; e.g., USEPA, 2009 and Helsel, 2012), and imputation with Q-Q
# regression (also called robust ROS).
# We will log-transform the original observations and then call
# enormCensored. Alternatively, we could have more simply called
# elnormCensored.
# First look at the data:
#-----------------------
EPA.09.Ex.15.1.manganese.df
# Sample Well Manganese.Orig.ppb Manganese.ppb Censored
#1 1 Well.1 <5 5.0 TRUE
#2 2 Well.1 12.1 12.1 FALSE
#3 3 Well.1 16.9 16.9 FALSE
#...
#23 3 Well.5 3.3 3.3 FALSE
#24 4 Well.5 8.4 8.4 FALSE
#25 5 Well.5 <2 2.0 TRUE
longToWide(EPA.09.Ex.15.1.manganese.df,
"Manganese.Orig.ppb", "Sample", "Well",
paste.row.name = TRUE)
# Well.1 Well.2 Well.3 Well.4 Well.5
#Sample.1 <5 <5 <5 6.3 17.9
#Sample.2 12.1 7.7 5.3 11.9 22.7
#Sample.3 16.9 53.6 12.6 10 3.3
#Sample.4 21.6 9.5 106.3 <2 8.4
#Sample.5 <2 45.9 34.5 77.2 <2
# Now estimate the mean and standard deviation on the log-scale
# using the MLE:
#---------------------------------------------------------------
with(EPA.09.Ex.15.1.manganese.df,
enormCensored(log(Manganese.ppb), Censored))
#Results of Distribution Parameter Estimation
#Based on Type I Censored Data
#--------------------------------------------
#
#Assumed Distribution: Normal
#
#Censoring Side: left
#
#Censoring Level(s): 0.6931472 1.6094379
#
#Estimated Parameter(s): mean = 2.215905
# sd = 1.356291
#
#Estimation Method: MLE
#
#Data: log(Manganese.ppb)
#
#Censoring Variable: Censored
#
#Sample Size: 25
#
#Percent Censored: 24%
# Now compare the MLE with the estimators based on
# Q-Q regression and imputation with Q-Q regression
#--------------------------------------------------
with(EPA.09.Ex.15.1.manganese.df,
enormCensored(log(Manganese.ppb), Censored))$parameters
# mean sd
#2.215905 1.356291
with(EPA.09.Ex.15.1.manganese.df,
enormCensored(log(Manganese.ppb), Censored,
method = "qq.reg"))$parameters
# mean sd
#2.293742 1.283635
with(EPA.09.Ex.15.1.manganese.df,
enormCensored(log(Manganese.ppb), Censored,
method = "impute.w.qq.reg"))$parameters
# mean sd
#2.298656 1.238104
#----------
# The method used to estimate quantiles for a Q-Q plot is
# determined by the argument prob.method. For the functions
# enormCensored and elnormCensored, for any estimation
# method that involves Q-Q regression, the default value of
# prob.method is "hirsch-stedinger" and the default value for the
# plotting position constant is plot.pos.con=0.375.
# Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger
# probability method but set the plotting position constant to 0.
with(EPA.09.Ex.15.1.manganese.df,
enormCensored(log(Manganese.ppb), Censored,
method = "impute.w.qq.reg", plot.pos.con = 0))$parameters
# mean sd
#2.277175 1.261431
#----------
# Using the same data as above, compute a confidence interval
# for the mean on the log-scale using the profile-likelihood
# method.
with(EPA.09.Ex.15.1.manganese.df,
enormCensored(log(Manganese.ppb), Censored, ci = TRUE))
#Results of Distribution Parameter Estimation
#Based on Type I Censored Data
#--------------------------------------------
#
#Assumed Distribution: Normal
#
#Censoring Side: left
#
#Censoring Level(s): 0.6931472 1.6094379
#
#Estimated Parameter(s): mean = 2.215905
# sd = 1.356291
#
#Estimation Method: MLE
#
#Data: log(Manganese.ppb)
#
#Censoring Variable: Censored
#
#Sample Size: 25
#
#Percent Censored: 24%
#
#Confidence Interval for: mean
#
#Confidence Interval Method: Profile Likelihood
#
#Confidence Interval Type: two-sided
#
#Confidence Level: 95%
#
#Confidence Interval: LCL = 1.595062
# UCL = 2.771197