Methods for computing plotting positions for complete data sets
(no censored observations) are discussed in D'Agostino, R.B. (1986a) and
Cleveland (1993). For data sets with censored observations, these methods
must be modified. The function ppointsCensored
allows you to compute
plotting positions based on any of the following methods:
Product-limit method of Kaplan and Meier (1958) (prob.method="kaplan-meier"
).
Hazard plotting method of Nelson (1972) (prob.method="nelson"
).
Generalization of the product-limit method due to Michael and Schucany (1986)
(prob.method="michael-schucany"
) (the default).
Generalization of the product-limit method due to Hirsch and Stedinger (1987)
(prob.method="hirsch-stedinger"
).
Let \(\underline{x}\) denote a random sample of \(N\) observations from
some distribution. Assume \(n\) (\(0 < n < N\)) of these
observations are known and \(c\) (\(c=N-n\)) of these observations are
all censored below (left-censored) or all censored above (right-censored) at
\(k\) fixed censoring levels
$$T_1, T_2, \ldots, T_K; \; K \ge 1 \;\;\;\;\;\; (1)$$
For the case when \(K \ge 2\), the data are said to be Type I
multiply censored. For the case when \(K=1\),
set \(T = T_1\). If the data are left-censored
and all \(n\) known observations are greater
than or equal to \(T\), or if the data are right-censored and all \(n\)
known observations are less than or equal to \(T\), then the data are
said to be Type I singly censored (Nelson, 1982, p.7), otherwise
they are considered to be Type I multiply censored.
Let \(c_j\) denote the number of observations censored below or above censoring
level \(T_j\) for \(j = 1, 2, \ldots, K\), so that
$$\sum_{i=1}^K c_j = c \;\;\;\;\;\; (2)$$
Let \(x_{(1)}, x_{(2)}, \ldots, x_{(N)}\) denote the “ordered” observations,
where now “observation” means either the actual observation (for uncensored
observations) or the censoring level (for censored observations). For
right-censored data, if a censored observation has the same value as an
uncensored one, the uncensored observation should be placed first.
For left-censored data, if a censored observation has the same value as an
uncensored one, the censored observation should be placed first.
Note that in this case the quantity \(x_{(i)}\) does not necessarily represent
the \(i\)'th “largest” observation from the (unknown) complete sample.
Finally, let \(\Omega\) (omega) denote the set of \(n\) subscripts in the
“ordered” sample that correspond to uncensored observations.
Product-Limit Method of Kaplan and Meier (prob.method="kaplan-meier"
)
For complete data sets (no censored observations), the
empirical probabilities estimator of the cumulative distribution
function evaluated at the \(i\)'th ordered observation is given by
(D'Agostino, 1986a, p.8):
$$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{\#[x_j \le x_{(i)}]}{n} \;\;\;\;\;\; (3)$$
where \(\#[x_j \le x_{(i)}]\) denotes the number of observations less than
or equal to \(x_{(i)}\) (see the help file for ecdfPlot
).
Kaplan and Meier (1958) extended this method of computing the empirical cdf to
the case of right-censored data.
Right-Censored Data (censoring.side="right"
)
Let \(S(t)\) denote the survival function evaluated at \(t\), that is:
$$S(t) = 1 - F(t) = Pr(X > t) \;\;\;\;\;\; (4)$$
Kaplan and Meier (1958) show that a nonparametric estimate of the survival
function at the \(i\)'th ordered observation that is not censored
(i.e., \(i \in \Omega\)), is given by:
\(\hat{S}[x_{(i)}]\) |
\(=\) |
\(\widehat{Pr}[X > x_{(i)}]\) |
|
\(=\) |
\(\widehat{Pr}[X > x_{(1)}]\) |
|
|
\(\;\; \widehat{Pr}[X > x_{(2)} | X > x_{(1)}] \;\; \cdots\) |
|
|
\(\;\; \widehat{Pr}[X > x_{(i)} | X > x_{(i-1)}]\) |
where \(n_j\) is the number of observations (uncensored or censored) with values
greater than or equal to \(x_{(j)}\), and \(d_j\) denotes the number of
uncensored observations exactly equal to \(x_{(j)}\) (if there are no tied
uncensored observations then \(d_j\) will equal 1 for all values of \(j\)).
(See also Lee and Wang, 2003, pp. 64--69; Michael and Schucany, 1986). By convention,
the estimate of the survival function at a censored observation is set equal to
the estimated value of the survival function at the largest uncensored observation
less than or equal to that censoring level. If there are no uncensored observations
less than or equal to a particular censoring level, the estimate of the survival
function is set to 1 for that censoring level.
Thus the Kaplan-Meier plotting position at the \(i\)'th ordered observation
that is not censored (i.e., \(i \in \Omega\)), is given by:
$$\hat{p}_i = \hat{F}[x_{(i)}] = 1 - \prod_{j \in \Omega, j \le i} \frac{n_j - d_j}{n_j} \;\;\;\;\;\; (6)$$
The plotting position for a censored observation is set equal to the plotting
position associated with the largest uncensored observation less than or equal to
that censoring level. If there are no uncensored observations less than or equal
to a particular censoring level, the plotting position is set to 0 for that
censoring level.
As an example, consider the following right-censored data set:
$$3, \ge4, \ge4, 5, 5, 6$$
The table below shows how the plotting positions are computed.
\(i\) |
\(x_{(i)}\) |
\(n_i\) |
\(d_i\) |
\(\frac{n_i-d_i}{n_i}\) |
Plotting Position |
1 |
\(3\) |
\(6\) |
\(1\) |
\(5/6\) |
\(1 - (5/6) = 0.167\) |
2 |
\(\ge4\) |
|
|
|
|
3 |
\(\ge4\) |
|
|
|
|
4 |
\(5\) |
\(3\) |
\(2\) |
\(1/3\) |
\(1 - (5/6)(1/3) = 0.722\) |
5 |
\(5\) |
|
|
|
\(0.722\) |
Note that for complete data sets, Equation (6) reduces to Equation (3).
Left-Censored Data (censoring.side="left"
)
Gillespie et al. (2010) give formulas for the Kaplan-Meier estimator for the case of
left-cesoring (censoring.side="left"
). In this case, the plotting position
for the \(i\)'th ordered observation, assuming it is not censored, is computed as:
$$\hat{p}_i = \hat{F}[x_{(i)}] = \prod_{j \in \Omega, j > i} \frac{n_j - d_j}{n_j} \;\;\;\;\;\; (7)$$
where \(n_j\) is the number of observations (uncensored or censored) with values
less than or equal to \(x_{(j)}\), and \(d_j\) denotes the number of
uncensored observations exactly equal to \(x_{(j)}\) (if there are no tied
uncensored observations then \(d_j\) will equal 1 for all values of \(j\)).
The plotting position is equal to 1 for the largest uncensored order statistic.
As an example, consider the following left-censored data set:
$$3, <4, <4, 5, 5, 6$$
The table below shows how the plotting positions are computed.
\(i\) |
\(x_{(i)}\) |
\(n_i\) |
\(d_i\) |
\(\frac{n_i-d_i}{n_i}\) |
Plotting Position |
1 |
\(3\) |
\(1\) |
\(1\) |
\(0/1\) |
\(1(5/6)(3/5) = 0.5\) |
2 |
\(<4\) |
|
|
|
|
3 |
\(<4\) |
|
|
|
|
4 |
\(5\) |
\(5\) |
\(2\) |
\(3/5\) |
\(0.833\) |
5 |
\(5\) |
|
|
|
\(1(5/6) = 0.833\) |
Note that for complete data sets, Equation (7) reduces to Equation (3).
Modified Kaplan-Meier Method (prob.method="modified kaplan-meier"
)
(Left-Censored Data Only.) For left-censored data, the modified Kaplan-Meier
method is the same as the Kaplan-Meier method, except that for the largest
uncensored order statistic, the plotting position is not set to 1 but rather is
set equal to the Blom plotting position: \((N - 0.375)/(N + 0.25)\). This method
is useful, for example, when creating Quantile-Quantile plots.
Hazard Plotting Method of Nelson (prob.method="nelson"
)
(Right-Censored Data Only.) For right-censored data, Equation (5) can be
re-written as:
$$\hat{S}[x_{(i)}] = \prod_{j \in \Omega, j \le i} \frac{N-j}{N-j+1}, \;\; i \in \Omega \;\;\;\;\;\; (8)$$
Nelson (1972) proposed the following formula for plotting positions for
the uncensored observations in the context of estimating the hazard function
(see Michael and Schucany,1986, p.469):
$$\hat{p}_i = \hat{F}[x_{(i)}] = 1 - \prod_{j \in \Omega, j \le i} exp(\frac{-1}{N-j+1}) \;\;\;\;\;\; (9)$$
See Lee and Wang (2003) for more information about the hazard function.
As for the Kaplan and Meier (1958) method, the plotting position for a censored
observation is set equal to the plotting position associated with the largest
uncensored observation less than or equal to that censoring level. If there are
no uncensored observations less than or equal to a particular censoring level,
the plotting position is set to 0 for that censoring level.
Generalization of Product-Limit Method, Michael and Schucany
(prob.method="michael-schucany"
)
For complete data sets, the disadvantage of using Equation (3) above to define
plotting positions is that it implies the largest observed value is the maximum
possible value of the distribution (the \(100\)'th percentile). This may be
satisfactory if the underlying distribution is known to be discrete, but it is
usually not satisfactory if the underlying distribution is known to be continuous.
A more frequently used formula for plotting positions for complete data sets is
given by:
$$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{N - 2a + 1} \;\;\;\;\;\; (10)$$
where \(0 \le a \le 1\) (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8,25).
The value of \(a\) is usually chosen so that the plotting positions are
approximately unbiased (i.e., approximate the mean of their distribution) or else
approximate the median value of their distribution (see the help file for
ecdfPlot
). Michael and Schucany (1986) extended this method for
both left- and right-censored data sets.
Right-Censored Data (censoring.side="right"
)
For right-censored data sets, the plotting positions for the uncensored
observations are computed as:
$$\hat{p}_i = 1 - \frac{N-a+1}{N-2a+1} \prod_{j \in \Omega, j \le i} \frac{N-j-a+1}{N-j-a+2} \;\; i \in \Omega \;\;\;\;\;\; (11)$$
Note that the plotting positions proposed by Herd (1960) and Johnson (1964) are a
special case of Equation (11) with \(a=0\). Equation (11) reduces to Equation (10)
in the case of complete data sets. Note that unlike the Kaplan-Meier method,
plotting positions associated with tied uncensored observations are not the same
(just as in the case for complete data using Equation (10)).
As for the Kaplan and Meier (1958) method, for right-censored data the plotting
position for a censored observation is set equal to the plotting position associated
with the largest uncensored observation less than or equal to that censoring level.
If there are no uncensored observations less than or equal to a particular censoring
level, the plotting position is set to 0 for that censoring level.
Left-Censored Data (censoring.side="left"
)
For left-censored data sets the plotting positions are computed as:
$$\hat{p}_i = \frac{N-a+1}{N-2a+1} \prod_{j \in \Omega, j \ge i} \frac{j-a}{j-a+1} \;\; i \in \Omega \;\;\;\;\;\; (12)$$
Equation (12) reduces to Equation (10) in the case of complete data
sets. Note that unlike the Kaplan-Meier method, plotting positions associated with
tied uncensored observations are not the same (just as in the case for complete
data using Equation (10)).
For left-censored data, the plotting position for a censored observation is set
equal to the plotting position associated with the smallest uncensored observation
greater than or equal to that censoring level. If there are no uncensored
observations greater than or equal to a particular censoring level, the plotting
position is set to 1 for that censoring level.
Generalization of Product-Limit Method, Hirsch and Stedinger
(prob.method="hirsch-stedinger"
)
Hirsch and Stedinger (1987) use a slightly different approach than Kaplan and Meier
(1958) and Michael and Schucany (1986) to derive a nonparametric estimate of the
survival function (probability of exceedance) in the context of left-censored data.
First they estimate the value of the survival function at each of the censoring
levels. The value of the survival function for an uncensored observation between
two adjacent censoring levels is then computed by linear interpolation (in the form
of a plotting position). See also Helsel and Cohn (1988).
The discussion below presents an extension of the method of Hirsch and Stedinger
(1987) to the case of right-censored data, and then presents the original derivation
due to Hirsch and Stedinger (1987) for left-censored data.
Right-Censored Data (censoring.side="right"
)
For right-censored data, the survival function is estimated as follows.
For the \(j\)'th censoring level (\(j = 0, 1, \ldots, K\)), write the
value of the survival function as:
\(S(T_j)\) |
\(=\) |
\(Pr[X > T_j]\) |
|
\(=\) |
\(Pr[X > T_{j+1}] + Pr[T_j < X \le T_{j+1}]\) |
|
\(=\) |
\(S(T_{j+1}) + Pr[T_j < X \le T_{j+1} | X > T_j] Pr[X > T_j]\) |
where
$$T_0 = -\infty, \;\;\;\;\;\; (14)$$
$$T_{K+1} = \infty \;\;\;\;\;\; (15)$$
Now set
\(A_j\) |
\(=\) |
# uncensored observations in \((T_j, T_{j+1}] \;\;\;\;\;\; (16)\) |
for \(j = 0, 1, \ldots, K\). Then the method of moments estimator of the
conditional probability in Equation (13) is given by:
$$\widehat{Pr}[T_j < X \le T_{j+1} | X > T_j] = \frac{A_j}{A_j + B_j} \;\;\;\;\;\; (18)$$
Hence, by equations (13) and (18) we have
$$\hat{S}(T_j) = \hat{S}(T_{j+1}) + (\frac{A_j}{A_j + B_j}) \hat{S}(T_{j}) \;\;\;\;\;\; (19)$$
which can be rewritten as:
$$\hat{S}(T_{j+1}) = \hat{S}(T_j) [1 - (\frac{A_j}{A_j + B_j})] \;\;\;\;\;\; (20)$$
Equation (20) can be solved interatively for \(j = 1, 2, \ldots, K\). Note that
$$\hat{S}(T_0) = \hat{S}(-\infty) = S(-\infty) = 1 \;\;\;\;\;\; (21)$$
$$\hat{S}(T_{K+1}) = \hat{S}(\infty) = S(\infty) = 0 \;\;\;\;\;\; (22)$$
Once the values of the survival function at the censoring levels are computed, the
plotting positions for the \(A_j\) uncensored observations in the interval
\((T_J, T_{j+1}]\) (\(j = 0, 1, \ldots, K\)) are computed as
$$\hat{p}_i = [1 - \hat{S}(T_j)] + [\hat{S}(T_j) - \hat{S}(T_{j+1})] \frac{r-a}{A_j - 2a + 1} \;\;\;\;\;\; (23)$$
where \(a\) denotes the plotting position constant, \(0 \le a \le 1\), and
\(r\) denotes the rank of the \(i\)'th observation among the \(A_j\)
uncensored observations in the interval \((T_J, T_{j+1}]\).
(Tied observations are given distinct ranks.)
For the \(c_j\) observations censored at censoring level \(T_j\)
(\(j = 1, 2, \ldots, K\)), the plotting positions are computed as:
$$\hat{p}_i = 1 - [\hat{S}(T_j) \frac{r-a}{c_j - 2a + 1}] \;\;\;\;\;\; (24)$$
where \(r\) denotes the rank of the \(i\)'th observation among the \(c_j\)
observations censored at censoring level \(T_j\). Note that all the
observations censored at the same censoring level are given distinct ranks,
even though there is no way to distinguish between them.
Left-Censored Data (censoring.side="left"
)
For left-censored data, Hirsch and Stedinger (1987) modify the definition of the
survival function as follows:
$$S^*(t) = Pr[X \ge t] \;\;\;\;\;\; (25)$$
For continuous distributions, the functions in Equations (4) and (25) are identical.
Hirsch and Stedinger (1987) show that for the \(j\)'th censoring level
(\(j = 0, 1, \ldots, K\)), the value of the survival function can be written as:
\(S(T_j)\) |
\(=\) |
\(Pr[X \ge T_j]\) |
|
\(=\) |
\(Pr[X \ge T_{j+1}] + Pr[T_j \le X < T_{j+1}]\) |
|
\(=\) |
\(S^*(T_{j+1}) + Pr[T_j \le X < T_{j+1} | X < T_{j+1}] Pr[X < T_{j+1}]\) |
where \(T_0\) and \(T_{K+1}\) are defined in Equations (14) and (15).
Now set
\(A_j\) |
\(=\) |
# uncensored observations in \([T_j, T_{j+1}) \;\;\;\;\;\; (27)\) |
for \(j = 0, 1, \ldots, K\). Then the method of moments estimator of the
conditional probability in Equation (26) is given by:
$$Pr[T_j \le X < T_{j+1} | X < T_{j+1}] = \frac{A_j}{A_j + B_j} \;\;\;\;\;\; (29)$$
Hence, by Equations (26) and (29) we have
$$\hat{S}(T_j) = \hat{S}(T_{j+1}) + (\frac{A_j}{A_j + B_j}) \hat{S}(T_{j}) \;\;\;\;\;\; (30)$$
which can be solved interatively for \(j = 1, 2, \ldots, K\). Note that
$$\widehat{S^*}(T_{K+1}) = \widehat{S^*}(\infty) = S^*(\infty) = 0 \;\;\;\;\;\; (31)$$
$$\widehat{S^*}(T_0) = \widehat{S^*}(-\infty) = S^*(-\infty) = 1 \;\;\;\;\;\; (32)$$
Once the values of the survival function at the censoring levels are computed, the
plotting positions for the \(A_j\) uncensored observations in the interval
\([T_J, T_{j+1})\) (\(j = 0, 1, \ldots, K\)) are computed as
$$\hat{p}_i = [1 - \widehat{S^*}(T_j)] + [\widehat{S^*}(T_j) - \widehat{S^*}(T_{j+1})] \frac{r-a}{A_j - 2a + 1} \;\;\;\;\;\; (33)$$
where \(a\) denotes the plotting position constant, \(0 \le a \le 0.5\), and
\(r\) denotes the rank of the \(i\)'th observation among the \(A_j\)
uncensored observations in the interval \([T_J, T_{j+1})\).
(Tied observations are given distinct ranks.)
For the \(c_j\) observations censored at censoring level \(T_j\)
(\(j = 1, 2, \ldots, K\)), the plotting positions are computed as:
$$\hat{p}_i = [1 - \widehat{S^*}(T_j)] \frac{r-a}{c_j - 2a + 1} \;\;\;\;\;\; (34)$$
where \(r\) denotes the rank of the \(i\)'th observation among the \(c_j\)
observations censored at censoring level \(T_j\). Note that all the
observations censored at the same censoring level are given distinct ranks,
even though there is no way to distinguish between them.