ecdfPlot: Empirical Cumulative Distribution Function Plot

Description

Produce an empirical cumulative distribution function plot.

Usage

ecdfPlot(x, discrete = FALSE, 
    prob.method = ifelse(discrete, "emp.probs", "plot.pos"), 
    plot.pos.con = 0.375, plot.it = TRUE, add = FALSE, ecdf.col = "black", 
    ecdf.lwd = 3 * par("cex"), ecdf.lty = 1, curve.fill = FALSE, 
    curve.fill.col = "cyan", ..., type = ifelse(discrete, "s", "l"), 
    main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)

Arguments

x

numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.

discrete

logical scalar indicating whether the assumed parent distribution of x is discrete (discrete=TRUE) or continuous (discrete=FALSE; the default).

prob.method

character string indicating what method to use to compute the plotting positions (empirical probabilities). Possible values are "plot.pos" (plotting positions, the default if discrete=FALSE) and "emp.probs" (empirical probabilities, the default if discrete=TRUE). See the DETAILS section for more explanation.

plot.pos.con

numeric scalar between 0 and 1 containing the value of the plotting position constant. The default value is plot.pos.con=0.375. See the DETAILS section for more information. This argument is ignored if prob.method="emp.probs".

plot.it

logical scalar indicating whether to produce a plot or add to the current plot (see add) on the current graphics device. The default value is plot.it=TRUE.

add

logical scalar indicating whether to add the empirical cdf to the current plot (add=TRUE) or generate a new plot (add=FALSE; the default). This argument is ignored if plot.it=FALSE.

ecdf.col

a numeric scalar or character string determining the color of the empirical cdf line or points. The default value is ecdf.col="black". See the entry for col in the help file for par for more information.

ecdf.lwd

a numeric scalar determining the width of the empirical cdf line. The default value is ecdf.lwd=3*par("cex"). See the entry for lwd in the help file for par for more information.

ecdf.lty

a numeric scalar determining the line type of the empirical cdf line. The default value is ecdf.lty=1. See the entry for lty in the help file for par for more information.

curve.fill

a logical scalar indicating whether to fill in the area below the empirical cdf curve with the color specified by curve.fill.col. The default value is curve.fill=FALSE.

curve.fill.col

a numeric scalar or character string indicating what color to use to fill in the area below the empirical cdf curve. The default value is curve.fill.col="cyan". This argument is ignored if curve.fill=FALSE.

type, main, xlab, ylab, xlim, ylim, …

additional graphical parameters (see lines and par). In particular, the argument type specifies how the empirical cdf is drawn. By default, the function ecdfPlot plots a step function (type="s") when discrete=TRUE, and plots a straight line between points (type="l") when discrete=FALSE. The user may override these defaults by supplying the graphics parameter type (type="s" for a step function, type="l" for linear interpolation, type="p" for points only, etc.).

Value

ecdfPlot invisibly returns a list with the following components:

Order.Statistics

numeric vector of the ordered observations.

Cumulative.Probabilities

numeric vector of the associated plotting positions.
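As a minimal sketch of how the returned list might be inspected without drawing anything: the example below assumes that ecdfPlot is provided by the EnvStats package (an assumption, since this page does not name the package) and that the package is installed. The names x and res are illustrative only.

```r
# Illustrative data; any numeric vector works.
x <- c(2.3, 0.7, 1.9, 4.1, 3.2)

res <- NULL
# Assumption: ecdfPlot lives in the EnvStats package.
if (requireNamespace("EnvStats", quietly = TRUE)) {
  # plot.it = FALSE suppresses plotting. The list is returned invisibly,
  # so it has to be assigned before its components can be inspected.
  res <- EnvStats::ecdfPlot(x, plot.it = FALSE)
  print(res$Order.Statistics)
  print(res$Cumulative.Probabilities)
}
```

If the package is not available, res stays NULL and nothing is printed.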

Details

The cumulative distribution function (cdf) of a random variable $X$ is the function $F$ such that $$F(x) = Pr(X \le x) \;\;\;\;\;\; (1)$$ for all values of $x$. That is, if $p = F(x)$, then $p$ is the proportion of the population that is less than or equal to $x$, and $x$ is called the $p$'th quantile, or the 100$p$'th percentile. A plot of quantiles on the $x$-axis (i.e., the possible values of the random variable $X$) vs. the fraction of the population less than or equal to that number on the $y$-axis is called the cumulative distribution function plot, and the $y$-axis is usually labeled as the “cumulative probability” or “cumulative frequency”.
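The relationship between a cdf value and a quantile in equation (1) can be illustrated with a known distribution. The sketch below uses the standard normal purely as an example population; the variable names are illustrative.

```r
# Standard normal used purely as an illustrative population.
x <- 1.5
p <- pnorm(x)   # F(x) = Pr(X <= x)
q <- qnorm(p)   # the p'th quantile, which recovers x

cat("p =", p, "\n")
cat("q =", q, "\n")
```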

When we have a sample of data from some population, we usually do not know what percentiles our observations correspond to because we do not know the form of the cumulative distribution function $F$, so we have to use the sample data to estimate the cdf $F$. An empirical cumulative distribution function (ecdf) plot, also called a quantile plot, is a plot of the observed quantiles (i.e., the ordered observations) on the $x$-axis vs. the estimated cumulative probabilities on the $y$-axis (Chambers et al., 1983, pp. 11-19; Cleveland, 1993, pp. 17-20; Cleveland, 1994, pp. 136-139; Helsel and Hirsch, 1992, pp. 21-24).

(Note: Some authors (e.g., Chambers et al., 1983, pp.11-16; Cleveland, 1993, pp.17-20) reverse the axes on a quantile plot, i.e., the observed order statistics from the random sample are on the $y$-axis and the estimated cumulative probabilities are on the $x$-axis.)

The empirical cumulative distribution function (ecdf) is an estimate of the cdf based on a random sample of $n$ observations from the distribution. Let $x_1, x_2, \ldots, x_n$ denote the $n$ observations, and let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote the ordered observations (i.e., the order statistics). The cdf is usually estimated by either the empirical probabilities estimator or the plotting-position estimator. The empirical probabilities estimator is given by: $$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{\#[x_j \le x_{(i)}]}{n} \;\;\;\;\;\; (2)$$ where $\#[x_j \le x_{(i)}]$ denotes the number of observations less than or equal to $x_{(i)}$. The plotting-position estimator is given by: $$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{n - 2a + 1} \;\;\;\;\;\; (3)$$ where $0 \le a \le 1$ (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8,25).
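Equations (2) and (3) can be computed directly in base R. The following is a sketch on a small made-up sample, not the internal code of ecdfPlot; the data and variable names are illustrative, and how ecdfPlot itself treats tied values under the plotting-position estimator is not shown here.

```r
x <- c(4.2, 1.1, 3.3, 1.1, 2.7)   # illustrative sample containing one tie
n <- length(x)
x.sorted <- sort(x)

# Equation (2): proportion of observations <= each order statistic.
emp.probs <- sapply(x.sorted, function(v) sum(x <= v) / n)

# Equation (3) with a = 0.375.
a <- 0.375
i <- seq_len(n)
plot.pos <- (i - a) / (n - 2 * a + 1)

# Base R's ppoints() implements the same formula.
stopifnot(isTRUE(all.equal(plot.pos, ppoints(n, a = a))))

print(cbind(x.sorted, emp.probs, plot.pos))
```

In this sample the empirical probabilities estimator assigns probability 1 to the largest observation, while the largest plotting position stays below 1.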

For any value $x$ such that $x_{(1)} < x < x_{(n)}$, the ecdf is usually defined as either a step function: $$\hat{F}(x) = \hat{F}[x_{(i)}], \qquad x_{(i)} \le x < x_{(i+1)} \;\;\;\;\;\; (4)$$ (e.g., D'Agostino, 1986a), or linear interpolation between order statistics is used: $$\hat{F}(x) = (1-r)\hat{F}[x_{(i)}] + r\hat{F}[x_{(i+1)}], \qquad x_{(i)} \le x < x_{(i+1)} \;\;\;\;\;\; (5)$$ where $$r = \frac{x - x_{(i)}}{x_{(i+1)} - x_{(i)}} \;\;\;\;\;\; (6)$$ (e.g., Chambers et al., 1983). For the step function version, the ecdf stays flat until it hits a value on the $x$-axis corresponding to one of the order statistics, then it makes a jump. For the linear interpolation version, the ecdf plot looks like lines connecting the points. By default, the function ecdfPlot uses the step function version when discrete=TRUE, and the linear interpolation version when discrete=FALSE. The user may override these defaults by supplying the graphics parameter type (type="s" for a step function, type="l" for linear interpolation, type="p" for points only, etc.).
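The two definitions in equations (4) through (6) can be compared numerically with base R's approxfun. This is a sketch under illustrative data, not the plotting code used by ecdfPlot; F.step and F.lin are hypothetical helper names.

```r
x.sorted <- c(1, 2, 4, 7)                        # illustrative order statistics
p.hat <- ppoints(length(x.sorted), a = 0.375)    # equation (3)

# Equation (4): step function, constant on [x_(i), x_(i+1)).
# f = 0 takes the value at the left end of each interval.
F.step <- approxfun(x.sorted, p.hat, method = "constant", f = 0)

# Equations (5) and (6): linear interpolation between order statistics.
F.lin <- approxfun(x.sorted, p.hat, method = "linear")

x0 <- 3   # lies halfway between x_(2) = 2 and x_(3) = 4, so r = 0.5
cat("step:  ", F.step(x0), "\n")
cat("linear:", F.lin(x0), "\n")
```

At x0 the step version returns the estimate at $x_{(2)}$, while the interpolated version returns the average of the estimates at $x_{(2)}$ and $x_{(3)}$.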

The empirical probabilities estimator is intuitively appealing. This is the estimator used when prob.method="emp.probs". The disadvantage of this estimator is that it implies the largest observed value is the maximum possible value of the distribution (i.e., the 100'th percentile). This may be satisfactory if the underlying distribution is known to be discrete, but it is usually not satisfactory if the underlying distribution is known to be continuous.

The plotting-position estimator with various values of $a$ is often used when the goal is to produce a probability plot (see qqPlot) rather than an empirical cdf plot. It is used to compute the estimated expected values or medians of the order statistics for a probability plot. This is the estimator used when prob.method="plot.pos". The argument plot.pos.con refers to the variable $a$. Based on certain principles from statistical theory, certain values of the constant $a$ make sense for specific underlying distributions (see the help file for qqPlot for more information).
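To get a feel for how the constant $a$ affects the result, the sketch below tabulates equation (3) for a few values of $a$ using base R's ppoints, which uses the same formula. The sample size and the particular values of $a$ are arbitrary choices for illustration.

```r
n <- 5
a.values <- c(0, 0.375, 0.5, 1)

# One column of plotting positions per value of a.
pos <- sapply(a.values, function(a) ppoints(n, a = a))
colnames(pos) <- paste0("a=", a.values)

print(round(pos, 3))
```

Larger values of $a$ push the first and last plotting positions closer to 0 and 1; the middle position stays at 0.5 for every choice of $a$ when $n$ is odd.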

Because $x$ is a random sample, the empirical cdf changes from sample to sample, and the variability in these estimates can be dramatic for small sample sizes.
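A small simulation gives a rough sense of this variability. The sketch below uses base R's ecdf (the empirical probabilities estimator) rather than ecdfPlot, draws from a standard normal as an illustrative population, and measures the gap only at the observed values; the helper max.gap, the seed, and the sample sizes are all arbitrary choices.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

# Largest vertical gap, evaluated at the observations, between the ecdf
# and the true standard normal cdf for one sample of size n.
max.gap <- function(n) {
  x <- sort(rnorm(n))
  F.hat <- ecdf(x)
  max(abs(F.hat(x) - pnorm(x)))
}

gap.small <- replicate(200, max.gap(10))
gap.large <- replicate(200, max.gap(1000))

cat("mean gap, n = 10:  ", mean(gap.small), "\n")
cat("mean gap, n = 1000:", mean(gap.large), "\n")
```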

References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.

Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.

D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.

Examples

# NOT RUN {
  # Generate 20 observations from a normal distribution with 
  # mean=0 and sd=1 and create an ecdf plot. 
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(250) 
  x <- rnorm(20) 
  dev.new()
  ecdfPlot(x)

  #----------

  # Repeat the above example, but fill in the area under the 
  # empirical cdf curve.

  dev.new()
  ecdfPlot(x, curve.fill = TRUE)

  #----------

  # Repeat the above example, but plot only the points.

  dev.new()
  ecdfPlot(x, type = "p")

  #----------

  # Repeat the above example, but force a step function.

  dev.new()
  ecdfPlot(x, type = "s")

  #----------

  # Clean up
  rm(x)

  #-------------------------------------------------------------------------------------

  # The guidance document USEPA (1994b, pp. 6.22--6.25) 
  # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) 
  # concentrations (in parts per billion) from soil samples 
  # at a Reference area and a Cleanup area.  These data are stored 
  # in the data frame EPA.94b.tccb.df.  
  #
  # Create an empirical CDF plot for the reference area data.
  
  dev.new()
  with(EPA.94b.tccb.df, 
    ecdfPlot(TcCB[Area == "Reference"], xlab = "TcCB (ppb)"))

  #==========

  # Clean up
  #---------
  graphics.off()
# }
