qqPlotCensored: Quantile-Quantile (Q-Q) Plot for Type I Censored Data

Description

Produces a quantile-quantile (Q-Q) plot, also called a probability plot, for Type I censored data.

Usage

qqPlotCensored(x, censored, censoring.side = "left", 
    prob.method = "michael-schucany", plot.pos.con = NULL, 
    distribution = "norm", param.list = list(mean = 0, sd = 1), 
    estimate.params = plot.type == "Tukey Mean-Difference Q-Q", 
    est.arg.list = NULL, plot.type = "Q-Q", plot.it = TRUE, 
    equal.axes = qq.line.type == "0-1" || estimate.params, 
    add.line = FALSE, qq.line.type = "least squares", 
    duplicate.points.method = "standard", points.col = 1, line.col = 1, 
    line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, 
    include.cen = FALSE, cen.pch = ifelse(censoring.side == "left", 6, 2), 
    cen.cex = par("cex"), cen.col = 4, ..., main = NULL, xlab = NULL, 
    ylab = NULL, xlim = NULL, ylim = NULL)

Arguments

numeric vector of observations that is assumed to represent a sample from the hypothesized distribution specifed by distribution. Missing (NA), undefined (NaN), and infinite (Inf, -Inf

censored

numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to elements of

censoring.side

character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right".

prob.method

character string indicating what method to use to compute the plotting positions (empirical probabilities). Possible values are "kaplan-meier" (product-limit method of Kaplan and Meier (1958)), "modified kaplan-meier"

plot.pos.con

numeric scalar between 0 and 1 containing the value of the plotting position constant. The default value is plot.pos.con=0.375. See the DETAILS section for more information. This argument is used only if prob.method is

distribution

a character string denoting the distribution abbreviation. The default value is distribution="norm". See the help file for Distribution.df for a list of possible distribution

param.list

a list with values for the parameters of the distribution. The default value is param.list=list(mean=0, sd=1). See the help file for Distribution.df for the names and possible

estimate.params

a logical scalar indicating whether to compute quantiles based on estimating the distribution parameters (estimate.params=TRUE) or using the known distribution parameters specified in param.list (estimate.params=FALSE, th

est.arg.list

a list whose components are optional arguments associated with the function used to estimate the parameters of the assumed distribution (see the section Estimating Distribution Parameters in the help file

plot.type

a character string denoting the kind of plot. Possible values are "Q-Q" (Quantile-Quantile plot, the default) and "Tukey Mean-Difference Q-Q" (Tukey mean-difference Q-Q plot). This argument may be abbreviated (e.g.,

plot.it

a logical scalar indicating whether to create a plot on the current graphics device. The default value is plot.it=TRUE.

equal.axes

a logical scalar indicating whether to use the same range on the $x$- and $y$-axes when plot.type="Q-Q". The default value is TRUE if qq.line.type="0-1" or estimate.params=TRUE, otherwise it is

add.line

a logical scalar indicating whether to add a line to the plot. If add.line=TRUE and plot.type="Q-Q", a line determined by the value of qq.line.type is added to the plot. If add.line=TRUE and <

qq.line.type

character string determining what kind of line to add to the Q-Q plot. Possible values are "least squares" (the default), "0-1" and "robust". For the value "least squares", a least squares line

duplicate.points.method

a character string denoting how to plot points with duplicate $(x,y)$ values. Possible values are "standard" (the default), "jitter", and "number". For the value "standard", a single plotting s

points.col

a numeric scalar or character string determining the color of the points in the plot. The default value is points.col=1. See the entry for col in the help file for par for mo

line.col

a numeric scalar or character string determining the color of the line in the plot. The default value is points.col=1. See the entry for col in the help file for par for more

line.lwd

a numeric scalar determining the width of the line in the plot. The default value is line.lwd=par("cex"). See the entry for lwd in the help file for par for more information.

line.lty

a numeric scalar determining the line type of the line in the plot. The default value is line.lty=1. See the entry for lty in the help file for par for more information. Thi

digits

a scalar indicating how many significant digits to print for the distribution parameters. The default value is digits=.Options$digits.

include.cen

logical scalar indicating whether to include censored values in the plot. The default value is include.cen=FALSE. If include.cen=TRUE, censored values are plotted using the plotting character indicated by the argument <

cen.pch

numeric scalar or character string indicating the plotting character to use to plot censored values. The default value is cen.pch=2 (hollow triangle pointing up) when censoring.side="right", and cen.pch=6 (h

cen.cex

numeric scalar that determines the size of the plotting character used to plot censored values. The default value is the current value of the cex graphics parameter. See the entry for cex in the help file for

cen.col

numeric scalar or character string that determines the color of the plotting character used to plot censored values. The default value is cen.col=4. See the entry for col in the help file for

main, xlab, ylab, xlim, ylim, ...

additional graphical parameters (see par).

Value

qqPlotCensored returns a list with the following components:
xnumeric vector of $x$-coordinates for the plot. When plot.type="Q-Q" these are the quantiles from the theoretical distribution. When plot.type="Tukey Mean-Difference Q-Q" these are the averages of the observed and theoretical quantiles.
ynumeric vector of $y$-coordinates for the plot. When plot.type="Q-Q" these are the observed quantiles (order statistics). When plot.type="Tukey Mean-Difference Q-Q" these are the differences between the observed quantiles (order statistics) and the theoretical quantiles.
Order.Statisticsnumeric vector of the ordered observations. When plot.type="Q-Q" this component is exactly the same as the component y.
Cumulative.Probabilitiesnumeric vector of the plotting positions associated with the order statistics.
Censoredlogical vector indicating which of the ordered observations are censored.
Censoring.Sidecharacter string indicating whether the data are left- or right-censored. This is same value as the argument censoring.side.
Prob.Methodcharacter string indicating what method was used to compute the plotting positions. This is the same value as the argument prob.method.
Optional Component (only present when prob.method="michael-schucany" or prob.method="hirsch-stedinger"):
Plot.Pos.Connumeric scalar containing the value of the plotting position constant that was used. This is the same as the argument plot.pos.con.

Details

The function qqPlotCensored does exactly the same thing as qqPlot (when the argument y is not supplied to qqPlot), except qqPlotCensored calls the function ppointsCensored to compute the plotting positions (estimated cumulative probabilities). The vector x is assumed to be a sample from the probability distribution specified by the argument distribution (and param.list if estimate.params=FALSE). When plot.type="Q-Q", the quantiles of x are plotted on the $y$-axis against the quantiles of the assumed distribution on the $x$-axis. When plot.type="Tukey Mean-Difference Q-Q", the difference of the quantiles is plotted on the $y$-axis against the mean of the quantiles on the $x$-axis. When prob.method="kaplan-meier" and censoring.side="left" and the assumed distribution has a maximum support of infinity (Inf; e.g., the normal or lognormal distribution), the point invovling the largest value of x is not plotted because it corresponds to an estimated cumulative probability of 1 which corresponds to an infinite plotting position. When prob.method="modified kaplan-meier" and censoring.side="left", the estimated cumulative probability associated with the maximum value is modified from 1 to be $(N - .375)/(N + .25)$ where $N$ denotes the sample size (i.e., the Blom plotting position) so that the point associated with the maximum value can be displayed.

References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16. Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp. D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62. Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64--S70. Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey. Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004. Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727. Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481. Lee, E.T., and J. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley and Sons, New York. Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496. Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966. USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15. USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

Examples

Run this code

# Generate 20 observations from a normal distribution with mean=20 and sd=5, 
  # censor all observations less than 18, then generate a Q-Q plot assuming 
  # a normal distribution for the complete data set and the censored data set.  
  # Note that the Q-Q plot for the censored data set starts at the first ordered 
  # uncensored observation, and that for values of x > 18 the two Q-Q plots are 
  # exactly the same.  This is because there is only one censoring level and 
  # no uncensored observations fall below the censored observations. 
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(333) 
  x <- rnorm(20, mean=20, sd=5) 
  censored <- x < 18

  sum(censored) 
  #[1] 7 

  new.x <- x 
  new.x[censored] <- 18

  dev.new()
  qqPlot(x, ylim = range(pretty(x)), 
    main = "Q-Q Plot for
Complete Data Set") 

  dev.new()
  qqPlotCensored(new.x, censored, ylim = range(pretty(x)), 
    main="Q-Q Plot for
Censored Data Set")

  # Clean up
  #---------
  rm(x, censored, new.x)

  #------------------------------------------------------------------------------------

  # Example 15-1 of USEPA (2009, page 15-10) gives an example of
  # computing plotting positions based on censored manganese 
  # concentrations (ppb) in groundwater collected at 5 monitoring
  # wells.  The data for this example are stored in 
  # EPA.09.Ex.15.1.manganese.df.  Here we will create a Q-Q
  # plot based on the Kaplan-Meier method.  First we'll assume 
  # a normal distribution, then a lognormal distribution, then a 
  # gamma distribution.

  EPA.09.Ex.15.1.manganese.df
  #   Sample   Well Manganese.Orig.ppb Manganese.ppb Censored
  #1       1 Well.1                 <5           5.0     TRUE
  #2       2 Well.1               12.1          12.1    FALSE
  #3       3 Well.1               16.9          16.9    FALSE
  #4       4 Well.1               21.6          21.6    FALSE
  #5       5 Well.1                 <2           2.0     TRUE
  #...
  #21      1 Well.5               17.9          17.9    FALSE
  #22      2 Well.5               22.7          22.7    FALSE
  #23      3 Well.5                3.3           3.3    FALSE
  #24      4 Well.5                8.4           8.4    FALSE
  #25      5 Well.5                 <2           2.0     TRUE

  # Assume normal distribution
  #---------------------------

  dev.new()
  with(EPA.09.Ex.15.1.manganese.df, 
    qqPlotCensored(Manganese.ppb, Censored, 
      prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE,
      main = paste("Normal Q-Q Plot of Manganese Data", 
          "Based on Kaplan-Meier Plotting Positions", sep = "")))

  # Include max value in the plot
  #------------------------------

  dev.new()
  with(EPA.09.Ex.15.1.manganese.df, 
    qqPlotCensored(Manganese.ppb, Censored, 
      prob.method = "modified kaplan-meier", points.col = "blue", 
      add.line = TRUE,
      main = paste("Normal Q-Q Plot of Manganese Data", 
          "Based on Kaplan-Meier Plotting Positions",
          "(Max Included)", sep = "")))
  

  # Assume lognormal distribution
  #------------------------------

  dev.new()
  with(EPA.09.Ex.15.1.manganese.df, 
    qqPlotCensored(Manganese.ppb, Censored, dist = "lnorm", 
      prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE,
      main = paste("Lognormal Q-Q Plot of Manganese Data", 
          "Based on Kaplan-Meier Plotting Positions", sep = "")))

  # Include max value in the plot
  #------------------------------

  dev.new()
  with(EPA.09.Ex.15.1.manganese.df, 
    qqPlotCensored(Manganese.ppb, Censored, dist = "lnorm", 
      prob.method = "modified kaplan-meier", points.col = "blue", 
      add.line = TRUE,
      main = paste("Lognormal Q-Q Plot of Manganese Data", 
          "Based on Kaplan-Meier Plotting Positions",
          "(Max Included)", sep = "")))


  # The lognormal distribution appears to be a better fit.
  # Now create a Q-Q plot assuming a gamma distribution.  Here we'll 
  # need to set estimate.params=TRUE.

  dev.new()
  with(EPA.09.Ex.15.1.manganese.df, 
    qqPlotCensored(Manganese.ppb, Censored, dist = "gamma", 
      estimate.params = TRUE, prob.method = "kaplan-meier", 
      points.col = "blue", add.line = TRUE,
      main = paste("Gamma Q-Q Plot of Manganese Data", 
          "Based on Kaplan-Meier Plotting Positions", sep = "")))

  # Include max value in the plot
  #------------------------------

  dev.new()
  with(EPA.09.Ex.15.1.manganese.df, 
    qqPlotCensored(Manganese.ppb, Censored, dist = "gamma", 
      estimate.params = TRUE, prob.method = "modified kaplan-meier", 
      points.col = "blue", add.line = TRUE,
      main = paste("Gamma Q-Q Plot of Manganese Data", 
          "Based on Kaplan-Meier Plotting Positions",
          "(Max Included)", sep = "")))

  #==========

  # Clean up
  #---------
  graphics.off()

Run the code above in your browser using DataLab