If y is not supplied, the vector x is assumed to be a sample from the
probability distribution specified by the argument distribution (and
param.list if estimate.params=FALSE). When plot.type="Q-Q", the quantiles of
x are plotted on the \(y\)-axis against the quantiles of the assumed
distribution on the \(x\)-axis.
If y is supplied and plot.type="Q-Q", the empirical quantiles of y are
plotted against the empirical quantiles of x.
When plot.type="Tukey Mean-Difference Q-Q", the difference of the quantiles is
plotted on the \(y\)-axis against the mean of the quantiles on the \(x\)-axis.
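The two plot types can be illustrated with a minimal base R sketch that does not call qqPlot itself. It assumes two samples of equal size and takes the difference as the quantiles of y minus the quantiles of x; that sign convention and all variable names are illustrative assumptions, not a statement of how qqPlot is implemented.

```r
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)
y <- rnorm(50, mean = 11, sd = 2)

qx <- sort(x)  # empirical quantiles of x
qy <- sort(y)  # empirical quantiles of y

# Tukey mean-difference coordinates (sign convention assumed: y minus x)
md_mean <- (qx + qy) / 2
md_diff <- qy - qx

if (interactive()) {
  # Q-Q plot: quantiles of y against quantiles of x, with the 0-1 line
  plot(qx, qy, xlab = "Quantiles of x", ylab = "Quantiles of y")
  abline(0, 1)

  # Tukey mean-difference Q-Q plot: agreement shows up as points near 0
  plot(md_mean, md_diff, xlab = "Mean of quantiles",
       ylab = "Difference of quantiles")
  abline(h = 0)
}
```

In the mean-difference version the reference line is horizontal, which can make small departures easier to see than departures from a sloped line.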
Special Distributions
When y is not supplied and the argument distribution specifies one of the
following distributions, the function qqPlot behaves in the manner described
below.
"lnorm"
Lognormal Distribution. The log-transformed quantiles are
plotted against quantiles from a Normal (Gaussian) distribution.
"lnormAlt"
Lognormal Distribution (alternative parameterization).
The untransformed quantiles are plotted against quantiles from a
Lognormal distribution.
"lnorm3"
Three-Parameter Lognormal Distribution. The quantiles of log(x-threshold)
are plotted against quantiles from a Normal (Gaussian) distribution.
The value of threshold is either specified in the argument param.list or,
if estimate.params=TRUE, estimated.
"zmnorm"
Zero-Modified Normal Distribution. The quantiles of the
non-zero values (i.e., x[x!=0]) are plotted against quantiles from a Normal
(Gaussian) distribution.
"zmlnorm"
Zero-Modified Lognormal Distribution. The quantiles of the
log-transformed positive values (i.e., log(x[x>0])) are plotted against
quantiles from a Normal (Gaussian) distribution.
"zmlnormAlt"
Zero-Modified Lognormal Distribution (alternative parameterization).
The quantiles of the untransformed positive values (i.e., x[x>0]) are
plotted against quantiles from a Lognormal distribution.
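The transformations described above can be sketched in base R. This is only an illustration of which values end up on each axis: the plotting position constant (0.375), the use of a standard Normal for the theoretical quantiles, and the fitted meanlog and sdlog values are assumptions made for the example, and the names are hypothetical.

```r
set.seed(1)
x <- rlnorm(40, meanlog = 1, sdlog = 0.5)
p <- ppoints(length(x), a = 0.375)  # plotting positions (constant assumed)

# "lnorm": log-transformed data against Normal quantiles
lnorm_pts <- data.frame(theoretical = qnorm(p),
                        observed    = sort(log(x)))

# "lnormAlt": untransformed data against Lognormal quantiles
# (parameters estimated on the log scale purely for illustration)
lnormAlt_pts <- data.frame(
  theoretical = qlnorm(p, meanlog = mean(log(x)), sdlog = sd(log(x))),
  observed    = sort(x)
)

# "zmlnorm": only the positive values are used, on the log scale
x_zm <- c(rep(0, 5), x)
p_zm <- ppoints(sum(x_zm > 0), a = 0.375)
zmlnorm_pts <- data.frame(theoretical = qnorm(p_zm),
                          observed    = sort(log(x_zm[x_zm > 0])))
```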
Explanation of Q-Q Plots
A probability plot or quantile-quantile (Q-Q) plot
is a graphical display invented by Wilk and Gnanadesikan (1968) to compare a
data set to a particular probability distribution or to compare it to another
data set. The idea is that if two population distributions are exactly the same,
then they have the same quantiles (percentiles), so a plot of the quantiles for
the first distribution vs. the quantiles for the second distribution will fall
on the 0-1 line (i.e., the straight line \(y = x\) with intercept 0 and slope 1).
If the two distributions have the same shape and spread but different locations,
then the plot of the quantiles will fall on the line \(y = x + b\)
(parallel to the 0-1 line) where \(b\) denotes the difference in locations.
If the distributions have different locations and differ by a multiplicative
constant \(m\), then the plot of the quantiles will fall on the line
\(y = mx + b\) (D'Agostino, 1986a, p. 25; Helsel and Hirsch, 1986, p. 42).
Various kinds of differences between distributions will yield various kinds of
deviations from a straight line.
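The straight-line relationships above can be checked exactly with theoretical quantiles. The sketch below uses Normal distributions as an example; the values of b and m are arbitrary illustrative choices.

```r
p  <- seq(0.05, 0.95, by = 0.05)
q0 <- qnorm(p)              # quantiles of the reference distribution

b <- 3                      # difference in location
m <- 2                      # multiplicative (scale) constant

q_same  <- qnorm(p)                     # identical distribution: y = x
q_shift <- qnorm(p, mean = b)           # same shape and spread: y = x + b
q_both  <- qnorm(p, mean = b, sd = m)   # location and scale:    y = m x + b

all.equal(q_same,  q0)
all.equal(q_shift, q0 + b)
all.equal(q_both,  m * q0 + b)
```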
Comparing Observations to a Hypothesized Distribution
Let \(\underline{x} = x_1, x_2, \ldots, x_n\) denote the observations
in a random sample of size \(n\) from some unknown distribution with
cumulative distribution function \(F()\), and let
\(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\) denote the ordered observations.
Depending on the particular formula used for the empirical cdf
(see ecdfPlot), the \(i\)'th order statistic is an
estimate of the \(i/(n+1)\)'th, \((i-0.5)/n\)'th, etc., quantile.
For the moment, assume the \(i\)'th order statistic is an estimate of the
\(i/(n+1)\)'th quantile, that is:
$$\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i}{n+1} \;\;\;\;\;\; (1)$$
so
$$x_{(i)} \approx F^{-1}(\hat{p}_i) \;\;\;\;\;\; (2)$$
If we knew the form of the true cdf \(F\), then the plot of
\(x_{(i)}\) vs. \(F^{-1}(\hat{p}_i)\) would form approximately
a straight line based on Equation (2) above. A probability plot is a plot of
\(x_{(i)}\) vs. \(F_0^{-1}(\hat{p}_i)\), where \(F_0\) denotes the
cdf associated with the hypothesized distribution. The probability plot
should fall roughly on the line \(y=x\) if \(F=F_0\). If \(F\) and \(F_0\)
merely differ by a shift in location and scale, that is, if
\(F(x) = F_0[(x - \mu) / \sigma]\), then the plot should fall roughly on the
line \(y = \sigma x + \mu\).
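A probability plot of this kind can be built by hand in base R. The sketch below assumes a standard Normal as the hypothesized distribution \(F_0\) and uses the plotting positions of Equation (1); the sample, its parameters, and the variable names are illustrative only.

```r
set.seed(123)
n <- 2000
x <- rnorm(n, mean = 10, sd = 2)   # true distribution: location 10, scale 2

p_hat <- seq_len(n) / (n + 1)      # plotting positions, Equation (1)
theo  <- qnorm(p_hat)              # F_0^{-1}(p_hat) for a standard Normal F_0
obs   <- sort(x)                   # ordered observations x_(i)

# Least-squares line through the plotted points; with a location-scale
# difference the intercept and slope should be near the location and scale
fit <- coef(lm(obs ~ theo))
fit

if (interactive()) {
  plot(theo, obs, xlab = "Quantiles of N(0, 1)", ylab = "Ordered observations")
  abline(fit)
}
```

Here the points scatter around a line with slope close to 2 and intercept close to 10 rather than around \(y = x\), which is what a pure location and scale difference looks like.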
The quantity \(\hat{p}_i = i/(n+1)\) in Equation (1) above is called the
plotting position for the probability plot. This particular
formula for the plotting position is appealing because it can be shown that
for any continuous distribution
$$E\{F[x_{(i)}]\} = \frac{i}{n+1} \;\;\;\;\;\; (3)$$
(Nelson, 1982, pp. 299-300; Stedinger et al., 1993). That is, the \(i\)'th
plotting position defined as in Equation (1) is the expected value of the true
cdf evaluated at the \(i\)'th order statistic. Many authors and practitioners,
however, prefer to use a plotting position that satisfies:
$$F^{-1}(\hat{p}_i) = E[x_{(i)}] \;\;\;\;\;\; (4)$$
or one that satisfies
$$F^{-1}(\hat{p}_i) = M[x_{(i)}] = F^{-1}\{M[u_{(i)}]\} \;\;\;\;\;\; (5)$$
where \(M[x_{(i)}]\) denotes the median of the distribution of the \(i\)'th
order statistic, and \(u_{(i)}\) denotes the \(i\)'th order statistic in a
random sample of \(n\) uniform (0,1) random variates.
The plotting positions in Equation (4) are often approximated since the expected
value of the \(i\)'th order statistic is often difficult and time-consuming
to compute. Note that these plotting positions will differ for different
distributions.
The plotting positions in Equation (5) were recommended by Filliben (1975) because
they require computing or approximating only the medians of
uniform (0,1) order statistics, no matter what the form
of the assumed cdf \(F_0\). Also, the median may be preferred as a measure of
central tendency because the distributions of most order statistics are skewed.
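The uniform order statistic \(u_{(i)}\) follows a Beta distribution with parameters \(i\) and \(n - i + 1\), so its mean and median can be computed directly. The sketch below compares the exact medians with the approximation \((i - 0.3175)/(n + 0.365)\), which corresponds to the Median row of the plotting-position table later in this section; the sample size is an arbitrary choice.

```r
n <- 10
i <- seq_len(n)

exact_mean    <- i / (n + 1)                # E[u_(i)], as in Equation (3)
exact_median  <- qbeta(0.5, i, n - i + 1)   # M[u_(i)], as in Equation (5)
approx_median <- (i - 0.3175) / (n + 0.365) # approximation to the median

round(cbind(i, exact_mean, exact_median, approx_median), 4)

# Largest error of the approximation over all i
max(abs(exact_median - approx_median))
```

Because the same uniform medians are pushed through \(F_0^{-1}\) for any hypothesized distribution, they need to be computed only once per sample size.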
Most plotting positions can be written as:
$$\hat{p}_i = \frac{i - a}{n - 2a + 1} \;\;\;\;\;\; (6)$$
where \(0 \le a \le 1\) (D'Agostino, 1986a, p. 25; Stedinger et al., 1993).
The quantity \(a\) is sometimes called the “plotting position constant”, and
is determined by the argument plot.pos.con in the function qqPlot.
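Equation (6) is straightforward to write as a function. The helper name plot_pos below is hypothetical; the base R function ppoints computes the same quantity when its argument a is supplied.

```r
# Plotting positions from Equation (6): (i - a) / (n - 2a + 1)
plot_pos <- function(n, a) {
  stopifnot(a >= 0, a <= 1)
  (seq_len(n) - a) / (n - 2 * a + 1)
}

plot_pos(5, a = 0)     # reduces to i / (n + 1)
plot_pos(5, a = 0.5)   # reduces to (i - 0.5) / n

# Agreement with base R's ppoints() for the same constant
all.equal(plot_pos(5, a = 0.375), ppoints(5, a = 0.375))
```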
The table below, adapted from Stedinger et al. (1993), displays commonly used
plotting positions based on Equation (6) for several distributions.
| Name | a | Distribution Often Used With | References |
|---|---|---|---|
| Weibull | 0 | Weibull, Uniform | Weibull (1939), Stedinger et al. (1993) |
| Median | 0.3175 | Several | Filliben (1975), Vogel (1986) |
| Blom | 0.375 | Normal and Others | Blom (1958), Looney and Gulledge (1985) |
| Cunnane | 0.4 | Several | Cunnane (1978), Chowdhury et al. (1991) |
| Gringorten | 0.44 | Gumbel | Gringorten (1963), Vogel (1986) |
| Hazen | 0.5 | Several | Hazen (1914), Chambers et al. (1983) |
For moderate and large sample sizes, there is very little difference in
visual appearance of the Q-Q plot for different choices of plotting positions.
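One way to see this is to compare two plotting position constants across sample sizes. The sketch below measures the largest gap, on the probability scale, between a = 0 and a = 0.5; the helper names and the sample sizes are illustrative choices.

```r
plot_pos <- function(n, a) (seq_len(n) - a) / (n - 2 * a + 1)

# Largest absolute difference between the a = 0 and a = 0.5 positions
max_gap <- function(n) max(abs(plot_pos(n, 0) - plot_pos(n, 0.5)))

gaps <- sapply(c(10, 100, 1000), max_gap)
names(gaps) <- c("n=10", "n=100", "n=1000")
gaps
```

The gap shrinks roughly in proportion to \(1/n\) and is largest at the smallest and largest order statistics, so any visible difference between choices tends to be confined to the extreme points of the plot.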
Comparing Two Data Sets
Let \(\underline{x} = x_1, x_2, \ldots, x_n\) denote the observations
in a random sample of size \(n\) from some unknown distribution with
cumulative distribution function \(F()\), and let
\(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\) denote the ordered observations. Similarly,
let \(\underline{y} = y_1, y_2, \ldots, y_m\) denote the observations
in a random sample of size \(m\) from some unknown distribution with
cumulative distribution function \(G()\), and let
\(y_{(1)}, y_{(2)}, \ldots, y_{(m)}\) denote the ordered observations.
Suppose we are interested in investigating whether the shape of the distribution
with cdf \(F\) is the same as the shape of the distribution with cdf \(G\)
(e.g., \(F\) and \(G\) may both be normal distributions but differ in mean
and standard deviation).
When \(n = m\), we can visually explore this question by plotting
\(y_{(i)}\) vs. \(x_{(i)}\), for \(i = 1, 2, \ldots, n\).
The values in \(\underline{y}\) are spread out in a certain way depending
on the true distribution: they may be more or less symmetric about some value
(the population mean or median) or they may be skewed to the right or left;
they may be concentrated close to the mean or median (platykurtic) or there may
be several observations “far away” from the mean or median on either side
(leptokurtic). Similarly, the values in \(\underline{x}\) are spread out in a
certain way. If the values in \(\underline{x}\) and \(\underline{y}\) are
spread out in the same way, then the plot of \(y_{(i)}\) vs. \(x_{(i)}\)
will be approximately a straight line. If the cdf \(F\) is exactly the same
as the cdf \(G\), then the plot of \(y_{(i)}\) vs. \(x_{(i)}\) will fall
roughly on the straight line \(y = x\). If \(F\) and \(G\) differ by a
shift in location and scale, that is, if \(F[(x-\mu)/\sigma] = G(x)\), then
the plot will fall roughly on the line \(y = \sigma x + \mu\).
When \(n > m\), a slight adjustment has to be made to produce the plot. Let
\(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_m\) denote the plotting positions
corresponding to the \(m\) empirical quantiles for the \(y\)'s and let
\(\hat{p}^*_1, \hat{p}^*_2, \ldots, \hat{p}^*_n\) denote the plotting positions
corresponding to the \(n\) empirical quantiles for the \(x\)'s. Then we plot
\(y_{(j)}\) vs. \(x^*_{(j)}\) for \(j = 1, 2, \ldots, m\) where
$$x^*_{(j)} = (1 - r) x_{(i)} + r x_{(i+1)} \;\;\;\;\;\; (7)$$
$$r = \frac{\hat{p}_j - \hat{p}^*_i}{\hat{p}^*_{i+1} - \hat{p}^*_i} \;\;\;\;\;\; (8)$$
$$\hat{p}^*_i \le \hat{p}_j \le \hat{p}^*_{i+1} \;\;\;\;\;\; (9)$$
That is, the values for the \(x^*_{(j)}\)'s are determined by linear interpolation
based on the values of the plotting positions for \(\underline{x}\) and
\(\underline{y}\).
A similar adjustment is made when \(n < m\).
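The interpolation in Equations (7) to (9) can be sketched with the base R function approx, which performs linear interpolation. The helper names, the default constant a = 0.375, and the use of the same constant for both samples are assumptions made for this example, not a description of qqPlot's internals.

```r
# Plotting positions from Equation (6)
plot_pos <- function(n, a = 0.375) (seq_len(n) - a) / (n - 2 * a + 1)

# Interpolated x quantiles at the plotting positions of y (case n > m)
interp_x_quantiles <- function(x, y, a = 0.375) {
  stopifnot(length(x) > length(y))
  p_star <- plot_pos(length(x), a)   # positions for the n sorted x values
  p_hat  <- plot_pos(length(y), a)   # positions for the m sorted y values
  approx(x = p_star, y = sort(x), xout = p_hat)$y
}

set.seed(1)
x <- rnorm(25)
y <- rnorm(10, mean = 1, sd = 2)

x_star <- interp_x_quantiles(x, y)

if (interactive()) {
  plot(x_star, sort(y), xlab = "Interpolated quantiles of x",
       ylab = "Quantiles of y")
}
```

When both samples use the same constant and \(n > m\), every plotting position for y lies between the smallest and largest plotting positions for x, so no extrapolation is needed.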
Note that the R function qqplot uses a different method than
the one in Equation (7) above; it uses linear interpolation based on
1:n and m by calling the approx function.
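The difference can be seen by reproducing the interpolation outside of qqplot. The sketch below assumes the behavior of qqplot in current versions of base R, where the longer sorted sample is interpolated over its indices at as many equally spaced points as the shorter sample has observations; the data are arbitrary.

```r
set.seed(7)
x <- rnorm(25)
y <- rnorm(10)
n <- length(x)
m <- length(y)

# Interpolate the sorted x values over the index 1:n at m equally spaced points
x_interp <- approx(1:n, sort(x), n = m)$y

# qqplot() returns the plotted coordinates when plot.it = FALSE
qq <- qqplot(x, y, plot.it = FALSE)

all.equal(qq$x, x_interp)
all.equal(qq$y, sort(y))
```

Under this scheme the smallest and largest values of the longer sample are always retained, whereas interpolation on plotting positions generally places the end points strictly inside the range of the longer sample.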