Usage

gofTest(y, ...)
## S3 method for class 'formula':
gofTest(y, data = NULL, subset,
na.action = na.pass, ...)
## S3 method for class 'default':
gofTest(y, x = NULL,
test = ifelse(is.null(x), "sw", "ks"),
distribution = "norm", est.arg.list = NULL,
alternative = "two.sided", n.classes = NULL,
cut.points = NULL, param.list = NULL,
estimate.params = ifelse(is.null(param.list), TRUE, FALSE),
n.param.est = NULL, correct = NULL, digits = .Options$digits,
exact = NULL, ws.method = "normal scores", warn = TRUE,
data.name = NULL, data.name.x = NULL, parent.of.data = NULL,
subset.expression = NULL, ...)
Arguments

y
an object containing data for the goodness-of-fit test. In the default method, y must be a numeric vector of observations. In the formula method, y must be a formula of the form y ~ 1, indicating that the observations in y are to be used for the test.

data
an optional data frame, list, or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which gofTest is called.

subset
an optional vector specifying a subset of observations to be used.

na.action
a function which indicates what should happen when the data contain NAs. The default is na.pass.

x
a numeric vector of values, used only when test="ks", in which case the two-sample Kolmogorov-Smirnov test is performed. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.

test
a character string indicating which goodness-of-fit test to perform. Possible values are: "sw" (Shapiro-Wilk; the default when x is NOT supplied), "sf" (Shapiro-Francia), "ppcc" (Probability Plot Correlation Coefficient), "skew" (Zero-skew), "chisq" (Chi-squared), "ks" (Kolmogorov-Smirnov; the default when x IS supplied), and "ws" (Wilk-Shapiro test for a Uniform [0, 1] distribution).

distribution
a character string denoting the abbreviation of the assumed distribution. See the help file for Distribution.df for a list of distributions and their abbreviations. The default value is distribution="norm" (Normal distribution).

est.arg.list
a list whose components are optional arguments associated with the function used to estimate the parameters of the assumed distribution. For example, when test="sw" and distribution="gamma", setting est.arg.list=list(method="bcmle") indicates using the bias-corrected maximum likelihood estimators.

alternative
for test="ks", test="skew", or test="ws", a character string specifying the alternative hypothesis. When test="ks" or test="skew", the possible values are "two-sided" (the default), "less", or "greater".

n.classes
for test="chisq", the number of cells into which the observations are to be allocated. If the argument cut.points is supplied, then n.classes is set to length(cut.points)-1.

cut.points
for test="chisq", a vector of cutpoints that defines the cells. The element x[i] is allocated to cell j if cut.points[j] < x[i] $\le$ cut.points[j+1].

param.list
for test="ks" when x is not supplied, or for test="chisq", a list with values for the parameters of the specified distribution. See the help file for Distribution.df for the names and possible values of the parameters associated with each distribution.

estimate.params
for test="ks" when x is not supplied, or for test="chisq", a logical scalar indicating whether to perform the goodness-of-fit test based on estimating the distribution parameters (estimate.params=TRUE) or based on the user-supplied distribution parameters in param.list (estimate.params=FALSE).

n.param.est
for test="ks" when x is not supplied, or for test="chisq", an integer indicating the number of parameters estimated from the data. If estimate.params=TRUE, the default value is the number of parameters associated with the distribution specified by distribution.

correct
for test="chisq", a logical scalar indicating whether to use the continuity correction. The default value is correct=FALSE unless n.classes=2.

digits
for test="ks" when x is not supplied, or for test="chisq", when param.list is supplied, a scalar indicating how many significant digits to print out for the parameters associated with the hypothesized distribution.

exact
for test="ks", exact=NULL by default, but can be set to a logical scalar indicating whether an exact p-value should be computed. See the help file for ks.test for more information.

ws.method
for test="ws", this argument specifies whether to perform the test based on normal scores (ws.method="normal scores", the default) or chi-square scores (ws.method="chi-square scores"). See the DETAILS section for more information.

warn
a logical scalar indicating whether to print a warning message when observations with NAs, NaNNs, or Infs in y or x are removed. The default value is TRUE.

data.name
character string indicating the name of the data used for argument y.

data.name.x
character string indicating the name of the data used for argument x.
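The following is a minimal sketch of the two equivalent calling conventions, assuming gofTest comes from the EnvStats package (the data, seed, and variable names are purely illustrative):

  library(EnvStats)
  set.seed(23)
  dat <- rlnorm(25, meanlog = 0, sdlog = 1)

  # Default method: pass the numeric vector directly
  gofTest(dat, test = "sw", distribution = "lnorm")

  # Formula method: y ~ 1, with the variable found in a data frame
  df <- data.frame(conc = dat)
  gofTest(conc ~ 1, data = df, test = "sw", distribution = "lnorm")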
Details

Shapiro-Wilk Goodness-of-Fit Test (test="sw").
The Shapiro-Wilk goodness-of-fit test (Shapiro and Wilk, 1965; Royston, 1992a)
is one of the most commonly used goodness-of-fit tests for normality.
You can use it to test the following hypothesized distributions: Normal, Lognormal, Three-Parameter Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta). In addition, you can also use it to test the null hypothesis of any continuous distribution that is available (see the help file for Distribution.df, and see the explanation below).
Shapiro-Wilk W-Statistic and P-Value for Testing Normality
Let $X$ denote a random variable with cumulative distribution function (cdf) $F$. Suppose we want to test the null hypothesis that $F$ is the cdf of a normal (Gaussian) distribution with some arbitrary mean $\mu$ and standard deviation $\sigma$ against the alternative hypothesis that $F$ is the cdf of some other distribution. The table below shows the random variable for which $F$ is the assumed cdf, given the value of the argument distribution.

distribution    Distribution Name                            Random Variable for which $F$ is the cdf
"norm"          Normal                                       $X$
"lnorm"         Lognormal (Log-space)                        $log(X)$
"lnormAlt"      Lognormal (Untransformed)                    $log(X)$
"lnorm3"        Three-Parameter Lognormal                    $log(X-\gamma)$
"zmnorm"        Zero-Modified Normal                         $X | X > 0$
"zmlnorm"       Zero-Modified Lognormal (Log-space)          $log(X) | X > 0$
"zmlnormAlt"    Zero-Modified Lognormal (Untransformed)      $log(X) | X > 0$

The Shapiro-Wilk $W$-statistic is the same as the square of the sample product-moment correlation between the vectors $\underline{a}$ and $\underline{x}$:
$$W = r(\underline{a}, \underline{x})^2 \;\;\;\;\;\; (9)$$
where $r(x, y)$ denotes the sample product-moment correlation between $x$ and $y$ (see the help file for cor).
The Shapiro-Wilk $W$-statistic is also simply the ratio of two estimators of
variance, and can be rewritten as
$$W = \frac{\hat{\sigma}_{BLUE}^2}{\hat{\sigma}_{MVUE}^2} \;\;\;\;\;\; (10)$$
where the numerator is the square of the best linear unbiased estimate (BLUE) of
the standard deviation, and the denominator is the minimum variance unbiased
estimator (MVUE) of the variance:
$$\hat{\sigma}_{BLUE} = \frac{\sum_{i=1}^n a_i x_i}{\sqrt{n-1}} \;\;\;\;\;\; (11)$$
$$\hat{\sigma}_{MVUE}^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} \;\;\;\;\;\; (12)$$
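As a quick numerical check of Equation (12), the MVUE in the denominator is just the usual sample variance (a sketch; sigma2.mvue is a hypothetical helper and dat is the illustrative vector from the sketch above):

  sigma2.mvue <- function(x) sum((x - mean(x))^2) / (length(x) - 1)
  sigma2.mvue(dat)   # agrees with var(dat)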
Small values of $W$ indicate the null hypothesis is probably not true.
Shapiro and Wilk (1965) computed the values of the coefficients $\underline{a}$
and the percentage points for $W$ (based on smoothing the empirical null
distribution of $W$) for sample sizes up to 50. Computation of the
$W$-statistic for larger sample sizes can be cumbersome, since computation of
the coefficients $\underline{a}$ requires storage of at least
$n + [n(n+1)/2]$ reals followed by $n \times n$ matrix inversion
(Royston, 1992a).
The Shapiro-Francia W'-Statistic
Shapiro and Francia (1972) introduced a modification of the $W$-test that
depends only on the expected values of the order statistics ($\underline{m}$)
and not on the variance-covariance matrix ($V$):
$$W' = \frac{(\sum_{i=1}^n b_i x_i)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (13)$$
where the quantity $b_i$ is the $i$'th element of the vector
$\underline{b}$ defined by:
$$\underline{b} = \frac{\underline{m}}{[\underline{m}^T \underline{m}]^{1/2}} \;\;\;\;\;\; (14)$$
Several authors, including Ryan and Joiner (1973), Filliben (1975), and Weisberg
and Bingham (1975), note that the $W'$-statistic is intuitively appealing
because it is the squared Pearson correlation coefficient associated with a normal
probability plot. That is, it is the squared correlation between the ordered
sample values $\underline{x}$ and the expected normal order statistics
$\underline{m}$:
$$W' = r(\underline{b}, \underline{x})^2 = r(\underline{m}, \underline{x})^2 \;\;\;\;\;\; (15)$$
Shapiro and Francia (1972) present a table of empirical percentage points for $W'$
based on a Monte Carlo simulation. It can be shown that the asymptotic null
distributions of $W$ and $W'$ are identical, but convergence is very slow
(Verrill and Johnson, 1988).
The Weisberg-Bingham Approximation to the W'-Statistic
Weisberg and Bingham (1975) introduced an approximation of the Shapiro-Francia
$W'$-statistic that is easier to compute. They suggested using Blom scores
(Blom, 1958, pp.68--75) to approximate the elements of $\underline{m}$:
$$\tilde{W}' = \frac{(\sum_{i=1}^n c_i x_i)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (16)$$
where the quantity $c_i$ is the $i$'th element of the vector
$\underline{c}$ defined by:
$$\underline{c} = \frac{\underline{\tilde{m}}}{[\underline{\tilde{m}}^T \underline{\tilde{m}}]^{1/2}} \;\;\;\;\;\; (17)$$
and
$$\tilde{m}_i = \Phi^{-1}[\frac{i - (3/8)}{n + (1/4)}] \;\;\;\;\;\; (18)$$
and $\Phi$ denotes the standard normal cdf. That is, the values of the
elements of $\underline{m}$ in Equation (14) are replaced with their estimates
based on the usual plotting positions for a normal distribution.
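A sketch of Equations (15)-(18): $\tilde{W}'$ is the squared correlation between the ordered data and the Blom scores (wb.stat is a hypothetical helper, not the EnvStats implementation):

  wb.stat <- function(x) {
    n <- length(x)
    m.tilde <- qnorm((1:n - 3/8) / (n + 1/4))   # Blom scores, Equation (18)
    cor(sort(x), m.tilde)^2                     # squared correlation, Equations (15)-(16)
  }
  wb.stat(dat)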
Royston's Approximation to the Shapiro-Wilk W-Test
Royston (1992a) presents an approximation for the coefficients $\underline{a}$
necessary to compute the Shapiro-Wilk $W$-statistic, and also a transformation
of the $W$-statistic that has approximately a standard normal distribution
under the null hypothesis.
Noting that, up to a constant, the components of $\underline{b}$ in
Equation (14) and $\underline{c}$ in Equation (17) differ from those of
$\underline{a}$ in Equation (2) mainly in the first and last two components,
Royston (1992a) used the approximation $\underline{c}$ as the basis for
approximating $\underline{a}$ using polynomial (quintic) regression analysis.
For $4 \le n \le 1000$, the approximation gave the following equations for the
last two (and hence first two) components of $\underline{a}$:
$$\tilde{a}_n = c_n + 0.221157 y - 0.147981 y^2 - 2.071190 y^3 + 4.434685 y^4 - 2.706056 y^5 \;\;\;\;\;\; (19)$$
$$\tilde{a}_{n-1} = c_{n-1} + 0.042981 y - 0.293762 y^2 - 1.752461 y^3 + 5.682633 y^4 - 3.582633 y^5 \;\;\;\;\;\; (20)$$
where
$$y = \frac{1}{\sqrt{n}} \;\;\;\;\;\; (21)$$
The other components are computed as:
$$\tilde{a}_i = \frac{\tilde{m}_i}{\sqrt{\eta}} \;\;\;\;\;\; (22)$$
for $i = 2, \ldots , n-1$ if $n \le 5$, or $i = 3, \ldots, n-2$ if
$n > 5$, where
$$\eta = \frac{\underline{\tilde{m}}^T \underline{\tilde{m}} - 2 \tilde{m}_n^2}{1 - 2 \tilde{a}_n^2} \;\;\;\;\;\; (23)$$
if $n \le 5$, and
$$\eta = \frac{\underline{\tilde{m}}^T \underline{\tilde{m}} - 2 \tilde{m}_n^2 - 2 \tilde{m}_{n-1}^2}{1 - 2 \tilde{a}_n^2 - 2 \tilde{a}_{n-1}^2} \;\;\;\;\;\; (24)$$
if $n > 5$.
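The following sketch assembles Royston's approximate coefficients from Equations (17)-(24) for $4 \le n \le 1000$ (royston.a is a hypothetical helper, not the EnvStats implementation):

  royston.a <- function(n) {
    m <- qnorm((1:n - 3/8) / (n + 1/4))       # Blom scores, Equation (18)
    cc <- m / sqrt(sum(m * m))                # Equation (17)
    y <- 1 / sqrt(n)                          # Equation (21)
    a <- cc
    a[n] <- cc[n] + 0.221157*y - 0.147981*y^2 - 2.071190*y^3 +
      4.434685*y^4 - 2.706056*y^5             # Equation (19)
    a[1] <- -a[n]
    if (n > 5) {
      a[n-1] <- cc[n-1] + 0.042981*y - 0.293762*y^2 - 1.752461*y^3 +
        5.682633*y^4 - 3.582633*y^5           # Equation (20)
      a[2] <- -a[n-1]
      eta <- (sum(m*m) - 2*m[n]^2 - 2*m[n-1]^2) /
        (1 - 2*a[n]^2 - 2*a[n-1]^2)           # Equation (24)
      a[3:(n-2)] <- m[3:(n-2)] / sqrt(eta)    # Equation (22)
    } else {
      eta <- (sum(m*m) - 2*m[n]^2) / (1 - 2*a[n]^2)   # Equation (23)
      a[2:(n-1)] <- m[2:(n-1)] / sqrt(eta)            # Equation (22)
    }
    a
  }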
Royston (1992a) found his approximation to $\underline{a}$ to be accurate to
at least $\pm 1$ in the third decimal place over all values of $i$ and
selected values of $n$, and also found that critical percentage points of
$W$ based on his approximation agreed closely with the exact critical
percentage points calculated by Verrill and Johnson (1988).
Transformation of the Null Distribution of W to Normality
In order to compute a p-value associated with a particular value of $W$,
Royston (1992a) approximated the distribution of $(1-W)$ by a
three-parameter lognormal distribution for $4 \le n \le 11$,
and the upper half of the distribution of $(1-W)$ by a two-parameter
lognormal distribution for $12 \le n \le 2000$.
Setting
$$z = \frac{w - \mu}{\sigma} \;\;\;\;\;\; (25)$$
the p-value associated with $W$ is given by:
$$p = 1 - \Phi(z) \;\;\;\;\;\; (26)$$
For $4 \le n \le 11$, the quantities necessary to compute $z$ are given by:
$$w = -log[\gamma - log(1 - W)] \;\;\;\;\;\; (27)$$
$$\gamma = -2.273 + 0.459 n \;\;\;\;\;\; (28)$$
$$\mu = 0.5440 - 0.39978 n + 0.025054 n^2 - 0.000671 n^3 \;\;\;\;\;\; (29)$$
$$\sigma = exp(1.3822 - 0.77857 n + 0.062767 n^2 - 0.0020322 n^3) \;\;\;\;\;\; (30)$$
For $12 \le n \le 2000$, the quantities necessary to compute $z$ are given
by:
$$w = log(1 - W) \;\;\;\;\;\; (31)$$
$$y = log(n) \;\;\;\;\;\; (32)$$
$$\mu = -1.5861 - 0.31082 y - 0.083751 y^2 + 0.0038915 y^3 \;\;\;\;\;\; (33)$$
$$\sigma = exp(-0.4803 - 0.082676 y + 0.0030302 y^2) \;\;\;\;\;\; (34)$$
Royston (1992a) claims that this latter approximation (for $12 \le n \le 2000$) is actually valid for sample sizes up to $n = 5000$.
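A sketch of the p-value computation for $12 \le n \le 2000$ from Equations (25), (26), and (31)-(34) (sw.pvalue is a hypothetical helper; the value of W would come from the W-statistic described above):

  sw.pvalue <- function(W, n) {
    y <- log(n)                                                # Equation (32)
    mu <- -1.5861 - 0.31082*y - 0.083751*y^2 + 0.0038915*y^3   # Equation (33)
    sigma <- exp(-0.4803 - 0.082676*y + 0.0030302*y^2)         # Equation (34)
    z <- (log(1 - W) - mu) / sigma                             # Equations (25) and (31)
    1 - pnorm(z)                                               # Equation (26)
  }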
Modification for the Three-Parameter Lognormal Distribution
When distribution="lnorm3"
, the function gofTest
assumes the vector
$\underline{x}$ is a random sample from a
three-parameter lognormal distribution. It estimates the
threshold parameter via the zero-skewness method (see elnorm3
), and
then performs the Shapiro-Wilk goodness-of-fit test for normality on
$log(x-\hat{\gamma})$ where $\hat{\gamma}$ is the estimated threshold
parameter. Because the threshold parameter has to be estimated, however, the
p-value associated with the computed z-statistic will tend to be conservative
(larger than it should be under the null hypothesis). Royston (1992b) proposed
the following transformation of the z-statistic:
$$z' = \frac{z - \mu_z}{\sigma_z} \;\;\;\;\;\; (35)$$
where for $5 \le n \le 11$,
$$\mu_z = -3.8267 + 2.8242 u - 0.63673 u^2 - 0.020815 v \;\;\;\;\;\; (36)$$
$$\sigma_z = -4.9914 + 8.6724 u - 4.27905 u^2 + 0.70350 u^3 - 0.013431 v \;\;\;\;\;\; (37)$$
and for $12 \le n \le 2000$,
$$\mu_z = -3.7796 + 2.4038 u - 0.6675 u^2 - 0.082863 u^3 - 0.0037935 u^4 - 0.027027 v - 0.0019887 vu \;\;\;\;\;\; (38)$$
$$\sigma_z = 2.1924 - 1.0957 u + 0.33737 u^2 - 0.043201 u^3 + 0.0019974 u^4 - 0.0053312 vu \;\;\;\;\;\; (39)$$
where
$$u = log(n) \;\;\;\;\;\; (40)$$
$$v = u (\hat{\sigma} - \hat{\sigma}^2) \;\;\;\;\;\; (41)$$
$$\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2 \;\;\;\;\;\; (42)$$
$$y_i = log(x_i - \hat{\gamma}) \;\;\;\;\;\; (43)$$
and $\gamma$ denotes the threshold parameter. The p-value associated with
this test is then given by:
$$p = 1 - \Phi(z') \;\;\;\;\;\; (44)$$
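For example, a sketch of this test for a three-parameter lognormal fit (using the illustrative dat from above; the threshold is estimated internally by the zero-skewness method):

  gofTest(dat, test = "sw", distribution = "lnorm3")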
Testing Goodness-of-Fit for Any Continuous Distribution
The function gofTest
extends the Shapiro-Wilk test to test for
goodness-of-fit for any continuous distribution by using the idea of
Chen and Balakrishnan (1995), who proposed a general purpose approximate
goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling
goodness-of-fit tests for normality. The function gofTest modifies the approach of Chen and Balakrishnan (1995) by using the same first two steps and then applying the Shapiro-Wilk test:

1. Compute the cdf values $p_i = \hat{F}(x_i)$, where $\hat{F}$ denotes the cdf of the hypothesized distribution with estimated parameters.

2. Compute the normal quantiles $y_i = \Phi^{-1}(p_i)$.

3. Perform the Shapiro-Wilk goodness-of-fit test for normality on the $y_i$'s (see the sketch below).
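For example, here is a sketch of testing a gamma fit via this generalized Shapiro-Wilk approach (simulated data; purely illustrative):

  set.seed(47)
  gam.dat <- rgamma(30, shape = 2, scale = 3)
  gofTest(gam.dat, test = "sw", distribution = "gamma")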
test="sf"
).
The Shapiro-Francia goodness-of-fit test (Shapiro and Francia, 1972;
Weisberg and Bingham, 1975; Royston, 1992c) is also one of the most commonly
used goodness-of-fit tests for normality. You can use it to test the following
hypothesized distributions:
Normal, Lognormal, Zero-Modified Normal,
or Zero-Modified Lognormal (Delta). In addition,
you can also use it to test the null hypothesis of any continuous distribution
that is available (see the help file for Distribution.df
). See the
section Testing Goodness-of-Fit for Any Continuous Distribution above for
an explanation of how this is done.
Royston's Transformation of the Shapiro-Francia W'-Statistic to Normality
Equation (13) above gives the formula for the Shapiro-Francia W'-statistic, and
Equation (16) above gives the formula for Weisberg-Bingham approximation to the
W'-statistic (denoted $\tilde{W}'$). Royston (1992c) presents an algorithm
to transform the $\tilde{W}'$-statistic so that its null distribution is
approximately a standard normal. For $5 \le n \le 5000$,
Royston (1992c) approximates the distribution of $(1-\tilde{W}')$ by a
lognormal distribution. Setting
$$z = \frac{w-\mu}{\sigma} \;\;\;\;\;\; (45)$$
the p-value associated with $\tilde{W}'$ is given by:
$$p = 1 - \Phi(z) \;\;\;\;\;\; (46)$$
The quantities necessary to compute $z$ are given by:
$$w = log(1 - \tilde{W}') \;\;\;\;\;\; (47)$$
$$\nu = log(n) \;\;\;\;\;\; (48)$$
$$u = log(\nu) - \nu \;\;\;\;\;\; (49)$$
$$\mu = -1.2725 + 1.0521 u \;\;\;\;\;\; (50)$$
$$v = log(\nu) + \frac{2}{\nu} \;\;\;\;\;\; (51)$$
$$\sigma = 1.0308 - 0.26758 v \;\;\;\;\;\; (52)$$
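A sketch of Equations (45)-(52) (sf.pvalue is a hypothetical helper; wb.stat is the hypothetical Weisberg-Bingham sketch from above):

  sf.pvalue <- function(W.tilde, n) {
    nu <- log(n)                            # Equation (48)
    u  <- log(nu) - nu                      # Equation (49)
    mu <- -1.2725 + 1.0521 * u              # Equation (50)
    v  <- log(nu) + 2 / nu                  # Equation (51)
    sigma <- 1.0308 - 0.26758 * v           # Equation (52)
    z <- (log(1 - W.tilde) - mu) / sigma    # Equations (45) and (47)
    1 - pnorm(z)                            # Equation (46)
  }
  sf.pvalue(wb.stat(dat), length(dat))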
test="ppcc"
).
The PPPCC goodness-of-fit test (Filliben, 1975; Looney and Gulledge, 1985) can be
used to test the following hypothesized distributions:
Normal, Lognormal,
Zero-Modified Normal, or
Zero-Modified Lognormal (Delta). In addition,
you can also use it to test the null hypothesis of any continuous distribution that
is available (see the help file for Distribution.df
).
Filliben (1975) proposed using the correlation coefficient $r$ from a normal probability plot to perform a goodness-of-fit test for normality, and he provided a table of critical values for $r$ under the null hypothesis of normality for sample sizes between 3 and 100. Vogel (1986) provided an additional table for sample sizes between 100 and 10,000.
Looney and Gulledge (1985) investigated the characteristics of Filliben's
probability plot correlation coefficient (PPCC) test using the plotting position
formulas given in Filliben (1975), as well as three other plotting position
formulas: Hazen plotting positions, Weibull plotting positions, and Blom plotting
positions (see the help file for qqPlot
for an explanation of these
plotting positions). They concluded that the PPCC test based on Blom plotting
positions performs slightly better than tests based on other plotting positions, and
they provide a table of empirical percentage points for the distribution of $r$
based on Blom plotting positions.
The function gofTest
computes the PPCC test statistic $r$ using Blom
plotting positions. It can be shown that the square of this statistic is
equivalent to the Weisberg-Bingham Approximation to the Shapiro-Francia
W'-Test (Weisberg and Bingham, 1975; Royston, 1993). Thus the PPCC
goodness-of-fit test is equivalent to the Shapiro-Francia goodness-of-fit test.
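Because of this equivalence, the two calls below should report matching p-values (a sketch using the illustrative dat from above):

  gofTest(dat, test = "ppcc")
  gofTest(dat, test = "sf")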
test="skew"
).
The Zero-skew goodness-of-fit test (D'Agostino, 1970) can be used to test the
following hypothesized distributions:
Normal, Lognormal, Zero-Modified Normal,
or Zero-Modified Lognormal (Delta).
When test="skew"
, the function gofTest
tests the null hypothesis
that the skew of the distribution is 0:
$$H_0: \sqrt{\beta}_1 = 0 \;\;\;\;\;\; (53)$$
where
$$\sqrt{\beta}_1 = \frac{\mu_3}{\mu_2^{3/2}} \;\;\;\;\;\; (54)$$
and the quantity $\mu_r$ denotes the $r$'th moment about the mean
(also called the $r$'th central moment). The quantity $\sqrt{\beta_1}$
is called the coefficient of skewness, and is estimated by:
$$\sqrt{b}_1 = \frac{m_3}{m_2^{3/2}} \;\;\;\;\;\; (55)$$
where
$$m_r = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^r \;\;\;\;\;\; (56)$$
denotes the $r$'th sample central moment.
The possible alternative hypotheses are:
$$H_a: \sqrt{\beta}_1 \ne 0 \;\;\;\;\;\; (57)$$
$$H_a: \sqrt{\beta}_1 < 0 \;\;\;\;\;\; (58)$$
$$H_a: \sqrt{\beta}_1 > 0 \;\;\;\;\;\; (59)$$
which correspond to alternative="two-sided"
, alternative="less"
, and
alternative="greater"
, respectively.
To test the null hypothesis of zero skew, D'Agostino (1970) derived an
approximation to the distribution of $\sqrt{b_1}$ under the null hypothesis of
zero-skew, assuming the observations comprise a random sample from a normal
(Gaussian) distribution. Based on D'Agostino's approximation, the statistic
$Z$ shown below is assumed to follow a standard normal distribution and is
used to compute the p-value associated with the test of $H_0$:
$$Z = \delta \; log\{\frac{Y}{\alpha} + [(\frac{Y}{\alpha})^2 + 1]^{1/2}\} \;\;\;\;\;\; (60)$$
where
$$Y = \sqrt{b_1} [\frac{(n+1)(n+3)}{6(n-2)}]^{1/2} \;\;\;\;\;\; (61)$$
$$\beta_2 = \frac{3(n^2 + 27n - 70)(n+1)(n+3)}{(n-2)(n+5)(n+7)(n+9)} \;\;\;\;\;\; (62)$$
$$W^2 = -1 + \sqrt{2\beta_2 - 2} \;\;\;\;\;\; (63)$$
$$\delta = 1 / \sqrt{log(W)} \;\;\;\;\;\; (64)$$
$$\alpha = [2 / (W^2 - 1)]^{1/2} \;\;\;\;\;\; (65)$$
When the sample size $n$ is at least 150, a simpler approximation may be
used in which $Y$ in Equation (61) is assumed to follow a standard normal
distribution and is used to compute the p-value associated with the hypothesis
test.
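For example, a one-sided test against positive skew (a sketch using the illustrative dat from above):

  gofTest(dat, test = "skew", alternative = "greater")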
test="ks"
).
When test="ks"
, the function gofTest
calls the Rfunction
ks.test
to compute the test statistic and p-value. Note that for the
one-sample case, the distribution parameters
should be pre-specified and not estimated from the data, and if the distribution parameters
are estimated from the data you will receive a warning that this test is very conservative
(Type I error smaller than assumed; high Type II error) in this case.
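When the argument x is supplied, the two-sample test is performed instead. A sketch (simulated data; purely illustrative):

  y1 <- rnorm(20)
  y2 <- rnorm(20, mean = 1)
  gofTest(y = y1, x = y2, test = "ks")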
test="chisq"
).
The method used by gofTest
is a modification of what is used for chisq.test
.
If the hypothesized distribution function is completely specified, the degrees of
freedom are $m-1$ where $m$ denotes the number of classes. If any parameters
are estimated, the degrees of freedom depend on the method of estimation.
The function gofTest
follows the convention of computing
degrees of freedom as $m-1-k$, where $k$ is the number of parameters estimated.
It can be shown that if the parameters are estimated by maximum likelihood, the degrees of
freedom are bounded between $m-1$ and $m-1-k$. Therefore, especially when the
sample size is small, it is important to compare the test statistic to the chi-squared
distribution with both $m-1$ and $m-1-k$ degrees of freedom. See
Kendall and Stuart (1991, Chapter 30) for a more complete discussion.
The distribution theory of chi-square statistics is a large sample theory.
The expected cell counts are assumed to be at least moderately large.
As a rule of thumb, each should be at least 5. Although authors have found this rule
to be conservative (especially when the class probabilities are not too different
from each other), the user should regard p-values with caution when expected cell
counts are small.
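A sketch of the chi-squared test with estimated parameters (using the illustrative dat from above; the number of classes is arbitrary):

  gofTest(dat, test = "chisq", distribution = "lnorm", n.classes = 6)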
test="ws"
).
Wilk and Shapiro (1968) suggested this test in the context of jointly testing several
independent samples for normality simultaneously. If $p_1, p_2, \ldots, p_n$ denote
the p-values associated with the test for normality of $n$ independent samples, then
under the null hypothesis that all $n$ samples come from a normal distribution, the
p-values are a random sample of $n$ observations from a Uniform [0,1] distribution,
that is a Uniform distribution with minimum 0 and maximum 1. Wilk and Shapiro (1968)
suggested two different methods for testing whether the p-values come from a Uniform [0, 1]
distribution:
Normal Scores Method (ws.method="normal scores"). Under the null hypothesis, the normal scores $\Phi^{-1}(p_1), \Phi^{-1}(p_2), \ldots, \Phi^{-1}(p_n)$ behave as a random sample from a standard normal distribution, so the test statistic $G$, the sum of the normal scores divided by $\sqrt{n}$, follows a standard normal distribution. For the one-sided upper alternative that the cdf for the distribution of p-values is greater than the cdf for a Uniform [0, 1] distribution (alternative="greater"), this alternative hypothesis would tend to make $G$ smaller than expected, so the p-value is given by $p = \Phi(G)$. For the one-sided lower alternative that the cdf for the distribution of p-values is less than the cdf for a Uniform [0, 1] distribution, the p-value is given by
$$p = 1 - \Phi(G) \;\;\;\;\;\; (68)$$

Chi-Square Scores Method (ws.method="chi-square scores"). Under the null hypothesis, the quantities $-2 log(p_1), -2 log(p_2), \ldots, -2 log(p_n)$ behave as a random sample from a chi-square distribution with 2 degrees of freedom, so the test statistic $C$, the sum of these scores, follows a chi-square distribution with $2n$ degrees of freedom. For the one-sided upper alternative (alternative="greater"), this alternative hypothesis would tend to make $C$ larger than expected, so the p-value is given by
$$p = 1 - F_{2n}(C) \;\;\;\;\;\; (71)$$
where $F_{2n}$ denotes the cumulative distribution function of the chi-square distribution with $2n$ degrees of freedom. For the one-sided lower alternative that the cdf for the distribution of p-values is less than the cdf for a Uniform [0, 1] distribution, the p-value is given by
$$p = F_{2n}(C) \;\;\;\;\;\; (72)$$
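A sketch of applying this test to a vector of p-values (here simulated as Uniform [0, 1], so the joint null hypothesis holds):

  set.seed(5)
  pvals <- runif(15)
  gofTest(pvals, test = "ws", ws.method = "normal scores", alternative = "greater")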
Value

a list of class "gof" containing the results of the goodness-of-fit test, unless the two-sample Kolmogorov-Smirnov test is used, in which case the value is a list of class "gofTwoSample". Objects of class "gof" and "gofTwoSample" have special printing and plotting methods. See the help files for gof.object and gofTwoSample.object for details.
Note

When distribution="lnorm3", the threshold parameter of the three-parameter lognormal distribution is estimated via the zero-skewness method (see the help file for elnorm3). Usually, the Shapiro-Wilk or Shapiro-Francia test is preferred to the zero-skew test, unless the direction of the alternative to normality (e.g., positive skew) is known (D'Agostino, 1986b, pp. 405--406).
Kolmogorov (1933) introduced a goodness-of-fit test to test the hypothesis that a
random sample of $n$ observations x comes from a specific hypothesized distribution
with cumulative distribution function $H$. This test is now usually called the
one-sample Kolmogorov-Smirnov goodness-of-fit test. Smirnov (1939) introduced a
goodness-of-fit test to test the hypothesis that a random sample of $n$
observations x comes from the same distribution as a random sample of
$m$ observations y. This test is now usually called the two-sample
Kolmogorov-Smirnov goodness-of-fit test. Both tests are based on the maximum
vertical distance between two cumulative distribution functions. For the one-sample problem
with a small sample size, the Kolmogorov-Smirnov test may be preferred over the chi-squared
goodness-of-fit test since the KS-test is exact, while the chi-squared test is based on
an asymptotic approximation.
The chi-squared test, introduced by Pearson in 1900, is the oldest and best known
goodness-of-fit test. The idea is to reduce the goodness-of-fit problem to a
multinomial setting by comparing the observed cell counts with their expected values
under the null hypothesis. Grouping the data sacrifices information, especially if the
hypothesized distribution is continuous. On the other hand, chi-squared tests can be
applied to any type of variable: continuous, discrete, or a combination of these.
The Wilk-Shapiro (1968) tests for a Uniform [0, 1] distribution were introduced in the context
of testing whether several independent samples all come from normal distributions, with
possibly different means and variances. The function gofGroupTest
extends
this idea to allow you to test whether several independent samples come from the same
distribution (e.g., gamma, extreme value, etc.), with possibly different parameters.
In practice, almost any goodness-of-fit test will not reject the null hypothesis if the number of observations is relatively small. Conversely, almost any goodness-of-fit test will reject the null hypothesis if the number of observations is very large, since real data never follow any theoretical distribution exactly. For this reason, it is helpful to supplement formal goodness-of-fit tests with graphical assessments of fit (see qqPlot).
See Also

rosnerTest, gof.object, print.gof, plot.gof, shapiro.test, ks.test, chisq.test, Normal, Lognormal, Lognormal3, Zero-Modified Normal, Zero-Modified Lognormal (Delta), enorm, elnorm, elnormAlt, elnorm3, ezmnorm, ezmlnorm, ezmlnormAlt, qqPlot.