chenTTest: Chen's Modified One-Sided t-test for Skewed Distributions

Description

For a skewed distribution, estimate the mean, standard deviation, and skew; test the null hypothesis that the mean is equal to a user-specified value vs. a one-sided alternative; and create a one-sided confidence interval for the mean.

Usage

chenTTest(x, y = NULL, alternative = "greater", mu = 0, paired = !is.null(y), 
    conf.level = 0.95, ci.method = "z")

Value

a list of class "htest" containing the results of the hypothesis test. See the help file for htest.object for details.

Arguments

x: numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
y: optional numeric vector of observations that are paired with the observations in x. The length of y must be the same as the length of x. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. This argument is ignored if paired=FALSE, and must be supplied if paired=TRUE. The default value is y=NULL.
alternative: character string indicating the kind of alternative hypothesis. The possible values are "greater" (the default) and "less". The value "greater" should be used for positively-skewed distributions, and the value "less" should be used for negatively-skewed distributions.
mu: numeric scalar indicating the hypothesized value of the mean. The default value is mu=0.
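paired: logical scalar indicating whether to perform a paired or one-sample t-test. The possible values are paired=FALSE (a one-sample t-test) and paired=TRUE (a paired t-test). The default value is paired=!is.null(y), so the test is one-sample when y is not supplied and paired when it is.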
conf.level: numeric scalar between 0 and 1 indicating the confidence level associated with the confidence interval for the population mean. The default value is conf.level=0.95.
ci.method: character string indicating which critical value to use to construct the confidence interval for the mean. The possible values are "z" (the default), "t", and "Avg. of z and t". See the DETAILS section below for more information.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Details

One-Sample Case (paired=FALSE)
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ independent and identically distributed (i.i.d.) observations from some distribution with mean $\mu$ and standard deviation $\sigma$.

Background: The Conventional Student's t-Test
Assume that the $n$ observations come from a normal (Gaussian) distribution, and consider the test of the null hypothesis: $$H_0: \mu = \mu_0 \;\;\;\;\;\; (1)$$ The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"): $$H_a: \mu > \mu_0 \;\;\;\;\;\; (2)$$ the lower one-sided alternative (alternative="less"): $$H_a: \mu < \mu_0 \;\;\;\;\;\; (3)$$ and the two-sided alternative: $$H_a: \mu \ne \mu_0 \;\;\;\;\;\; (4)$$ The test of the null hypothesis (1) versus any of the three alternatives (2)-(4) is usually based on the Student t-statistic: $$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \;\;\;\;\;\; (5)$$ where $$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \;\;\;\;\;\; (6)$$ $$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\;\;\; (7)$$ (see the R help file for t.test). Under the null hypothesis (1), the t-statistic in (5) follows a Student's t-distribution with $n-1$ degrees of freedom (Zar, 2010, p.99; Johnson et al., 1995, pp.362-363). The t-statistic is fairly robust to departures from normality in terms of maintaining Type I error and power, provided that the sample size is sufficiently large.
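As a small illustration of equations (5)-(7), the following base R sketch computes the conventional t-statistic by hand and compares it with t.test. The data vector and the hypothesized mean are made up for the example.

```r
# Hypothetical sample and hypothesized mean; any numeric vector would do.
x   <- c(2.1, 3.4, 1.9, 5.6, 2.8, 9.3, 2.2, 3.1)
mu0 <- 3

n    <- length(x)
xbar <- mean(x)                              # equation (6)
s    <- sd(x)                                # square root of equation (7)
t_by_hand <- (xbar - mu0) / (s / sqrt(n))    # equation (5)

# Upper one-sided p-value from Student's t with n - 1 degrees of freedom
p_by_hand <- pt(t_by_hand, df = n - 1, lower.tail = FALSE)

# Cross-check against base R's t.test
fit <- t.test(x, mu = mu0, alternative = "greater")
all.equal(t_by_hand, unname(fit$statistic))
all.equal(p_by_hand, fit$p.value)
```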

Chen's Modified t-Test for Skewed Distributions
In the case when the underlying distribution of the $n$ observations is positively skewed and the sample size is small, the sampling distribution of the t-statistic under the null hypothesis (1) does not follow a Student's t-distribution, but is instead negatively skewed. For the test against the upper alternative in (2) above, this leads to a Type I error smaller than the one assumed and a loss of power (Chen, 1995b, p.767).

Similarly, in the case when the underlying distribution of the $n$ observations is negatively skewed and the sample size is small, the sampling distribution of the t-statistic is positively skewed. For the test against the lower alternative in (3) above, this also leads to a Type I error smaller than the one assumed and a loss of power.

In order to overcome these problems, Chen (1995b) proposed the following modified t-statistic that takes into account the skew of the underlying distribution: $$t_2 = t + a(1 + 2t^2) + 4a^2(t + 2t^3) \;\;\;\;\;\; (8)$$ where $$a = \frac{\sqrt{\hat{\beta}_1}}{6\sqrt{n}} \;\;\;\;\;\; (9)$$ $$\hat{\beta}_1 = \frac{\hat{\mu}_3}{\hat{\sigma}^3} \;\;\;\;\;\; (10)$$ $$\hat{\mu}_3 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^n (x_i - \bar{x})^3 \;\;\;\;\;\; (11)$$ $$\hat{\sigma}^3 = s^3 = \left[\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\right]^{3/2} \;\;\;\;\;\; (12)$$ Note that the quantity $\sqrt{\hat{\beta}_1}$ in (9) is an estimate of the skew of the underlying distribution and is based on unbiased estimators of central moments (see the help file for skewness).
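The modified statistic can be sketched in a few lines of base R. The helper chen_t2 below is hypothetical (it is not part of EnvStats), and the data are the 60 observations printed in the Examples section, entered by hand. The sketch uses the scaling $a = \sqrt{\hat{\beta}_1}/(6\sqrt{n})$, which is the one that reproduces the test statistic shown in that example output.

```r
# chen_t2() is a hypothetical helper written for illustration only.
chen_t2 <- function(x, mu0) {
  n    <- length(x)
  xbar <- mean(x)
  s    <- sd(x)
  t    <- (xbar - mu0) / (s / sqrt(n))                      # equation (5)
  mu3  <- n / ((n - 1) * (n - 2)) * sum((x - xbar)^3)       # equation (11)
  skew <- mu3 / s^3                                         # equations (10) and (12)
  a    <- skew / (6 * sqrt(n))                              # skew correction term
  t2   <- t + a * (1 + 2 * t^2) + 4 * a^2 * (t + 2 * t^3)   # equation (8)
  list(t = t, skew = skew, a = a, t2 = t2)
}

# The 60 sorted observations from the Examples section, entered by hand
x <- rep(c(16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
           32, 33, 35, 97, 98, 105, 107, 111, 117, 119),
         times = c(1, 3, 2, 3, 6, 3, 4, 3, 6, 4, 2, 4, 2, 2, 1,
                   3, 2, 2, 1, 1, 1, 1, 1, 1, 1))

res <- chen_t2(x, mu0 = 30)
res$skew   # sample skew
res$t      # conventional t-statistic
res$t2     # Chen's modified statistic
```

Up to rounding, res$skew and res$t2 should agree with the skew (2.365778) and test statistic (1.574075) printed in the Examples section.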

For a positively-skewed distribution, Chen's modified t-test rejects the null hypothesis (1) in favor of the upper one-sided alternative (2) if the t-statistic in (8) is too large. For a negatively-skewed distribution, Chen's modified t-test rejects the null hypothesis (1) in favor of the lower one-sided alternative (3) if the t-statistic in (8) is too small.

Chen's modified t-test is not applicable to testing the two-sided alternative (4). It should also not be used to test the upper one-sided alternative (2) based on negatively-skewed data, nor should it be used to test the lower one-sided alternative (3) based on positively-skewed data.
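One way to guard against a mismatch between the data and the chosen alternative is to compare the sign of the sample skew with the requested direction before running the test. The helpers below are hypothetical, and the sample skew is only a proxy for the skew of the underlying distribution; whether chenTTest itself performs a similar check is not stated here.

```r
# sample_skew() and check_alternative() are hypothetical helpers.
sample_skew <- function(x) {
  n <- length(x)
  # Ratio of the unbiased third-moment estimator to s^3
  n / ((n - 1) * (n - 2)) * sum((x - mean(x))^3) / sd(x)^3
}

check_alternative <- function(x, alternative = c("greater", "less")) {
  alternative <- match.arg(alternative)
  g  <- sample_skew(x)
  ok <- (alternative == "greater" && g > 0) ||
        (alternative == "less"    && g < 0)
  if (!ok) {
    warning("sample skew (", signif(g, 3),
            ") does not match alternative = \"", alternative, "\"")
  }
  invisible(ok)
}

right_skewed <- c(1, 1, 2, 2, 3, 4, 12)      # made-up, positively skewed
check_alternative(right_skewed, "greater")   # consistent, no warning
check_alternative(-right_skewed, "less")     # consistent, no warning
```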

Determination of Critical Values and p-Values
Chen (1995b) performed a simulation study in which the modified t-statistic in (8) was compared to a critical value based on the normal distribution (z-value), a critical value based on Student's t-distribution (t-value), and the average of the critical z-value and t-value. Based on the simulation study, Chen (1995b) suggests using either the z-value or the average of the z-value and t-value when $n$ (the sample size) is small (e.g., $n \le 10$) or $\alpha$ (the Type I error) is small (e.g., $\alpha \le 0.01$), and using either the t-value or the average of the z-value and t-value when $n \ge 20$ or $\alpha \ge 0.05$.

The function chenTTest returns three different p-values: one based on the normal distribution, one based on Student's t-distribution, and one based on the average of these two p-values. This last p-value should roughly correspond to a p-value based on the distribution of the average of a normal and Student's t random variable.
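For the upper one-sided alternative, the three p-values can be sketched directly from the modified statistic. The values of t2 and df below are copied from the printed output in the Examples section; the variable names are only illustrative.

```r
# Values taken from the printed output in the Examples section
t2 <- 1.574075   # Chen's modified statistic
df <- 59         # n - 1 with n = 60

# Upper-tail p-values, as for alternative = "greater"
p_z   <- pnorm(t2, lower.tail = FALSE)
p_t   <- pt(t2, df = df, lower.tail = FALSE)
p_avg <- (p_z + p_t) / 2

round(c(z = p_z, t = p_t, "Avg. of z and t" = p_avg), 6)

# Corresponding critical values at alpha = 0.05
alpha <- 0.05
crit  <- c(z = qnorm(1 - alpha), t = qt(1 - alpha, df = df))
crit["avg"] <- mean(crit)   # average of the z and t critical values
crit
```

For alternative="less", the lower tails would be used instead (lower.tail = TRUE).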

Computing Confidence Intervals
The function chenTTest computes a one-sided confidence interval for the true mean $\mu$ by finding all possible values of $\mu$ for which the null hypothesis (1) will not be rejected, with the confidence level determined by the argument conf.level. The argument ci.method determines which p-value is used in the algorithm to determine the bounds on $\mu$. When ci.method="z", the p-value is based on the normal distribution; when ci.method="t", the p-value is based on Student's t-distribution; and when ci.method="Avg. of z and t", the p-value is the average of the p-values based on the normal and Student's t-distributions.
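The test-inversion idea can be sketched with uniroot: the lower confidence limit is the hypothesized mean at which the p-value equals 1 - conf.level. This is only a sketch under the assumption that ci.method="z" and alternative="greater"; the helper chen_p_z is hypothetical, and the numerical details inside chenTTest may differ. The data are the 60 observations from the Examples section, entered by hand.

```r
# chen_p_z() is a hypothetical helper: the normal-based p-value of Chen's
# modified statistic for the upper one-sided alternative.
chen_p_z <- function(x, mu0) {
  n    <- length(x)
  s    <- sd(x)
  t    <- (mean(x) - mu0) / (s / sqrt(n))
  skew <- n / ((n - 1) * (n - 2)) * sum((x - mean(x))^3) / s^3
  a    <- skew / (6 * sqrt(n))
  t2   <- t + a * (1 + 2 * t^2) + 4 * a^2 * (t + 2 * t^3)
  pnorm(t2, lower.tail = FALSE)
}

x <- rep(c(16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
           32, 33, 35, 97, 98, 105, 107, 111, 117, 119),
         times = c(1, 3, 2, 3, 6, 3, 4, 3, 6, 4, 2, 4, 2, 2, 1,
                   3, 2, 2, 1, 1, 1, 1, 1, 1, 1))

conf.level <- 0.95
se <- sd(x) / sqrt(length(x))

# Search between 10 standard errors below the sample mean (p-value near 0)
# and the sample mean itself (p-value near 0.5) for the boundary value.
lcl <- uniroot(function(m) chen_p_z(x, m) - (1 - conf.level),
               interval = c(mean(x) - 10 * se, mean(x)),
               tol = 1e-10)$root

round(lcl, 2)   # lower limit of the one-sided interval [LCL, Inf)
```

The result should agree with the LCL of 29.82 reported in the Examples section.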

Paired-Sample Case (paired=TRUE)
When the argument paired=TRUE, the arguments x and y are assumed to have the same length, and the $n$ differences $$d_i = x_i - y_i, \;\; i = 1, 2, \ldots, n$$ are assumed to be i.i.d. observations from some distribution with mean $\mu$ and standard deviation $\sigma$. Chen's modified t-test can then be applied to the differences.
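A minimal sketch of the paired case, using made-up measurements: the differences are formed first, and the one-sample statistic is then computed on them. Here the hypothesized mean difference is taken to be 0.

```r
# Hypothetical paired measurements on the same 8 units
x <- c(12.1, 15.3, 11.8, 30.2, 13.5, 14.0, 45.7, 12.9)
y <- c(11.0, 14.9, 11.5, 18.4, 13.1, 13.2, 20.3, 12.6)

d   <- x - y    # paired differences d_i = x_i - y_i
mu0 <- 0        # hypothesized mean difference

# One-sample computation applied to the differences
n    <- length(d)
s    <- sd(d)
t    <- (mean(d) - mu0) / (s / sqrt(n))
skew <- n / ((n - 1) * (n - 2)) * sum((d - mean(d))^3) / s^3
a    <- skew / (6 * sqrt(n))
t2   <- t + a * (1 + 2 * t^2) + 4 * a^2 * (t + 2 * t^3)

c(skew = skew, t = t, t2 = t2)
```

The differences here are positively skewed, so the upper one-sided alternative is the appropriate one. With EnvStats loaded, the corresponding call would be chenTTest(x, y, paired = TRUE).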

References

Chen, L. (1995b). Testing the Mean of Skewed Distributions. Journal of the American Statistical Association 90(430), 767--772.

Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 28, 31.

Land, C.E. (1971). Confidence Intervals for Linear Functions of the Normal Mean and Variance. The Annals of Mathematical Statistics 42(4), 1187--1205.

Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.402--404.

Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (1996c). Soil Screening Guidance: Technical Background Document. EPA/540/R-95/128, PB96963502. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., May, 1996.

USEPA. (2002d). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.

Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.

Examples

  # The guidance document "Calculating Upper Confidence Limits for 
  # Exposure Point Concentrations at Hazardous Waste Sites" 
  # (USEPA, 2002d, Exhibit 9, p. 16) contains an example of 60 observations 
  # from an exposure unit.  Here we will use Chen's modified t-test to test 
  # the null hypothesis that the average concentration is at most 30 mg/L 
  # versus the alternative that it is greater than 30 mg/L.
  # In EnvStats these data are stored in the vector EPA.02d.Ex.9.mg.per.L.vec.

  sort(EPA.02d.Ex.9.mg.per.L.vec)
  # [1]  16  17  17  17  18  18  20  20  20  21  21  21  21  21  21  22
  #[17]  22  22  23  23  23  23  24  24  24  25  25  25  25  25  25  26
  #[33]  26  26  26  27  27  28  28  28  28  29  29  30  30  31  32  32
  #[49]  32  33  33  35  35  97  98 105 107 111 117 119

  dev.new()
  hist(EPA.02d.Ex.9.mg.per.L.vec, col = "cyan", xlab = "Concentration (mg/L)")

  # The Shapiro-Wilk goodness-of-fit test rejects the null hypothesis of a 
  # normal, lognormal, and gamma distribution:

  gofTest(EPA.02d.Ex.9.mg.per.L.vec)$p.value
  #[1] 2.496781e-12

  gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "lnorm")$p.value
  #[1] 3.349035e-09

  gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "gamma")$p.value
  #[1] 1.564341e-10


  # Use Chen's modified t-test to test the null hypothesis that
  # the average concentration is at most 30 mg/L versus the 
  # alternative that it is greater than 30 mg/L.

  chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30)

  #Results of Hypothesis Test
  #--------------------------
  #
  #Null Hypothesis:                 mean = 30
  #
  #Alternative Hypothesis:          True mean is greater than 30
  #
  #Test Name:                       One-sample t-Test
  #                                 Modified for
  #                                 Positively-Skewed Distributions
  #                                 (Chen, 1995)
  #
  #Estimated Parameter(s):          mean = 34.566667
  #                                 sd   = 27.330598
  #                                 skew =  2.365778
  #
  #Data:                            EPA.02d.Ex.9.mg.per.L.vec
  #
  #Sample Size:                     60
  #
  #Test Statistic:                  t = 1.574075
  #
  #Test Statistic Parameter:        df = 59
  #
  #P-values:                        z               = 0.05773508
  #                                 t               = 0.06040889
  #                                 Avg. of z and t = 0.05907199
  #
  #Confidence Interval for:         mean
  #
  #Confidence Interval Method:      Based on z
  #
  #Confidence Interval Type:        Lower
  #
  #Confidence Level:                95%
  #
  #Confidence Interval:             LCL = 29.82
  #                                 UCL =   Inf

  # The estimated mean, standard deviation, and skew are 35, 27, and 2.4, 
  # respectively.  The p-value is 0.06, and the lower 95% confidence interval 
  # is [29.8, Inf).  Depending on what you use for your Type I error rate, you 
  # may or may not want to reject the null hypothesis.
