LcKS: Lilliefors-corrected Kolmogorov-Smirnov Goodness-Of-Fit Test

Description

Implements the Lilliefors-corrected Kolmogorov-Smirnov test for use in goodness-of-fit tests, suitable when population parameters are unknown and must be estimated by sample statistics. It uses Monte Carlo simulation to estimate p-values. Using a modification of ks.test, it can be used with a variety of continuous distributions, including normal, lognormal, univariate mixtures of normals, uniform, loguniform, exponential, gamma, and Weibull distributions. The Monte Carlo algorithm can run 'in parallel.'

Usage

LcKS(x, cdf, nreps = 4999, G = 1:9, varModel = c("E", "V"),
  parallel = FALSE, cores = NULL)

Arguments

x

A numeric vector of data values (observed sample).

cdf

Character string naming a cumulative distribution function. Case insensitive. Only continuous CDFs are valid. Allowed CDFs include:

"pnorm" for normal,
"pmixnorm" for (univariate) normal mixture,
"plnorm" for lognormal (log-normal, log normal),
"punif" for uniform,
"plunif" for loguniform (log-uniform, log uniform),
"pexp" for exponential,
"pgamma" for gamma,
"pweibull" for Weibull.

nreps

Number of replicates to use in simulation algorithm. Default = 4999 replicates. See details below. Should be a positive integer.

G

Numeric vector of mixture components to consider, for mixture models only. Default = 1:9 fits up to 9 components. Must contain positive integers. See details below.

varModel

For mixture models, character string determining whether to allow equal-variance mixture components (E), variable-variance mixture components (V) or both (the default).

parallel

Logical value that switches between running Monte Carlo algorithm in parallel (if TRUE) or not (if FALSE, the default).

cores

Numeric value controlling how many cores to use when running in parallel. Default = detectCores() - 1.

Value

A list containing the following components:

D.obs

The value of the test statistic D for the observed sample.

D.sim

Simulation distribution of test statistics, with length = nreps. This can be used to calculate critical values; see examples.

p.value

p-value of the test, calculated as \((\sum(D.sim > D.obs) + 1) / (nreps + 1)\).

Details

The function builds a simulation distribution D.sim of length nreps by drawing random samples from the specified continuous distribution function cdf with parameters calculated from the provided sample x. Observed statistic D and simulated test statistics are calculated using a simplified version of ks.test.
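The simulation step described above can be sketched in base R for the normal case. This is a minimal illustration, not the package's internal code: the helper ks_stat_norm is a hypothetical name, and estimating the parameters with mean() and sd() is an assumption made for the sketch.

```r
# Hypothetical helper: KS distance between a sample and a normal CDF whose
# parameters are estimated from that same sample.
ks_stat_norm <- function(x) {
  n <- length(x)
  Fx <- pnorm(sort(x), mean = mean(x), sd = sd(x))
  max(c(seq_len(n) / n - Fx, Fx - (seq_len(n) - 1) / n))
}

set.seed(1)
x <- rnorm(25)
nreps <- 999

D.obs <- ks_stat_norm(x)

# Each replicate is drawn from the normal fitted to x, and its parameters are
# re-estimated inside ks_stat_norm before D is computed.
D.sim <- replicate(nreps, ks_stat_norm(rnorm(length(x), mean(x), sd(x))))

length(D.sim)
D.obs
```

Re-estimating the parameters for every simulated sample is what distinguishes this from comparing D.obs with the standard Kolmogorov-Smirnov null distribution.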

The default nreps = 4999 provides accurate p-values. nreps = 1999 is sufficient for most cases, and computationally faster when dealing with more complicated distributions (such as univariate normal mixtures, gamma, and Weibull). See below for potentially faster parallel implementations.

The p-value is calculated as the number of Monte Carlo samples with test statistics D as extreme as or more extreme than that of the observed sample D.obs, divided by the number of Monte Carlo samples nreps. A value of 1 is added to both the numerator and denominator to allow the observed sample to be represented within the null distribution (Manly 2004); this avoids a nonsensical p.value = 0.000 and accounts for the fact that the p-value is an estimate.
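As a small worked illustration of the formula given under Value, the following sketch uses placeholder numbers (not real KS statistics) in which the observed statistic exceeds every simulated one. It shows that the smallest p-value the estimator can return is 1 / (nreps + 1).

```r
set.seed(42)
nreps <- 4999
D.sim <- runif(nreps, min = 0.05, max = 0.25)  # placeholder values only
D.obs <- 0.30                                  # larger than every D.sim

p.value <- (sum(D.sim > D.obs) + 1) / (nreps + 1)
p.value  # (0 + 1) / 5000 = 2e-04
```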

Parameters of the specified continuous distribution are estimated by maximum likelihood. When testing against the gamma and Weibull distributions, MASS::fitdistr is used to calculate parameter estimates using maximum likelihood optimization, with sensible starting values. Because this incorporates an optimization routine, the simulation algorithm can be slow if using large nreps or problematic samples. Warnings often occur during these optimizations, caused by difficulties estimating sample statistic standard errors. Because such SEs are not used in the Lilliefors-corrected simulation algorithm, warnings are suppressed during these optimizations.
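The gamma case might look roughly like the sketch below. It assumes fitdistr's default starting values and uses ks.test only to obtain the distance D; the exact calls made inside LcKS may differ.

```r
library(MASS)

set.seed(7)
x <- rgamma(50, shape = 2, rate = 0.5)

# Warnings here usually concern standard errors, which the test does not use.
fit <- suppressWarnings(fitdistr(x, densfun = "gamma"))
est <- fit$estimate  # named vector with elements "shape" and "rate"

# Only the statistic is kept; the p-value reported by ks.test is not valid
# when the parameters were estimated from the same sample.
D.obs <- unname(ks.test(x, "pgamma", shape = est[["shape"]],
                        rate = est[["rate"]])$statistic)
D.obs
```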

Sample statistics for the (univariate) normal mixture distribution pmixnorm are calculated using package mclust, which uses BIC to identify the optimal mixture model for the sample, and the EM algorithm to calculate parameter estimates for this model. The number of mixture components G (with default allowing up to 9 components), variance model (whether equal E or variable V variance), and component statistics (means, sds, and mixing proportions pro) are estimated from the sample when calculating D.obs and passed internally when creating random Monte Carlo samples. It is possible that some of these samples may differ in their optimal G (for example a two-component input sample might yield a three-component random sample within the simulation distribution). This can be constrained by specifying that simulation BIC-optimizations only consider G mixture components.

Be aware that constraining G changes the null hypothesis. The default (G = 1:9) null hypothesis is that a sample was drawn from any G = 1:9-component mixture distribution. Specifying a particular value, such as G = 2, restricts the null hypothesis to particular mixture distributions with just G components, even if simulated samples might better be represented as different mixture models.
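The model-selection step can be explored directly with mclust. The sketch below assumes that the G and varModel arguments correspond to Mclust's G and modelNames arguments; it shows where the selected number of components and the component statistics can be read from the fitted object.

```r
library(mclust)

set.seed(11)
x <- c(rnorm(60, mean = 10, sd = 2), rnorm(140, mean = 20, sd = 2))

# BIC chooses among 1 to 9 components and the E and V variance models.
fit <- Mclust(x, G = 1:9, modelNames = c("E", "V"), verbose = FALSE)
fit$G                                   # number of components selected
fit$modelName                           # variance model selected
fit$parameters$mean                     # component means
fit$parameters$pro                      # mixing proportions
sqrt(fit$parameters$variance$sigmasq)   # component standard deviation(s)

# Restricting G narrows the candidate models (and hence the null hypothesis)
# to two-component mixtures only.
fit2 <- Mclust(x, G = 2, modelNames = c("E", "V"), verbose = FALSE)
fit2$G
```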

The LcKS(cdf = "pmixnorm") test implements two control loops to avoid errors caused by this constraint and when working with problematic samples. The first loop occurs during model selection for the observed sample x, and allows for estimation of parameters for the second-best model when those for the optimal model cannot be calculated by the EM algorithm. A second loop occurs during the simulation algorithm, rejecting samples that cannot be fit by the mixture model specified by the observed sample x. Such problematic cases are most common when the observed or simulated samples have one or more components with very small variance (e.g., duplicate observations) or when a Monte Carlo sample cannot be fit by the specified G.

Parallel computing can be enabled with parallel = TRUE, which uses the cross-platform doParallel package and foreach infrastructure with a default of detectCores() - 1 cores. Parallel computing is generally advisable for the more complicated cumulative distribution functions (i.e., univariate normal mixture, gamma, Weibull), where maximum likelihood estimation is time-intensive, but is generally not advisable for distribution functions with quickly calculated sample statistics (i.e., the other distribution functions). Warnings within the function provide sensible recommendations, but users are encouraged to experiment to find the fastest implementation for their individual cases.
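For readers unfamiliar with foreach, a parallel Monte Carlo loop of this kind can be sketched as follows. The helper ks_stat_exp is a hypothetical name, the exponential distribution is chosen only to keep the example short, and two workers are used instead of detectCores() - 1; none of this is taken from the LcKS source.

```r
library(doParallel)  # also attaches foreach and parallel

# Hypothetical helper: KS distance to an exponential CDF with estimated rate.
ks_stat_exp <- function(x) {
  unname(ks.test(x, "pexp", rate = 1 / mean(x))$statistic)
}

set.seed(5)
x <- rexp(40)
nreps <- 200

cl <- makeCluster(2)      # small fixed number of workers for the sketch
registerDoParallel(cl)

# Each worker simulates from the fitted exponential and recomputes D.
D.sim <- foreach(i = seq_len(nreps), .combine = c) %dopar% {
  ks_stat_exp(rexp(length(x), rate = 1 / mean(x)))
}

stopCluster(cl)
length(D.sim)
```

For a distribution this cheap to fit, the cost of starting the cluster is likely to outweigh any gain, which is consistent with the advice above to reserve parallel runs for the slower distributions.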

References

Lilliefors, H. W. 1967. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association 62(318):399-402.

Lilliefors, H. W. 1969. On the Kolmogorov-Smirnov test for the exponential distribution with mean unknown. Journal of the American Statistical Association 64(325):387-389.

Manly, B. F. J. 2004. Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall, Cornwall, Great Britain.

Parsons, F. G., and P. H. Wirsching. 1982. A Kolmogorov-Smirnov goodness-of-fit test for the two-parameter Weibull distribution when the parameters are estimated from the data. Microelectronics Reliability 22(2):163-167.

Examples

# NOT RUN {
x <- runif(200)
Lc <- LcKS(x, cdf = "pnorm", nreps = 999)
hist(Lc$D.sim)
abline(v = Lc$D.obs, lty = 2)
print(Lc, max = 50)  # Print first 50 simulated statistics
# Approximate p-value (usually) << 0.05

# Confirm that the uncorrected version has an increased Type II error rate when
#   using sample statistics to estimate parameters:
ks.test(x, "pnorm", mean(x), sd(x))   # p-value always larger, (usually) > 0.05

# Confirm critical values for normal distribution are correct
nreps <- 9999
x <- rnorm(25)
Lc <- LcKS(x, "pnorm", nreps = nreps)
sim.Ds <- sort(Lc$D.sim)
crit <- round(c(.8, .85, .9, .95, .99) * nreps, 0)
# Lilliefors' (1967) critical values, using improved values from
#   Parsons & Wirsching (1982) (for n = 25):
# 0.141 0.148 0.157 0.172 0.201
round(sim.Ds[crit], 3)			# Approximately the same critical values

# Confirm critical values for exponential are the same as reported by Lilliefors (1969)
nreps <- 9999
x <- rexp(25)
Lc <- LcKS(x, "pexp", nreps = nreps)
sim.Ds <- sort(Lc$D.sim)
crit <- round(c(.8, .85, .9, .95, .99) * nreps, 0)
# Lilliefors' (1969) critical values (for n = 25):
# 0.170 0.180 0.191 0.210 0.247
round(sim.Ds[crit], 3)			# Approximately the same critical values

# }
# NOT RUN {
# Gamma and Weibull tests require functions from the 'MASS' package
# Takes time for maximum likelihood optimization of statistics
require(MASS)
x <- runif(100, min = 1, max = 100)
Lc <- LcKS(x, cdf = "pgamma", nreps = 499)
Lc$p.value

# Confirm critical values for Weibull the same as reported by Parsons & Wirsching (1982)
nreps <- 9999
x <- rweibull(25, shape = 1, scale = 1)
Lc <- LcKS(x, "pweibull", nreps = nreps)
sim.Ds <- sort(Lc$D.sim)
crit <- round(c(.8, .85, .9, .95, .99) * nreps, 0)
# Parsons & Wirsching (1982) critical values (for n = 25):
# 0.141 0.148 0.157 0.172 0.201
round(sim.Ds[crit], 3)			# Approximately the same critical values

# Mixture test requires functions from the 'mclust' package
# Takes time to identify model parameters
require(mclust)
x <- rmixnorm(200, mean = c(10, 20), sd = 2, pro = c(1,3))
Lc <- LcKS(x, cdf = "pmixnorm", nreps = 499, G = 1:9)   # Default G (1:9) takes long time
Lc$p.value
G <- Mclust(x)$parameters$variance$G              # Optimal model has only two components
Lc <- LcKS(x, cdf = "pmixnorm", nreps = 499, G = G)     # Restricting to likely G saves time
# But note changes null hypothesis: now testing against just two-component mixture
Lc$p.value

# Running 'in parallel'
require(doParallel)
set.seed(3124)
x <- rmixnorm(300, mean = c(110, 190, 200), sd = c(3, 15, .1), pro = c(1, 3, 1))
system.time(LcKS(x, "pgamma"))
system.time(LcKS(x, "pgamma", parallel = TRUE)) # Should be faster
# }
