
DHARMa (version 0.3.3.0)

testOutliers: Test for outliers

Description

This function tests whether the number of observations outside the simulation envelope is larger or smaller than expected.

Usage

testOutliers(simulationOutput, alternative = c("two.sided", "greater",
  "less"), margin = c("both", "upper", "lower"), type = c("bootstrap",
  "binomial"), nBoot = 100, plot = T)

Arguments

simulationOutput

an object of class DHARMa with simulated quantile residuals, either created via simulateResiduals or by createDHARMa for simulations created outside DHARMa

alternative

a character string specifying whether the test should test if observations are "greater", "less" or "two.sided" (default) compared to the simulated null hypothesis

margin

whether to test for outliers only at the lower, only at the upper, or both sides (default) of the simulated data distribution

type

how the null expectation (H0) for the test is generated. See details

nBoot

number of bootstrap replicates. Only used if type = "bootstrap"

plot

if TRUE, the function will create an additional plot

Details

DHARMa residuals are created by simulating from the fitted model and comparing the simulated values to the observed data. It can occur that all simulated values are higher or lower than the observed value, in which case the observation gets a residual of 0 or 1, respectively. I refer to these values as simulation outliers, or simply outliers.
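
To illustrate the idea (this is a simplified sketch, not DHARMa's internal code; the observed value and the simulations below are made up), a quantile residual for one observation is essentially the fraction of simulations below the observed value:

nSim <- 250
sims <- rpois(nSim, lambda = 5)   # hypothetical simulations from a fitted model
y <- 20                           # hypothetical observed value
res <- mean(sims < y)             # empirical quantile of y within the simulations
res                               # all simulations below y -> res = 1 (upper outlier)
                                  # all simulations above y -> res = 0 (lower outlier)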

Because no data was simulated in the range of the observed value, we don't know "how strongly" these values deviate from the model expectation, so the term "outlier" should be used with a grain of salt. It is not a judgment about the magnitude of the residual deviation, but simply a dichotomous sign that we are outside the simulated range.

Moreover, the number of outliers will decrease as the number of simulations increases. Under the null hypothesis that the model is correct, and for continuous distributions, the probability that a given observation is higher than all nSim simulations is 1/(nSim + 1), so the number of such outliers is binomially distributed. The testOutliers function can test this null hypothesis via type = "binomial".
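
As a rough sketch of this idea (not necessarily the exact implementation behind type = "binomial"; the counts below are made up, and only one margin is considered), the observed number of outliers can be compared against the binomial expectation with p = 1/(nSim + 1):

nSim <- 250; nData <- 200
nOutliers <- 5                                    # hypothetical number of upper-margin outliers
binom.test(nOutliers, nData, p = 1 / (nSim + 1))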

For integer-valued distributions, which are randomized via the PIT procedure (see simulateResiduals), the rate of "true" outliers is more difficult to calculate and is in general not 1/(nSim + 1). The testOutliers function implements a small tweak that calculates the rate of residuals that are closer than 1/(nSim + 1) to the 0/1 border, which roughly occur at a rate of nData/(nSim + 1). This approximation, however, is generally not exact.
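
A sketch of this tweak, assuming a DHARMa object simulationOutput with the slots scaledResiduals and nSim, as created in the Examples below (this mirrors the description above rather than reproducing the internal code):

res  <- simulationOutput$scaledResiduals
nSim <- simulationOutput$nSim
mean(res < 1 / (nSim + 1) | res > 1 - 1 / (nSim + 1))   # observed rate of near-boundary residuals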

For this reason, the testOutliers function implements an alternative procedure that uses a bootstrap to generate the expectation for the outliers. The bootstrap is recommended for all integer-valued distributions, in particular for unbounded integer-valued distributions (such as Poisson or negative binomial), ideally with reasonably high values of nSim and nBoot (I recommend at least 1000 for both, but note that the runtime is then relatively high).
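
Following this recommendation, a possible call (using the fittedModel from the Examples below; note the substantial runtime) would be:

simulationOutput2 <- simulateResiduals(fittedModel = fittedModel, n = 1000)
testOutliers(simulationOutput2, type = "bootstrap", nBoot = 1000)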

Both the binomial and the bootstrap option generate a null expectation, and then test for an excess or a lack of outliers. By default, testOutliers() looks for both, so if you get a significant p-value, you have to check whether there are too many or too few outliers. An excess of outliers is to be interpreted as too many values outside the simulation envelope; this could be caused by overdispersion, or by what we classically call outliers. A lack of outliers would be caused, for example, by underdispersion.

See Also

testResiduals, testUniformity, testDispersion, testZeroInflation, testGeneric, testTemporalAutocorrelation, testSpatialAutocorrelation, testQuantiles

Examples

library(DHARMa)
set.seed(123)

testData = createData(sampleSize = 200, overdispersion = 1, randomEffectVariance = 0)
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)
simulationOutput <- simulateResiduals(fittedModel = fittedModel)

# default outlier test (with plot)
testOutliers(simulationOutput)

# note that default is to test outliers at both margins for both an excess and a lack
# of outliers. Here we see that we mostly have an excess of outliers at the upper
# margin. You see that it is an excess because the observed frequency of outliers is 0.055,
# while the expected frequency is 0.008

# Let's see what would have happened if we had just checked the lower margin
testOutliers(simulationOutput, margin = "lower", plot = FALSE)

# OK, now the frequency of outliers is 0, so we have too few, but this is not significant
# against the expectation

# just for completeness, what would have happened if we had checked both
# margins, but only for a lack of outliers (i.e. underdispersion)

testOutliers(simulationOutput, alternative = "less", plot = FALSE)
