Calibration or reliability of forecasts is the ability of a model to
correctly identify its own uncertainty in making predictions. In a model
with perfect calibration, the observed data at each time point look as if
they came from the predictive probability distribution at that time.
Equivalently, one can inspect the probability integral transform of the
predictive distribution at time t,
$$
u_t = F_t (x_t)
$$
where \(x_t\) is the observed data point at time \(t \in \{t_1,
\ldots, t_n\}\), n being the number of forecasts, and \(F_t\) is
the (continuous) predictive cumulative probability distribution at time t. If
the true probability distribution of outcomes at time t is \(G_t\) then the
forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times t.
In that case, the values \(u_t\) are uniformly distributed on \([0, 1]\).
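
The continuous PIT can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the helper names `pit_continuous` and `tail_share`, the normal predictive distributions, and the simulated data are hypothetical and are not the implementation of the function documented here.

```python
import random
from statistics import NormalDist


def pit_continuous(observations, predictive_cdfs):
    """Return u_t = F_t(x_t) for each observation and its predictive CDF."""
    return [cdf(x) for x, cdf in zip(observations, predictive_cdfs)]


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Assumed toy setting: the true distribution G_t is Normal(mean_t, 1).
means = [rng.uniform(-5.0, 5.0) for _ in range(2000)]
observations = [rng.gauss(m, 1.0) for m in means]

# Ideal forecasts: F_t = G_t at every time point.
ideal_cdfs = [NormalDist(m, 1.0).cdf for m in means]
u_ideal = pit_continuous(observations, ideal_cdfs)

# Overconfident forecasts: correct centre, but half the true spread.
narrow_cdfs = [NormalDist(m, 0.5).cdf for m in means]
u_narrow = pit_continuous(observations, narrow_cdfs)


def tail_share(u):
    """Fraction of PIT values in the outer tenths of [0, 1]."""
    return sum(1 for x in u if x < 0.1 or x > 0.9) / len(u)


print(f"ideal forecasts:  tail share = {tail_share(u_ideal):.2f}")
print(f"narrow forecasts: tail share = {tail_share(u_narrow):.2f}")
```

For uniform PIT values the tail share should be close to 0.2. Forecasts that are too narrow push observations into the tails of the predictive distribution, so their PIT values pile up near 0 and 1.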
For discrete outcomes such as incidence counts,
the PIT is no longer uniform even when forecasts are ideal.
A randomised PIT can be used instead:
$$
u_t = P_t(k_t - 1) + v \cdot (P_t(k_t) - P_t(k_t - 1))
$$
where \(k_t\) is the observed count, \(P_t(k)\) is the predictive
cumulative probability of observing an incidence of at most k at time t,
\(P_t (-1) = 0\) by definition and v is standard uniform and independent
of \(k_t\). If \(P_t\) is the true cumulative
probability distribution, then \(u_t\) is standard uniform.
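
A randomised PIT draws \(u_t\) uniformly between \(P_t(k_t - 1)\) and \(P_t(k_t)\). The sketch below illustrates this under assumed conditions: Poisson forecasts that match the data-generating process, and hypothetical helper names (`poisson_cdf`, `sample_poisson`, `randomised_pit`) that are not part of the function documented here.

```python
import math
import random


def poisson_cdf(k, lam):
    """P(K <= k) for K ~ Poisson(lam); zero for k < 0, so P(-1) = 0."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def sample_poisson(lam, rng):
    """Draw a Poisson count with Knuth's method (adequate for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def randomised_pit(k, cdf, rng):
    """u = P(k - 1) + v * (P(k) - P(k - 1)) with v standard uniform."""
    lower, upper = cdf(k - 1), cdf(k)
    return lower + rng.random() * (upper - lower)


rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Assumed toy setting: counts are Poisson(rate_t) and the forecast is ideal.
rates = [rng.uniform(1.0, 6.0) for _ in range(5000)]
counts = [sample_poisson(lam, rng) for lam in rates]

u_random = [
    randomised_pit(k, lambda j, lam=lam: poisson_cdf(j, lam), rng)
    for k, lam in zip(counts, rates)
]
# Non-randomised PIT for comparison: biased upwards for discrete data.
u_plain = [poisson_cdf(k, lam) for k, lam in zip(counts, rates)]

print(f"mean randomised PIT: {sum(u_random) / len(u_random):.3f}")
print(f"mean plain PIT:      {sum(u_plain) / len(u_plain):.3f}")
```

With ideal forecasts the randomised values should average close to 0.5, whereas the plain values \(P_t(k_t)\) sit noticeably higher because each one takes the upper end of its interval.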
The function checks whether integer or continuous forecasts were provided.
It then applies the (randomised) probability integral transform and tests
the values \(u_t\) for uniformity using the
Anderson-Darling test.
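
The uniformity check can be sketched as follows. This is an illustration under stated assumptions, not the function's own code: the names `anderson_darling_uniform` and `ad_p_value` are hypothetical, and the p-value uses the asymptotic approximation of Marsaglia and Marsaglia (2004), which ignores finite-sample corrections.

```python
import math
import random


def anderson_darling_uniform(u):
    """A^2 statistic for the null hypothesis that u is standard uniform."""
    n = len(u)
    eps = 1e-12  # guard against log(0) when a PIT value is exactly 0 or 1
    s = sorted(min(max(x, eps), 1.0 - eps) for x in u)
    total = sum(
        (2 * i - 1) * (math.log(s[i - 1]) + math.log(1.0 - s[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


def ad_p_value(a2):
    """Asymptotic p-value for A^2 (assumed large-sample approximation)."""
    if a2 <= 0.0:
        return 1.0
    if a2 < 2.0:
        poly = 2.00012 + (0.247105 - (0.0649821 - (0.0347962
               - (0.011672 - 0.00168691 * a2) * a2) * a2) * a2) * a2
        cdf = math.exp(-1.2337141 / a2) / math.sqrt(a2) * poly
    else:
        cdf = math.exp(-math.exp(1.0776 - (2.30695 - (0.43424 - (0.082433
              - (0.008056 - 0.0003146 * a2) * a2) * a2) * a2) * a2))
    return min(max(1.0 - cdf, 0.0), 1.0)


rng = random.Random(3)  # fixed seed so the sketch is reproducible

# PIT values as they might look under good and poor calibration.
u_uniform = [rng.random() for _ in range(500)]
u_skewed = [rng.random() ** 3 for _ in range(500)]  # piled up near 0

for label, u in [("uniform", u_uniform), ("skewed", u_skewed)]:
    a2 = anderson_darling_uniform(u)
    print(f"{label}: A^2 = {a2:.3f}, p = {ad_p_value(a2):.4f}")
```

Large values of the statistic correspond to small p-values. The clipping constant `eps` is a pragmatic choice for this sketch and only matters when a PIT value is exactly 0 or 1.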
As a rule of thumb, there is no evidence that a forecasting model is
miscalibrated if the p-value is at least 0.1 (p >= 0.1),
some evidence that it is miscalibrated if 0.01 < p < 0.1, and good
evidence that it is miscalibrated if p <= 0.01. However, the
Anderson-Darling p-values may be overly strict, and their practical
usefulness may be questionable.
Note also that uniformity of the
PIT is a necessary but not sufficient condition for calibration.