scoringutils (version 0.1.7.2)

eval_forecasts: Evaluate forecasts

Description

The function eval_forecasts is an easy-to-use wrapper around the lower-level functions in the scoringutils package. It can be used to score probabilistic or quantile forecasts of continuous, integer-valued or binary variables.

Usage

eval_forecasts(
  data = NULL,
  by = NULL,
  summarise_by = by,
  metrics = NULL,
  quantiles = c(),
  sd = FALSE,
  interval_score_arguments = list(weigh = TRUE, count_median_twice = FALSE,
    separate_results = TRUE),
  pit_plots = FALSE,
  summarised = TRUE,
  verbose = TRUE,
  forecasts = NULL,
  truth_data = NULL,
  merge_by = NULL,
  compute_relative_skill = FALSE,
  rel_skill_metric = "auto",
  baseline = NULL
)

Arguments

data

A data.frame or data.table with the predictions and observations. Note: it is easiest to have a look at the example files provided in the package and in the examples below. The following columns need to be present:

  • true_value - the true observed values

  • prediction - predictions or predictive samples for one true value. (The prediction column can be omitted only if you want to score quantile forecasts provided in the wide range format.)

For integer and continuous forecasts a sample column is needed:

  • sample - an index to identify the predictive samples in the prediction column generated by one model for one true value. Only necessary for continuous and integer forecasts, not for binary predictions.

For quantile forecasts the data can be provided in a variety of formats. You can either use a range-based format or a quantile-based format. (You can convert between formats using quantile_to_range_long, range_long_to_quantile, sample_to_range_long and sample_to_quantile.) Short sketches of the different formats are given at the end of this section. For a quantile-based forecast you should provide:

  • prediction - the predicted value for the corresponding quantile

  • quantile - quantile to which the prediction corresponds

For a long range format forecast you need:

  • prediction - the quantile forecasts

  • boundary - either "lower" or "upper", depending on whether the prediction is for the lower or upper bound of a given range

  • range - the range for which a forecast was made. For a 50% interval the range should be 50. The forecast for the 25% quantile, for example, would have its value in the prediction column, a range of 50 and a boundary of "lower". If you want to score the median (i.e. range = 0), you still need to include a lower and an upper estimate, so the median has to appear twice.

Alternatively, you can provide the data in a wide range format. This format needs

  • pairs of columns called something like 'upper_90' and 'lower_90', or 'upper_50' and 'lower_50', where the number denotes the interval range. For the median, you need to provide columns called 'upper_0' and 'lower_0'.
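As an illustration (these data.frames are made up for this help page and are not the package's example data sets; the model column simply stands in for whatever columns identify a single forecast), the sample-based, quantile-based and long range formats could look like this:

sample_based <- data.frame(
  model = "example_model",
  true_value = 3,
  sample = 1:5,                     # index of the predictive sample
  prediction = c(2, 3, 3, 4, 5)     # one predictive sample per row
)

quantile_based <- data.frame(
  model = "example_model",
  true_value = 3,
  quantile = c(0.25, 0.5, 0.75),
  prediction = c(2, 3, 4)           # predicted value for each quantile
)

range_long <- data.frame(
  model = "example_model",
  true_value = 3,
  range = c(50, 50, 0, 0),
  boundary = c("lower", "upper", "lower", "upper"),
  prediction = c(2, 4, 3, 3)        # the median (range = 0) appears twice
)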

by

character vector of columns to group scoring by. This should be the lowest level of grouping possible, i.e. the unit of the individual observation. This is important as many functions work on individual observations. If you want a different level of aggregation, you should use summarise_by to aggregate the individual scores. Also note that the PIT will be computed using summarise_by instead of by.

summarise_by

character vector of columns to group the summary by. By default, this is equal to `by` and no summary takes place. But sometimes you may want to summarise over categories different from the scoring. summarise_by is also the grouping level used to compute (and possibly plot) the probability integral transform (PIT).
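For example, if every individual forecast is identified by a model, a forecast date and a location, a call could look like the following sketch (the forecast_date and location columns are purely illustrative, not part of the package's example data):

scores <- scoringutils::eval_forecasts(
  data,
  by = c("model", "forecast_date", "location"),  # unit of a single observation
  summarise_by = c("model")                      # average scores per model
)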

metrics

the metrics you want to have in the output. If `NULL` (the default), all available metrics will be computed.

quantiles

numeric vector of quantiles to be returned when summarising. Instead of just returning a mean, quantiles will be returned for the groups specified through `summarise_by`. By default, no quantiles are returned.

sd

if TRUE (the default is FALSE) the standard deviation of all metrics will be returned when summarising.

interval_score_arguments

list with arguments for the calculation of the interval score. These arguments get passed down to interval_score, except for the argument `count_median_twice`, which controls how the interval scores for different intervals are summed up. This should be a logical (default is FALSE) that indicates whether or not to count the median twice when summarising. Counting it twice conceptually treats the median as a 0% prediction interval, where the median is the lower as well as the upper bound. The alternative is to treat the median as a single quantile forecast instead of an interval; the interval score is then better understood as an average of quantile scores.
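For instance, to use the unweighted interval score and count the median twice, the list could be passed as in this sketch (only arguments shown in the Usage section above are set):

scores <- scoringutils::eval_forecasts(
  data,
  interval_score_arguments = list(weigh = FALSE,
                                  count_median_twice = TRUE)
)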

pit_plots

if TRUE (not the default), pit plots will be returned. For details see pit.

summarised

if TRUE (the default), scores are summarised, i.e. the mean is taken per group specified in `summarise_by`.

verbose

print out additional helpful messages (default is TRUE)

forecasts

data.frame with forecasts that should follow the same general guidelines as the `data` input. This argument can be used to supply forecasts and truth data independently. Default is `NULL`.

truth_data

data.frame with a column called `true_value` to be merged with `forecasts`

merge_by

character vector with column names that `forecasts` and `truth_data` should be merged on. Default is `NULL`, in which case the merge will be attempted automatically.
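A sketch of supplying forecasts and truth data separately (the merge columns are purely illustrative; with the default merge_by = NULL an automatic merge would be attempted instead):

scores <- scoringutils::eval_forecasts(
  forecasts = forecast_data,                   # contains the predictions
  truth_data = truth,                          # contains the true_value column
  merge_by = c("location", "forecast_date")    # hypothetical merge columns
)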

compute_relative_skill

logical, whether or not to compute relative performance between models. If `TRUE` (default is FALSE), then a column called 'model' must be present in the input data. For more information on the computation of relative skill, see pairwise_comparison. Relative skill will be calculated for the aggregation level specified in `summarise_by`.

rel_skill_metric

character string with the name of the metric for which relative skill shall be computed. If 'auto' (the default), one of interval score, CRPS or Brier score will be used where appropriate.

baseline

character string with the name of a model. If a baseline is given, then a scaled relative skill with respect to the baseline will be returned. By default (`NULL`), relative skill will not be scaled with respect to a baseline model.
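To compute relative skill scaled against a baseline, a call could look like this sketch (the model name "baseline_model" is a placeholder; the input data needs a 'model' column):

scores <- scoringutils::eval_forecasts(
  data,
  summarise_by = c("model"),
  compute_relative_skill = TRUE,
  baseline = "baseline_model"
)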

Value

A data.table with appropriate scores. For binary predictions, the Brier Score will be returned; for quantile predictions, the interval score as well as adapted metrics for calibration, sharpness and bias. For continuous and integer forecasts, Sharpness, Bias, DSS, CRPS, LogS and pit_p_val (as an indicator of calibration) are returned. For integer forecasts, pit_sd is returned (to account for the randomised PIT), but no Log Score is returned (the internal estimation relies on a kernel density estimate, which is difficult for integer-valued forecasts). If summarise_by is specified differently from by, the average score per summary unit is returned. If specified, quantiles and the standard deviation of scores can also be returned when summarising.

Details

The following metrics are used where appropriate:

  • Interval Score for quantile forecasts. Smaller is better. See interval_score for more information. By default, the weighted interval score is used (a formula is sketched after this list).

  • Brier Score for a probability forecast of a binary outcome. Smaller is better. See brier_score for more information.

  • aem Absolute error of the median prediction. Smaller is better.

  • Bias 0 is good, 1 and -1 are bad. See bias for more information.

  • Sharpness Smaller is better. See sharpness for more information.

  • Calibration represented through the p-value of the Anderson-Darling test for the uniformity of the Probability Integral Transformation (PIT). For integer valued forecasts, this p-value also has a standard deviation. Larger is better. See pit for more information.

  • DSS Dawid-Sebastiani-Score. Smaller is better. See dss for more information.

  • CRPS Continuous Ranked Probability Score. Smaller is better. See crps for more information.

  • Log Score Smaller is better. Only for continuous forecasts. See logs for more information.
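As background (this formula is not part of the package documentation but matches the common definition of the interval score), the score for a single central (1 - alpha) * 100% prediction interval with lower bound l, upper bound u and observation y is

IS_\alpha(F, y) = (u - l) + \frac{2}{\alpha} (l - y) \mathbf{1}(y < l) + \frac{2}{\alpha} (y - u) \mathbf{1}(y > u)

The weighted version used by default scales the score for each interval (commonly by alpha/2) before averaging over all intervals, which makes the result an approximation of the CRPS.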

References

Funk S, Camacho A, Kucharski AJ, Lowe R, Eggo RM, Edmunds WJ (2019) Assessing the performance of real-time epidemic forecasts: A case study of Ebola in the Western Area region of Sierra Leone, 2014-15. PLoS Comput Biol 15(2): e1006785. https://doi.org/10.1371/journal.pcbi.1006785

Examples

# NOT RUN {
## Probability Forecast for Binary Target
binary_example <- data.table::setDT(scoringutils::binary_example_data)
eval <- scoringutils::eval_forecasts(binary_example,
                                     summarise_by = c("model"),
                                     quantiles = c(0.5), sd = TRUE,
                                     verbose = FALSE)

## Quantile Forecasts
# wide format example (this example shows usage of both wide formats)
range_example_wide <- data.table::setDT(scoringutils::range_example_data_wide)
range_example <- scoringutils::range_wide_to_long(range_example_wide)
# equivalent:
wide2 <- data.table::setDT(scoringutils::range_example_data_semi_wide)
range_example <- scoringutils::range_wide_to_long(wide2)
eval <- scoringutils::eval_forecasts(range_example,
                                     summarise_by = "model",
                                     quantiles = c(0.05, 0.95),
                                     sd = TRUE)
eval <- scoringutils::eval_forecasts(range_example)

# long format

eval <- scoringutils::eval_forecasts(scoringutils::range_example_data_long,
                                     summarise_by = c("model", "range"))

## Integer Forecasts
integer_example <- data.table::setDT(scoringutils::integer_example_data)
eval <- scoringutils::eval_forecasts(integer_example,
                                     summarise_by = c("model"),
                                     quantiles = c(0.1, 0.9),
                                     sd = TRUE,
                                     pit_plots = TRUE)
eval <- scoringutils::eval_forecasts(integer_example)

## Continuous Forecasts
continuous_example <- data.table::setDT(scoringutils::continuous_example_data)
eval <- scoringutils::eval_forecasts(continuous_example)
eval <- scoringutils::eval_forecasts(continuous_example,
                                     quantiles = c(0.5, 0.9),
                                     sd = TRUE,
                                     summarise_by = c("model"))

# }
