The `scoringutils`

package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values. Predictions can either be automatically scored from a `data.frame`

using the function `eval_forecasts`

. Alternatively, evaluation metrics can be accessed directly using lower level functions within a vector/matrix framework.

Predictions can be handled in various formats: `scoringutils`

can handle probabilistic forecasts in either a sample based or a quantile based format. For more detail on the expected input formats please see below. True values can be integer, continuous or binary.

In addition to automatic scoring, `scoringutils`

offers a variety of plots and visualisations.

Most of the time, the `eval_forecasts`

function will be able to do the entire evaluation for you. The idea is simple, yet flexible.

All you need to do is to pass in a `data.frame`

that has a column called `prediction`

and one called `true_value`

. Depending on the exact input format, additional columns like `sample`

, `quantile`

or `range`

and `boundary`

are needed. Additional columns may be present to indicate a grouping of forecasts. For example, we could have forecasts made by different models in various locations at different time points, each for several weeks into the future. In this case, we would have additional columns called for example `model`

, `date`

, `forecast_date`

, `forecast_horizon`

and `location`

.

Using the `by`

argument you need to specify the *unit of a single forecast*. In this example here we would set `by = c("model", "date", "forecast_date", "forecast_horizon", "location")`

(note: if we want to be pedantic, there is a small duplication as the information of “date” is already included in the combination of “forecast_date” and “forecast_horizon”. But as long as there isn’t some weird shift, this doesn’t matter for the purpose of grouping our observations). If you don’t specify `by`

(i.e. `by = NULL`

), `scoringutils`

will automatically use all appropriate present columns. Note that you don’t need to include columns such as `quantile`

or `sample`

in the `by`

argument, as several quantiles / samples make up one forecast.

Using the `summarise_by`

argument you can now choose categories to aggregate over. If you were only interested in scores for the different models, you would specify `summarise_by = c("model")`

. If you wanted to have scores for every model in every location, you would need to specify `summarise_by = c("model", "location")`

. If you wanted to have one score per quantile or one per prediction interval range, you could specify something like `summarise_by = c("model", "quantile")`

or `summarise_by = c("model", "quantile", "range")`

(note again that some information is duplicated in quantile and range, but this doesn’t really matter for grouping purposes). When aggregating, `eval_forecasts`

takes the mean according to the group defined in `summarise_by`

(i.e. in this example, if `summarise_by = c("model", "location")`

, scores will be averaged over all forecast dates, forecast horizons and quantiles to yield one score per model and location). In addition to the mean, you can also obtain the standard deviation of the scores over which you average or any desired quantile (e.g. the median in addition to the mean) by specifying `sd = TRUE`

and `quantiles = c(0.5)`

.

Here is an example of an evaluation using the example data included in the package. The data comes from a set of Covid-19 short-term forecasts in the UK.

```
library(scoringutils)
#> Note: The definition of the weighted interval score has slightly changed in version 0.1.5. If you want to use the old definition, use the argument `count_median_twice = TRUE` in the function `eval_forecasts()`
library(data.table)
```

```
<- scoringutils::quantile_example_data
data print(data, 3, 3)
#> value_date value_type geography value_desc true_value
#> 1: 2020-05-04 hospital_inc England Hospital admissions 1043
#> 2: 2020-05-04 hospital_prev England Total beds occupied 10648
#> 3: 2020-05-11 hospital_inc England Hospital admissions 743
#> ---
#> 5150: 2020-08-03 death_inc_line Wales Deaths 1
#> 5151: 2020-08-03 death_inc_line Wales Deaths 1
#> 5152: 2020-08-03 death_inc_line Wales Deaths 1
#> model creation_date quantile prediction horizon
#> 1: <NA> <NA> NA NA NA
#> 2: <NA> <NA> NA NA NA
#> 3: <NA> <NA> NA NA NA
#> ---
#> 5150: SIRCOVID 2020-07-13 0.85 4 21
#> 5151: DetSEIRwithNB MCMC 2020-07-13 0.90 2 21
#> 5152: SIRCOVID 2020-07-13 0.90 6 21
<- scoringutils::eval_forecasts(data,
scores summarise_by = c("model", "quantile", "range"))
print(scores, 3, 3)
#> model quantile range interval_score sharpness underprediction
#> 1: DetSEIRwithNB MCMC 0.50 0 54.45528 0.000000 54.16260163
#> 2: DetSEIRwithNB MCMC 0.45 10 53.96138 6.310976 47.42276423
#> 3: DetSEIRwithNB MCMC 0.55 10 53.96138 6.310976 47.42276423
#> ---
#> 55: SIRCOVID 0.90 80 18.18000 17.368889 0.15555556
#> 56: SIRCOVID 0.05 90 11.69444 11.661111 0.03333333
#> 57: SIRCOVID 0.95 90 11.69444 11.661111 0.03333333
#> overprediction coverage coverage_deviation bias aem
#> 1: 0.2926829 0.3170732 0.31707317 -0.2333333 54.45528
#> 2: 0.2276423 0.4308943 0.33089431 -0.2333333 54.45528
#> 3: 0.2276423 0.4308943 0.33089431 -0.2333333 54.45528
#> ---
#> 55: 0.6555556 0.9333333 0.13333333 0.2255556 42.90000
#> 56: 0.0000000 0.9888889 0.08888889 0.2255556 42.90000
#> 57: 0.0000000 0.9888889 0.08888889 0.2255556 42.90000
#> quantile_coverage
#> 1: 0.4959350
#> 2: 0.4308943
#> 3: 0.5691057
#> ---
#> 55: 0.9777778
#> 56: 0.2666667
#> 57: 0.9888889
```

Using an appropriate level of summary, we can easily use the output for visualisation. The `scoringutils`

package offers some built-in functions to help get a sense of the data

```
#
# filtered_data <- data[geography == "England" &
# creation_date <= "2020-06-29" &
# value_desc == "Deaths"]
::plot_predictions(data = data,
scoringutilsfilter_both = list("geography == 'England'"),
filter_forecasts = list("creation_date == '2020-07-06'"),
filter_truth = list("as.Date(value_date) <= '2020-07-06'"),
x = "value_date",
range = c(0, 50, 90),
scale = "free",
facet_formula = value_desc ~ model)
```

```
<- scoringutils::eval_forecasts(data,
scores summarise_by = c("model"))
::score_table(scores) scoringutils
```

Given this level of aggregation, not all metrics may make sense. In this case, for example, averaging over different quantiles to compute quantile coverage does not make much sense. If you like, you can select specific metrics for the visualisation.

Let us look at calibration:

```
<- scoringutils::eval_forecasts(data,
scores summarise_by = c("model", "range", "quantile"))
::interval_coverage(scores) +
scoringutils::ggtitle("Interval Coverage")
ggplot2
::quantile_coverage(scores) +
scoringutils::ggtitle("Quantile Coverage") ggplot2
```

Let us look at the individual components of the weighted interval score:

```
<- scoringutils::eval_forecasts(data,
scores summarise_by = c("model", "value_desc"))
::wis_components(scores, facet_formula = ~ value_desc) scoringutils
```

We can also look at contributions to different metrics by range:

```
<- scoringutils::eval_forecasts(data,
scores summarise_by = c("model", "range", "value_desc"))
::range_plot(scores, y = "interval_score",
scoringutilsfacet_formula = ~ value_desc)
```

We can also visualise metrics using a heatmap:

```
<- scoringutils::eval_forecasts(data,
scores summarise_by = c("model", "horizon"))
<- scores[, horizon := as.factor(horizon)]
scores ::score_heatmap(scores,
scoringutilsx = "horizon", metric = "bias")
```

The `eval_forecasts`

function is designed to work with various different input formats. The following formats are currently supported:

quantile forecasts in either a plain quantile format or in a format that specifies interval ranges and the boundary of a given interval range.

```
print(scoringutils::quantile_example_data, 3, 3)
#> value_date value_type geography value_desc true_value
#> 1: 2020-05-04 hospital_inc England Hospital admissions 1043
#> 2: 2020-05-04 hospital_prev England Total beds occupied 10648
#> 3: 2020-05-11 hospital_inc England Hospital admissions 743
#> ---
#> 5150: 2020-08-03 death_inc_line Wales Deaths 1
#> 5151: 2020-08-03 death_inc_line Wales Deaths 1
#> 5152: 2020-08-03 death_inc_line Wales Deaths 1
#> model creation_date quantile prediction horizon
#> 1: <NA> <NA> NA NA NA
#> 2: <NA> <NA> NA NA NA
#> 3: <NA> <NA> NA NA NA
#> ---
#> 5150: SIRCOVID 2020-07-13 0.85 4 21
#> 5151: DetSEIRwithNB MCMC 2020-07-13 0.90 2 21
#> 5152: SIRCOVID 2020-07-13 0.90 6 21
print(scoringutils::range_example_data_long, 3, 3)
#> value_date value_type geography value_desc true_value
#> 1: 2020-05-04 hospital_inc England Hospital admissions 1043
#> 2: 2020-05-04 hospital_prev England Total beds occupied 10648
#> 3: 2020-05-11 hospital_inc England Hospital admissions 743
#> ---
#> 5417: 2020-07-27 death_inc_line Wales Deaths 1
#> 5418: 2020-08-03 death_inc_line Wales Deaths 1
#> 5419: 2020-08-03 death_inc_line Wales Deaths 1
#> model creation_date prediction horizon boundary range
#> 1: <NA> <NA> NA NA <NA> NA
#> 2: <NA> <NA> NA NA <NA> NA
#> 3: <NA> <NA> NA NA <NA> NA
#> ---
#> 5417: SIRCOVID 2020-07-13 1 14 upper 0
#> 5418: DetSEIRwithNB MCMC 2020-07-13 0 21 upper 0
#> 5419: SIRCOVID 2020-07-13 1 21 upper 0
print(scoringutils::range_example_data_wide, 3, 3)
#> value_date value_type geography value_desc true_value
#> 1: 2020-05-04 death_inc_line England Deaths 448
#> 2: 2020-05-04 death_inc_line Northern Ireland Deaths 9
#> 3: 2020-05-04 death_inc_line Scotland Deaths 40
#> ---
#> 344: 2020-08-03 hospital_prev England Total beds occupied 784
#> 345: 2020-08-03 hospital_prev Scotland Total beds occupied 265
#> 346: 2020-08-03 icu_prev Scotland ICU beds occupied 3
#> model creation_date horizon lower_0 lower_10 lower_20
#> 1: <NA> <NA> NA NA NA NA
#> 2: <NA> <NA> NA NA NA NA
#> 3: <NA> <NA> NA NA NA NA
#> ---
#> 344: <NA> <NA> NA NA NA NA
#> 345: <NA> <NA> NA NA NA NA
#> 346: DetSEIRwithNB MCMC 2020-07-13 21 2 2 2
#> lower_30 lower_40 lower_50 lower_60 lower_70 lower_80 lower_90 upper_0
#> 1: NA NA NA NA NA NA NA NA
#> 2: NA NA NA NA NA NA NA NA
#> 3: NA NA NA NA NA NA NA NA
#> ---
#> 344: NA NA NA NA NA NA NA NA
#> 345: NA NA NA NA NA NA NA NA
#> 346: 2 2 1 1 1 1 0 2
#> upper_10 upper_20 upper_30 upper_40 upper_50 upper_60 upper_70 upper_80
#> 1: NA NA NA NA NA NA NA NA
#> 2: NA NA NA NA NA NA NA NA
#> 3: NA NA NA NA NA NA NA NA
#> ---
#> 344: NA NA NA NA NA NA NA NA
#> 345: NA NA NA NA NA NA NA NA
#> 346: 3 3 3 3 4 4 4 5
#> upper_90
#> 1: NA
#> 2: NA
#> 3: NA
#> ---
#> 344: NA
#> 345: NA
#> 346: 6
```

sample based format with either continuous or integer values

```
print(scoringutils::integer_example_data, 3, 3)
#> value_date value_type geography value_desc model
#> 1: 2020-05-04 hospital_inc England Hospital admissions <NA>
#> 2: 2020-05-04 hospital_prev England Total beds occupied <NA>
#> 3: 2020-05-11 hospital_inc England Hospital admissions <NA>
#> ---
#> 13427: 2020-08-03 death_inc_line Wales Deaths SIRCOVID
#> 13428: 2020-08-03 death_inc_line Wales Deaths SIRCOVID
#> 13429: 2020-08-03 death_inc_line Wales Deaths SIRCOVID
#> creation_date horizon prediction sample true_value
#> 1: <NA> NA NA NA 1043
#> 2: <NA> NA NA NA 10648
#> 3: <NA> NA NA NA 743
#> ---
#> 13427: 2020-07-13 21 0 48 1
#> 13428: 2020-07-13 21 0 49 1
#> 13429: 2020-07-13 21 0 50 1
print(scoringutils::continuous_example_data, 3, 3)
#> value_date value_type geography value_desc model
#> 1: 2020-05-04 hospital_inc England Hospital admissions <NA>
#> 2: 2020-05-04 hospital_prev England Total beds occupied <NA>
#> 3: 2020-05-11 hospital_inc England Hospital admissions <NA>
#> ---
#> 13427: 2020-08-03 death_inc_line Wales Deaths SIRCOVID
#> 13428: 2020-08-03 death_inc_line Wales Deaths SIRCOVID
#> 13429: 2020-08-03 death_inc_line Wales Deaths SIRCOVID
#> creation_date horizon prediction sample true_value
#> 1: <NA> NA NA NA 1043
#> 2: <NA> NA NA NA 10648
#> 3: <NA> NA NA NA 743
#> ---
#> 13427: 2020-07-13 21 0.3340917507 48 1
#> 13428: 2020-07-13 21 0.3540187438 49 1
#> 13429: 2020-07-13 21 0.0001998965 50 1
```

forecasts in a binary format:

```
print(scoringutils::binary_example_data, 3, 3)
#> value_date value_type geography value_desc model
#> 1: 2020-05-04 hospital_inc England Hospital admissions <NA>
#> 2: 2020-05-04 hospital_prev England Total beds occupied <NA>
#> 3: 2020-05-11 hospital_inc England Hospital admissions <NA>
#> ---
#> 344: 2020-07-27 death_inc_line Wales Deaths SIRCOVID
#> 345: 2020-08-03 death_inc_line Wales Deaths DetSEIRwithNB MCMC
#> 346: 2020-08-03 death_inc_line Wales Deaths SIRCOVID
#> creation_date horizon prediction true_value
#> 1: <NA> NA NA NA
#> 2: <NA> NA NA NA
#> 3: <NA> NA NA NA
#> ---
#> 344: 2020-07-13 14 0.34 0
#> 345: 2020-07-13 21 0.22 1
#> 346: 2020-07-13 21 0.26 0
```

It also offers functionality to convert between these formats. For more information have a look at the documentation of the following functions:

```
::sample_to_quantile() # convert from sample based to quantile format
scoringutils::range_long_to_quantile() # convert from range format to plain quantile
scoringutils::quantile_to_range_long() # convert the other way round
scoringutils::range_wide_to_long() # convert range based format from wide to long
scoringutils::range_long_to_wide() # convert the other way round scoringutils
```

A variety of metrics and scoring rules can also be accessed directly through the `scoringutils`

package.

The following gives an overview of (most of) the implemented metrics.

The function `bias`

determines bias from predictive Monte-Carlo samples, automatically recognising whether forecasts are continuous or integer valued.

For continuous forecasts, Bias is measured as \[B_t (P_t, x_t) = 1 - 2 \cdot (P_t (x_t))\]

where \(P_t\) is the empirical cumulative distribution function of the prediction for the true value \(x_t\). Computationally, \(P_t (x_t)\) is just calculated as the fraction of predictive samples for \(x_t\) that are smaller than \(x_t\).

For integer valued forecasts, Bias is measured as

\[B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t + 1))\]

to adjust for the integer nature of the forecasts. In both cases, Bias can assume values between -1 and 1 and is 0 ideally.

```
## integer valued forecasts
<- rpois(30, lambda = 1:30)
true_values <- replicate(200, rpois(n = 30, lambda = 1:30))
predictions bias(true_values, predictions)
#> [1] -0.615 -0.105 -0.895 -0.125 0.000 -0.080 0.325 0.285 -0.945 -0.990
#> [11] 0.670 0.385 0.605 0.395 0.180 -0.875 0.270 0.655 0.840 0.085
#> [21] -0.955 0.370 -0.410 -0.115 0.690 -0.610 -0.780 -0.420 -0.610 -0.765
## continuous forecasts
<- rnorm(30, mean = 1:30)
true_values <- replicate(200, rnorm(30, mean = 1:30))
predictions bias(true_values, predictions)
#> [1] 0.17 0.09 0.86 0.87 0.37 -0.18 -0.84 0.28 -0.04 -0.40 0.85 0.74
#> [13] 0.63 -0.60 0.18 -0.98 1.00 -0.19 -0.61 0.40 -0.01 0.31 0.52 -0.52
#> [25] -0.35 0.58 0.37 -0.87 -0.83 -0.22
```

Calibration or reliability of forecasts is the ability of a model to correctly identify its own uncertainty in making predictions. In a model with perfect calibration, the observed data at each time point look as if they came from the predictive probability distribution at that time.

Equivalently, one can inspect the probability integral transform of the predictive distribution at time t,

\[u_t = F_t (x_t)\]

where \(x_t\) is the observed data point at time \(t \text{ in } t_1, …, t_n\), n being the number of forecasts, and \(F_t\) is the (continuous) predictive cumulative probability distribution at time t. If the true probability distribution of outcomes at time t is \(G_t\) then the forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times \(t\). In that case, the probabilities ut are distributed uniformly.

In the case of discrete outcomes such as incidence counts, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:

\[u_t = P_t(k_t) + v \cdot (P_t(k_t) - P_t(k_t - 1) )\]

where \(k_t\) is the observed count, \(P_t(x)\) is the predictive cumulative probability of observing incidence \(k\) at time \(t\), \(P_t (-1) = 0\) by definition and \(v\) is standard uniform and independent of \(k\). If \(P_t\) is the true cumulative probability distribution, then \(u_t\) is standard uniform.

The function checks whether integer or continuous forecasts were provided. It then applies the (randomised) probability integral and tests the values \(u_t\) for uniformity using the Anderson-Darling test.

As a rule of thumb, there is no evidence to suggest a forecasting model is miscalibrated if the p-value found was greater than a threshold of \(p >= 0.1\), some evidence that it was miscalibrated if \(0.01 < p < 0.1\), and good evidence that it was miscalibrated if \(p <= 0.01\). In this context it should be noted, though, that uniformity of the PIT is a necessary but not sufficient condition of calibration. It should als be noted that the test only works given sufficient samples, otherwise the Null hypothesis will often be rejected outright.

Wrapper around the `crps_sample`

function from the `scoringRules`

package. For more information look at the manuals from the `scoringRules`

package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.

```
<- rpois(30, lambda = 1:30)
true_values <- replicate(200, rpois(n = 30, lambda = 1:30))
predictions crps(true_values, predictions)
#> [1] 0.521050 0.633500 0.518600 0.642100 1.304650 1.546475 0.783700 2.565575
#> [9] 1.631875 1.188800 0.964525 1.299300 1.166925 0.876525 1.278900 1.164950
#> [17] 0.971600 3.270825 0.978750 1.317950 1.580125 1.142550 1.598300 1.433550
#> [25] 1.775475 4.722050 1.582800 5.441600 1.511750 1.465100
```

Wrapper around the `dss_sample`

function from the `scoringRules`

package. For more information look at the manuals from the `scoringRules`

package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.

```
<- rpois(30, lambda = 1:30)
true_values <- replicate(200, rpois(n = 30, lambda = 1:30))
predictions dss(true_values, predictions)
#> [1] 0.02469261 0.78390156 2.27761638 2.38684624 2.49340570 4.04028966
#> [7] 4.50257683 2.34778554 2.28442133 2.51475681 2.48235606 3.11642617
#> [13] 4.55349386 5.21757570 2.79817000 3.03363287 6.19321433 4.23659162
#> [19] 4.12915528 3.25309266 3.23198654 3.53401086 3.04884438 3.07643435
#> [25] 3.47692091 6.44457327 4.81439113 6.42344089 6.17941748 3.92567561
```

Wrapper around the `log_sample`

function from the `scoringRules`

package. For more information look at the manuals from the `scoringRules`

package. The function should not be used for integer valued forecasts. While Log Scores are in principle possible for integer valued forecasts they require a kernel density estimate which is not well defined for discrete values. Smaller values are better.

```
<- rnorm(30, mean = 1:30)
true_values <- replicate(200, rnorm(n = 30, mean = 1:30))
predictions logs(true_values, predictions)
#> [1] 1.3124651 1.5053204 1.0226910 3.9212832 1.0678010 2.1379663 1.3503934
#> [8] 1.6879367 1.2022707 0.8184592 1.0658446 1.7941600 1.4048805 1.5266651
#> [15] 1.9782177 1.2706816 1.7360610 1.0361746 1.1157336 1.1174888 1.8872156
#> [22] 1.7570332 1.1997242 1.1886026 0.9835935 1.6386980 0.9617363 2.2150274
#> [29] 1.4626931 1.7952708
```

The Brier score is a proper score rule that assesses the accuracy of probabilistic binary predictions. The outcomes can be either 0 or 1, the predictions must be a probability that the true outcome will be 1.

The Brier Score is then computed as the mean squared error between the probabilistic prediction and the true outcome.

\[\text{Brier_Score} = \frac{1}{N} \sum_{t = 1}^{n} (\text{prediction}_t - \text{outcome}_t)^2\]

```
<- sample(c(0,1), size = 30, replace = TRUE)
true_values <- runif(n = 30, min = 0, max = 1)
predictions
brier_score(true_values, predictions)
#> [1] 0.3935805
```

The Interval Score is a Proper Scoring Rule to score quantile predictions, following Gneiting and Raftery (2007). Smaller values are better.

The score is computed as

\[ \text{score} = (\text{upper} - \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{lower} - \text{true_value}) \cdot 1(\text{true_values} < \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{true_value} - \text{upper}) \cdot 1(\text{true_value} > \text{upper})\]

where \(1()\) is the indicator function and \(\alpha\) is the decimal value that indicates how much is outside the prediction interval. To improve usability, the user is asked to provide an interval range in percentage terms, i.e. interval_range = 90 (percent) for a 90 percent prediction interval. Correspondingly, the user would have to provide the 5% and 95% quantiles (the corresponding alpha would then be 0.1). No specific distribution is assumed, but the range has to be symmetric (i.e you can’t use the 0.1 quantile as the lower bound and the 0.7 quantile as the upper). Setting `weigh = TRUE`

will weigh the score by \(\frac{\alpha}{2}\) such that the Interval Score converges to the CRPS for increasing number of quantiles.

```
<- rnorm(30, mean = 1:30)
true_values <- 90
interval_range <- (100 - interval_range) / 100
alpha <- qnorm(alpha/2, rnorm(30, mean = 1:30))
lower <- qnorm((1- alpha/2), rnorm(30, mean = 1:30))
upper
interval_score(true_values = true_values,
lower = lower,
upper = upper,
interval_range = interval_range)
#> [1] 0.211132908 0.172013604 0.302209921 0.311720297 1.378214370 1.241004569
#> [7] 0.175782711 0.134653330 0.064041655 0.211989554 0.271432820 0.877267651
#> [13] 0.260809959 0.548756359 0.265993547 0.192761327 0.602578654 0.320843233
#> [19] 0.182406002 0.007189698 0.153981561 0.090803988 0.222892065 0.221680796
#> [25] 0.191490825 0.269578799 0.239607572 0.150371288 0.079979421 0.860197649
```