
Support aggregation-and-transformation scoring framework for multivariate forecasts #1120

@seabbs-bot

Description

Motivation

Pic et al. (2025) formalise a framework for constructing interpretable multivariate proper scoring rules by combining transformations with univariate scores and aggregating the results.

The framework (Corollary 1)

Let T = {T_i} be a set of transformations from R^d to R^k. For each i, define the transformed score S_{T_i}(F, y) = S(T_i(F), T_i(y)), where S is a scoring rule that is proper relative to T_i(F). Let w = {w_i} be non-negative weights. Then:

S_{S_T, w}(F, y) = Σ_i  w_i * S_{T_i}(F, y)

is proper relative to F. Strict propriety holds as soon as there exists some i with w_i > 0 such that S is strictly proper relative to T_i(F) and T_i is injective.

More generally, aggregations can take the form of any (strictly) isotonic transformation (not just weighted sums), such as a multiplicative structure for positive scoring rules (Ziel and Berk, 2019).
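As a concrete illustration of the weighted-sum construction, the sketch below combines a sample-based CRPS under two transforms (identity and log). This is a Python numerical sketch with hypothetical function names, not scoringutils API:

```python
import numpy as np

def crps_sample(samples, y):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|."""
    s = np.asarray(samples, dtype=float)
    return np.mean(np.abs(s - y)) - 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))

def aggregated_score(samples, y, transforms, weights):
    """Weighted sum of a proper score applied under several transforms."""
    return sum(w * crps_sample(t(samples), t(y))
               for t, w in zip(transforms, weights))

rng = np.random.default_rng(1)
x = rng.lognormal(mean=3.0, sigma=0.5, size=1000)  # positive forecast samples
y = 25.0
# Identity transform targets the raw scale; log targets relative errors.
score = aggregated_score(x, y, transforms=[lambda v: v, np.log], weights=[0.5, 0.5])
```

Because each summand is itself proper and the weights are non-negative, the combined score inherits propriety.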

Key insight

Any kernel score (which encapsulates the Brier score, CRPS, energy score, and variogram score) can be expressed as an aggregation of squared errors between transformations of the forecast-observation pair (Appendix D of Allen et al.). This means existing multivariate scores are special cases:

  • The variogram score is a pairwise difference transform + squared error
  • The energy score decomposes into aggregated squared errors between transformed quantities
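The variogram case can be made explicit: below is a Python sketch (hypothetical names) that computes the variogram score directly as a weighted aggregation of squared errors between the pairwise-difference transform of the observation and the ensemble mean of the same transform.

```python
import numpy as np
from itertools import combinations

def variogram_score(samples, y, p=0.5, w=None):
    """Variogram score of order p for an ensemble (n_samples x d)."""
    s = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    score = 0.0
    for i, j in combinations(range(y.size), 2):
        wij = 1.0 if w is None else w[i, j]
        t_obs = abs(y[i] - y[j]) ** p                               # transform of observation
        t_fc = np.mean(np.abs(s[:, i] - s[:, j]) ** p)              # forecast mean of transform
        score += wij * (t_obs - t_fc) ** 2                          # aggregated squared error
    return score

rng = np.random.default_rng(0)
ens = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=500)
obs = np.array([0.1, -0.2, 0.3])
vs = variogram_score(ens, obs)
```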

Limitations of current approaches the framework addresses

  • The energy score is more sensitive to mean misspecification than variance or dependence structure, with deteriorating discriminatory power in higher dimensions
  • The variogram score cannot detect equal bias across all components
  • Using multiple complementary scoring rules targeting specific forecast aspects is better than relying on a single strictly proper rule

Connection to discussion #1069

In discussion #1069, Nick Reich asked about scoring relative rankings across locations. While looking into this we also found the MMD kernel score (mmds_sample) alongside the energy and variogram scores. We chose not to implement mmds_sample because it uses a fixed Gaussian kernel (σ = 1) that doesn't work well with epi count data on different scales. However, the aggregation-and-transformation framework offers a path to making kernel-based scoring more practical: by transforming data first (e.g. log, standardise, threshold), the scale issues that make the raw MMD score unusable could be addressed. If we build better transform support, revisiting the kernel score becomes more viable. That said, better transform support would also make it possible to construct such scores by composition, so a dedicated implementation may not be needed at all.
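To illustrate the scale problem concretely, here is a rough Python sketch of a sample-based Gaussian-kernel (MMD) score with σ = 1 (a hypothetical `mmd_score`, not the `mmds_sample` implementation discussed above). On raw counts in the thousands the kernel saturates, so the score is nearly flat; a log transform brings forecast-observation distances back into the kernel's sensitive range:

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def mmd_score(samples, y, sigma=1.0):
    """Sample-based Gaussian-kernel score: 0.5 * E k(X, X') - E k(X, y)."""
    kxx = np.mean([[gauss_kernel(a, b, sigma) for b in samples] for a in samples])
    kxy = np.mean([gauss_kernel(a, y, sigma) for a in samples])
    return 0.5 * kxx - kxy

counts = np.random.default_rng(2).poisson(5000, size=(200, 2)).astype(float)
obs = np.array([5100.0, 4900.0])
raw = mmd_score(counts, obs)                      # kernel saturates: distances >> sigma
logged = mmd_score(np.log(counts), np.log(obs))   # log transform restores sensitivity
```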

Related research on weight and transform selection

Weights

  • Bolin & Wallin (2023) show that naive aggregation (e.g. averaging CRPS across locations) gives more importance to observations with large uncertainty, producing unintuitive rankings. They propose the scaled CRPS (SCRPS) which is locally scale invariant. Directly relevant to epi count data where different locations have very different case counts.
  • For variogram score weights, Allen et al. note that weights proportional to inverse distance between locations can increase the signal-to-noise ratio (connecting to Matheron's geostatistics framework).
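For illustration, an inverse-distance weight matrix for a variogram score over three locations might be built like this (Python sketch; the coordinates are made up):

```python
import numpy as np

# Hypothetical spatial coordinates of three reporting locations.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Pairwise distances, then inverse-distance weights; the diagonal is unused
# by the variogram score and left at zero.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
w = np.zeros_like(d)
mask = d > 0
w[mask] = 1.0 / d[mask]
```

Nearby pairs, whose pairwise differences carry more dependence signal, receive proportionally larger weights.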

Transforms

  • Allen, Ginsbourger & Ziegel (2023, SIAM/ASA JUQ) develop weighted multivariate kernel scores using transformations to emphasise high-impact events. They show the threshold-weighted CRPS (twCRPS) is a kernel score, and extend this to multivariate settings. This directly addresses "which transforms matter" for decision-relevant evaluation, including compound events where several moderate values interact.
  • Transform selection remains application-specific and largely empirical — no automated method exists. Allen et al. (2025) recommend starting simple (marginals, means, binary thresholds) and increasing complexity gradually.

Open research questions

  • How to systematically choose transforms for a given application
  • How scale dependence in aggregated scores affects model rankings
  • Bridging spatial verification methods with proper scoring rule theory

What scoringutils already has

What could be added

1. Threshold exceedance transforms

A helper to convert continuous forecasts to binary exceedance forecasts at specified thresholds, enabling Brier score evaluation of event probabilities. For example, "did cases exceed 1000?" scored with the Brier score across locations jointly.

# Conceptual API
transform_threshold(forecast, thresholds = c(100, 500, 1000))
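Under the hood, such a transform reduces to computing exceedance probabilities from the forecast samples and scoring them against the binary outcome. A minimal Python sketch of that computation (function names hypothetical, mirroring the conceptual API above):

```python
import numpy as np

def transform_threshold(samples, thresholds):
    """Exceedance probabilities P(X > t) from forecast samples, one per threshold."""
    s = np.asarray(samples, dtype=float)
    return np.array([np.mean(s > t) for t in thresholds])

def brier_score(prob, outcome):
    """Brier score for a binary event: (p - 1{event})^2."""
    return (prob - outcome) ** 2

rng = np.random.default_rng(3)
cases = rng.negative_binomial(n=10, p=0.01, size=1000).astype(float)  # mean ~990
thresholds = [100, 500, 1000]
probs = transform_threshold(cases, thresholds)
obs = 1200.0
scores = [brier_score(p, float(obs > t)) for p, t in zip(probs, thresholds)]
```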

2. Pairwise difference transforms

A helper to compute |x_i - x_j|^p across components within multivariate groups, making the variogram-like transform reusable with different base scores (not just squared error).
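For instance (Python sketch, hypothetical name), the transform maps a d-vector to its d(d-1)/2 pairwise differences, which can then be fed to any univariate base score:

```python
import numpy as np
from itertools import combinations

def pairwise_diff_transform(x, p=1.0):
    """Map a d-vector to its d*(d-1)/2 pairwise |x_i - x_j|^p values."""
    x = np.asarray(x, dtype=float)
    return np.array([abs(x[i] - x[j]) ** p
                     for i, j in combinations(range(len(x)), 2)])

pairwise_diff_transform(np.array([3.0, 1.0, 6.0]))  # -> [2., 3., 5.]
```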

3. Marginal decomposition

A helper to extract and score individual components (marginals) from a multivariate forecast separately, then aggregate. This naturally handles different scales since each component is scored on its own scale.
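A sketch of the idea in Python (sample-based CRPS per component, equal weights by default; names are hypothetical):

```python
import numpy as np

def crps_sample(samples, y):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|."""
    s = np.asarray(samples, dtype=float)
    return np.mean(np.abs(s - y)) - 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))

def marginal_score(samples, y, weights=None):
    """Score each component on its own scale, then take a weighted sum."""
    d = len(y)
    weights = np.ones(d) / d if weights is None else np.asarray(weights, dtype=float)
    return sum(w * crps_sample(samples[:, k], y[k]) for k, w in enumerate(weights))

rng = np.random.default_rng(4)
# Two components on very different scales (e.g. a small and a large location).
ens = np.column_stack([rng.normal(10, 2, 400), rng.normal(1000, 200, 400)])
total = marginal_score(ens, np.array([11.0, 900.0]))
```

By Corollary 1 (with coordinate projections as the transforms), this weighted sum of marginal CRPS values is itself a proper multivariate score, though not a strictly proper one, since projections discard the dependence structure.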

4. Scaled/weighted score aggregation

Support for combining multiple scoring rules with user-specified weights, and potentially scale-invariant aggregation following Bolin & Wallin (2023). The framework guarantees propriety of weighted sums.
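For example, an energy score and a variogram score can be combined with user-specified weights; by Corollary 1 the weighted sum remains proper. A Python sketch with hypothetical names:

```python
import numpy as np
from itertools import combinations

def energy_score(samples, y):
    """Sample-based energy score: E||X - y|| - 0.5 * E||X - X'||."""
    s = np.asarray(samples, dtype=float)
    t1 = np.mean(np.linalg.norm(s - y, axis=1))
    t2 = 0.5 * np.mean(np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1))
    return t1 - t2

def variogram_score(samples, y, p=0.5):
    """Sample-based variogram score of order p."""
    s = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    return sum((abs(y[i] - y[j]) ** p - np.mean(np.abs(s[:, i] - s[:, j]) ** p)) ** 2
               for i, j in combinations(range(y.size), 2))

def combined_score(samples, y, weights=(0.5, 0.5)):
    """Weighted sum of two proper rules; propriety is preserved."""
    return weights[0] * energy_score(samples, y) + weights[1] * variogram_score(samples, y)

rng = np.random.default_rng(5)
ens = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)
sc = combined_score(ens, np.array([0.2, -0.1]))
```

The two components target different forecast aspects (overall location vs dependence structure), which is exactly the complementarity argument made above.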

5. Documentation / vignette

A vignette showing how to use the framework in practice with epi data, demonstrating:

  • Marginal CRPS per location + aggregation (handles scale differences)
  • Threshold exceedance scoring (decision-relevant)
  • Energy score + variogram score together (complementary views)
  • Log/sqrt transforms before multivariate scoring (variance stabilisation)
  • How these relate to the Allen et al. framework and why combining complementary scores is better than relying on any single rule

Scope

This is exploratory. Some of these may already be achievable with existing tools (e.g. threshold transforms via transform_forecasts() + as_forecast_binary()). The main value may be in documentation and a few convenience helpers rather than new infrastructure.

Related issues

References

  • Allen, S., Ginsbourger, D. and Ziegel, J. (2023). Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3), 906–940. https://epubs.siam.org/doi/full/10.1137/22M1532184
  • Pic, R., Dombry, C., Naveau, P. and Taillardat, M. (2025). Proper scoring rules for multivariate probabilistic forecasts based on aggregation and transformation. Advances in Statistical Climatology, Meteorology and Oceanography, 11(1), 23–58. doi:10.5194/ascmo-11-23-2025
  • Bolin, D. and Wallin, J. (2023). Local scale invariance and robustness of proper scoring rules. Statistical Science, 38(1). https://arxiv.org/abs/1912.05642
  • Ziel, F. and Berk, K. (2019). Multivariate forecasting evaluation: on sensitive and strictly proper scoring rules. https://arxiv.org/abs/1910.07325
  • Scheuerer, M. and Hamill, T. M. (2015). Variogram-based proper scoring rules for probabilistic forecasts of multivariate quantities. Monthly Weather Review, 143(4), 1321–1334.

This was opened by a bot. Please ping @seabbs for any questions.

Metadata

Labels: enhancement (New feature or request)