
Support aggregation-and-transformation scoring framework for multivariate forecasts #1120

@seabbs-bot

Description

Motivation

Pic et al. (2025) formalise a framework for constructing interpretable multivariate proper scoring rules by combining transformations with univariate scores and aggregating the results.

The framework (Corollary 1)

Let T = {T_i} be a set of transformations from R^d to R^k. For each i, define the transformed score S_{T_i}(F, y) = S(T_i(F), T_i(y)), where S is a scoring rule that is proper relative to T_i(F). Let w = {w_i} be non-negative weights. Then:

S_{S_T, w}(F, y) = Σ_i  w_i * S_{T_i}(F, y)

is proper relative to F. Strict propriety holds as soon as there exists some i with w_i > 0 such that S is strictly proper relative to T_i(F) and T_i is injective.

More generally, aggregations can take the form of any (strictly) isotonic transformation (not just weighted sums), such as a multiplicative structure for positive scoring rules (Ziel and Berk, 2019).
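As a concrete illustration of the weighted-sum construction, the sketch below combines a sample-based CRPS under two transforms (identity and log). This is a Python numerical sketch with hypothetical function names, not scoringutils API:

```python
import numpy as np

def crps_sample(samples, y):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|."""
    s = np.asarray(samples, dtype=float)
    return np.mean(np.abs(s - y)) - 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))

def aggregated_score(samples, y, transforms, weights):
    """Weighted sum of a proper score applied under several transforms."""
    return sum(w * crps_sample(t(samples), t(y))
               for t, w in zip(transforms, weights))

rng = np.random.default_rng(1)
x = rng.lognormal(mean=3.0, sigma=0.5, size=1000)  # positive forecast samples
y = 25.0
# Identity transform targets the raw scale; log targets relative errors.
score = aggregated_score(x, y, transforms=[lambda v: v, np.log], weights=[0.5, 0.5])
```

Because each summand is itself proper and the weights are non-negative, the combined score inherits propriety.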

Key insight

Any kernel score (which encapsulates the Brier score, CRPS, energy score, and variogram score) can be expressed as an aggregation of squared errors between transformations of the forecast-observation pair (Appendix D of Allen et al.). This means existing multivariate scores are special cases:

  • The variogram score is a pairwise difference transform + squared error
  • The energy score decomposes into aggregated squared errors between transformed quantities
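The variogram case can be made explicit: below is a Python sketch (hypothetical names) that computes the variogram score directly as a weighted aggregation of squared errors between the pairwise-difference transform of the observation and the ensemble mean of the same transform.

```python
import numpy as np
from itertools import combinations

def variogram_score(samples, y, p=0.5, w=None):
    """Variogram score of order p for an ensemble (n_samples x d)."""
    s = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    score = 0.0
    for i, j in combinations(range(y.size), 2):
        wij = 1.0 if w is None else w[i, j]
        t_obs = abs(y[i] - y[j]) ** p                               # transform of observation
        t_fc = np.mean(np.abs(s[:, i] - s[:, j]) ** p)              # forecast mean of transform
        score += wij * (t_obs - t_fc) ** 2                          # aggregated squared error
    return score

rng = np.random.default_rng(0)
ens = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=500)
obs = np.array([0.1, -0.2, 0.3])
vs = variogram_score(ens, obs)
```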

Limitations of current approaches the framework addresses

  • The energy score is more sensitive to mean misspecification than variance or dependence structure, with deteriorating discriminatory power in higher dimensions
  • The variogram score cannot detect equal bias across all components
  • Using multiple complementary scoring rules targeting specific forecast aspects is better than relying on a single strictly proper rule

Connection to discussion #1069

In discussion #1069, Nick Reich asked about scoring relative rankings across locations. While looking into this we also found the MMD kernel score (mmds_sample) alongside the energy and variogram scores. We chose not to implement mmds_sample because it uses a fixed Gaussian kernel (σ = 1) that doesn't work well with epi count data on different scales. However, the aggregation-and-transformation framework offers a path to making kernel-based scoring more practical: by transforming data first (e.g. log, standardise, threshold), the scale issues that make the raw MMD score unusable could be addressed. If we build better transform support, revisiting the kernel score becomes more viable. That said, better transform support would also make it possible to construct such scores by composition, so a dedicated implementation may not be needed at all.
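To illustrate the scale problem concretely, here is a rough Python sketch of a sample-based Gaussian-kernel (MMD) score with σ = 1 (a hypothetical `mmd_score`, not the `mmds_sample` implementation discussed above). On raw counts in the thousands the kernel saturates, so the score is nearly flat; a log transform brings forecast-observation distances back into the kernel's sensitive range:

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def mmd_score(samples, y, sigma=1.0):
    """Sample-based Gaussian-kernel score: 0.5 * E k(X, X') - E k(X, y)."""
    kxx = np.mean([[gauss_kernel(a, b, sigma) for b in samples] for a in samples])
    kxy = np.mean([gauss_kernel(a, y, sigma) for a in samples])
    return 0.5 * kxx - kxy

counts = np.random.default_rng(2).poisson(5000, size=(200, 2)).astype(float)
obs = np.array([5100.0, 4900.0])
raw = mmd_score(counts, obs)                      # kernel saturates: distances >> sigma
logged = mmd_score(np.log(counts), np.log(obs))   # log transform restores sensitivity
```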

Related research on weight and transform selection

Weights

  • Bolin & Wallin (2023) show that naive aggregation (e.g. averaging CRPS across locations) gives more importance to observations with large uncertainty, producing unintuitive rankings. They propose the scaled CRPS (SCRPS) which is locally scale invariant. Directly relevant to epi count data where different locations have very different case counts.
  • For variogram score weights, Allen et al. note that weights proportional to inverse distance between locations can increase the signal-to-noise ratio (connecting to Matheron's geostatistics framework).
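For illustration, an inverse-distance weight matrix for a variogram score over three locations might be built like this (Python sketch; the coordinates are made up):

```python
import numpy as np

# Hypothetical spatial coordinates of three reporting locations.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Pairwise distances, then inverse-distance weights; the diagonal is unused
# by the variogram score and left at zero.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
w = np.zeros_like(d)
mask = d > 0
w[mask] = 1.0 / d[mask]
```

Nearby pairs, whose pairwise differences carry more dependence signal, receive proportionally larger weights.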

Transforms

  • Allen, Ginsbourger & Ziegel (2023, SIAM/ASA JUQ) develop weighted multivariate kernel scores using transformations to emphasise high-impact events. They show the threshold-weighted CRPS (twCRPS) is a kernel score, and extend this to multivariate settings. This directly addresses "which transforms matter" for decision-relevant evaluation, including compound events where several moderate values interact.
  • Transform selection remains application-specific and largely empirical — no automated method exists. Allen et al. (2025) recommend starting simple (marginals, means, binary thresholds) and increasing complexity gradually.

Open research questions

  • How to systematically choose transforms for a given application
  • How scale dependence in aggregated scores affects model rankings
  • Bridging spatial verification methods with proper scoring rule theory

What scoringutils already has

What could be added

1. Threshold exceedance transforms

A helper to convert continuous forecasts to binary exceedance forecasts at specified thresholds, enabling Brier score evaluation of event probabilities. For example, "did cases exceed 1000?" scored with the Brier score across locations jointly.

# Conceptual API
transform_threshold(forecast, thresholds = c(100, 500, 1000))
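Under the hood, such a transform reduces to computing exceedance probabilities from the forecast samples and scoring them against the binary outcome. A minimal Python sketch of that computation (function names hypothetical, mirroring the conceptual API above):

```python
import numpy as np

def transform_threshold(samples, thresholds):
    """Exceedance probabilities P(X > t) from forecast samples, one per threshold."""
    s = np.asarray(samples, dtype=float)
    return np.array([np.mean(s > t) for t in thresholds])

def brier_score(prob, outcome):
    """Brier score for a binary event: (p - 1{event})^2."""
    return (prob - outcome) ** 2

rng = np.random.default_rng(3)
cases = rng.negative_binomial(n=10, p=0.01, size=1000).astype(float)  # mean ~990
thresholds = [100, 500, 1000]
probs = transform_threshold(cases, thresholds)
obs = 1200.0
scores = [brier_score(p, float(obs > t)) for p, t in zip(probs, thresholds)]
```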

2. Pairwise difference transforms

A helper to compute |x_i - x_j|^p across components within multivariate groups, making the variogram-like transform reusable with different base scores (not just squared error).
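For instance (Python sketch, hypothetical name), the transform maps a d-vector to its d(d-1)/2 pairwise differences, which can then be fed to any univariate base score:

```python
import numpy as np
from itertools import combinations

def pairwise_diff_transform(x, p=1.0):
    """Map a d-vector to its d*(d-1)/2 pairwise |x_i - x_j|^p values."""
    x = np.asarray(x, dtype=float)
    return np.array([abs(x[i] - x[j]) ** p
                     for i, j in combinations(range(len(x)), 2)])

pairwise_diff_transform(np.array([3.0, 1.0, 6.0]))  # -> [2., 3., 5.]
```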

3. Marginal decomposition

A helper to extract and score individual components (marginals) from a multivariate forecast separately, then aggregate. This naturally handles different scales since each component is scored on its own scale.
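A sketch of the idea in Python (sample-based CRPS per component, equal weights by default; names are hypothetical):

```python
import numpy as np

def crps_sample(samples, y):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|."""
    s = np.asarray(samples, dtype=float)
    return np.mean(np.abs(s - y)) - 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))

def marginal_score(samples, y, weights=None):
    """Score each component on its own scale, then take a weighted sum."""
    d = len(y)
    weights = np.ones(d) / d if weights is None else np.asarray(weights, dtype=float)
    return sum(w * crps_sample(samples[:, k], y[k]) for k, w in enumerate(weights))

rng = np.random.default_rng(4)
# Two components on very different scales (e.g. a small and a large location).
ens = np.column_stack([rng.normal(10, 2, 400), rng.normal(1000, 200, 400)])
total = marginal_score(ens, np.array([11.0, 900.0]))
```

By Corollary 1 (with coordinate projections as the transforms), this weighted sum of marginal CRPS values is itself a proper multivariate score, though not a strictly proper one, since projections discard the dependence structure.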

4. Scaled/weighted score aggregation

Support for combining multiple scoring rules with user-specified weights, and potentially scale-invariant aggregation following Bolin & Wallin (2023). The framework guarantees propriety of weighted sums.
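For example, an energy score and a variogram score can be combined with user-specified weights; by Corollary 1 the weighted sum remains proper. A Python sketch with hypothetical names:

```python
import numpy as np
from itertools import combinations

def energy_score(samples, y):
    """Sample-based energy score: E||X - y|| - 0.5 * E||X - X'||."""
    s = np.asarray(samples, dtype=float)
    t1 = np.mean(np.linalg.norm(s - y, axis=1))
    t2 = 0.5 * np.mean(np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1))
    return t1 - t2

def variogram_score(samples, y, p=0.5):
    """Sample-based variogram score of order p."""
    s = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    return sum((abs(y[i] - y[j]) ** p - np.mean(np.abs(s[:, i] - s[:, j]) ** p)) ** 2
               for i, j in combinations(range(y.size), 2))

def combined_score(samples, y, weights=(0.5, 0.5)):
    """Weighted sum of two proper rules; propriety is preserved."""
    return weights[0] * energy_score(samples, y) + weights[1] * variogram_score(samples, y)

rng = np.random.default_rng(5)
ens = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)
sc = combined_score(ens, np.array([0.2, -0.1]))
```

The two components target different forecast aspects (overall location vs dependence structure), which is exactly the complementarity argument made above.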

5. Documentation / vignette

A vignette showing how to use the framework in practice with epi data, demonstrating:

  • Marginal CRPS per location + aggregation (handles scale differences)
  • Threshold exceedance scoring (decision-relevant)
  • Energy score + variogram score together (complementary views)
  • Log/sqrt transforms before multivariate scoring (variance stabilisation)
  • How these relate to the Allen et al. framework and why combining complementary scores is better than relying on any single rule

Scope

This is exploratory. Some of these may already be achievable with existing tools (e.g. threshold transforms via transform_forecasts() + as_forecast_binary()). The main value may be in documentation and a few convenience helpers rather than new infrastructure.

Related issues

References

  • Allen, S., Ginsbourger, D. and Ziegel, J. (2023). Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3), 906–940. https://epubs.siam.org/doi/full/10.1137/22M1532184
  • Pic, R., Dombry, C., Naveau, P. and Taillardat, M. (2025). Proper scoring rules for multivariate probabilistic forecasts based on aggregation and transformation. Advances in Statistical Climatology, Meteorology and Oceanography, 11(1), 23–58. doi:10.5194/ascmo-11-23-2025
  • Bolin, D. and Wallin, J. (2023). Local scale invariance and robustness of proper scoring rules. Statistical Science, 38(1). https://arxiv.org/abs/1912.05642
  • Ziel, F. and Berk, K. (2019). Multivariate forecasting evaluation: on sensitive and strictly proper scoring rules. https://arxiv.org/abs/1910.07325
  • Scheuerer, M. and Hamill, T. M. (2015). Variogram-based proper scoring rules for probabilistic forecasts of multivariate quantities. Monthly Weather Review, 143(4), 1321–1334.

This was opened by a bot. Please ping @seabbs for any questions.

Metadata

Labels: enhancement (New feature or request)