daphne-cornelisse commented Jan 14, 2026

Description

We observed a discrepancy between the WOSAC meta-scores from the original implementation and those produced by PufferDrive. This PR addresses that discrepancy by fixing the bug below.

Bugs fixed

  • Order of operations: We averaged log-likelihood metrics before exponentiating; WOSAC exponentiates first and then averages.
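
For illustration, here is a minimal sketch of why the order matters. It is not the PufferDrive or WOSAC code: the function names and the log-likelihood values are made up for this example. Exponentiating first gives the arithmetic mean of the likelihoods, while averaging first gives their geometric mean, which can never be larger.

```python
import math
from statistics import fmean


def exp_then_mean(log_likelihoods):
    """Exponentiate each log-likelihood, then average (the order described for WOSAC)."""
    return fmean(math.exp(ll) for ll in log_likelihoods)


def mean_then_exp(log_likelihoods):
    """Average the log-likelihoods, then exponentiate (the previous, buggy order)."""
    return math.exp(fmean(log_likelihoods))


# Hypothetical per-sample log-likelihoods, chosen only to show the gap.
log_likelihoods = [-0.1, -2.0, -5.0]

correct = exp_then_mean(log_likelihoods)  # arithmetic mean of likelihoods
buggy = mean_then_exp(log_likelihoods)    # geometric mean of likelihoods

print(f"exp then mean: {correct:.4f}")
print(f"mean then exp: {buggy:.4f}")
```

The two orders agree only when all log-likelihoods are equal; otherwise averaging before exponentiating yields a lower score.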

To do

  • There is still a discrepancy of 0.03 between the SMART meta-score in PufferDrive and the one on the WOSAC leaderboard.
  • We need to evaluate a random agent on the leaderboard
  • Update the baselines (clean small dataset, larger one)

Tests run and results

TODO

daphne-cornelisse added the bug, benchmarking, and documentation labels Jan 14, 2026
