ENG-3243 Add hosted eval export command #647
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dac3210220
    if split is not None and split != 1:
        console.print("[red]Error:[/red] split exports are not available for this run")
Reject unsupported split selection consistently
The --split flag is documented as selecting a specific env config set, but this guard only errors for values other than 1, so --split 1 is silently accepted and then ignored during export. In runs that actually contain multiple config sets, users will think they exported a subset while the command writes all rollouts, which can pollute downstream training/eval datasets. Until split-aware filtering is implemented, any non-None split value should fail explicitly.
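The stricter guard suggested above could look roughly like the following. This is a minimal sketch, not the PR's code: `check_split` is a hypothetical helper name, and raising `ValueError` stands in for whatever error path the CLI actually uses (the diff reports errors through `console.print`).

```python
from typing import Optional


def check_split(split: Optional[int]) -> None:
    """Reject any explicit split until split-aware filtering exists.

    Hypothetical helper for illustration; not part of the PR.
    """
    # Fail for every non-None value, including 1, so that a split is
    # never silently accepted and then ignored during export.
    if split is not None:
        raise ValueError("split exports are not available for this run")
```

With this shape, omitting `--split` still exports everything, while `--split 1` fails in the same way as any other split value.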
Adds `prime eval export` for verifiers JSONL and Inspect `.eval` exports with reward filtering and docs.

Note

Medium Risk

Adds a new `prime eval export` command that fetches and serializes hosted eval samples to disk, including pagination and filtering; mistakes could lead to incomplete/incorrect exports or large-memory runs when exporting big evaluations.

Overview

Adds `prime eval export <run-id>` to download all hosted evaluation samples, filter them (failed rollouts and reward thresholds), and write them out as either verifiers JSONL or a zipped Inspect `.eval` (with `log.json`).

Includes new `eval_export` utilities for normalizing sample/message shapes and deriving metadata, plus CLI logic to resolve a run ID to an evaluation ID, prevent exporting active runs, and page through samples. Updates the README with usage examples and adds tests covering verifiers row shape, Inspect output creation, and reward filtering.

Reviewed by Cursor Bugbot for commit dac3210. Bugbot is set up for automated code reviews on this repo.
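The page-filter-write flow described in the overview can be sketched as below. Everything here is an assumption for illustration rather than the PR's implementation: the `fetch_page(offset, limit)` callable, the `reward` and `error` field names, and all function names are hypothetical. The sketch streams samples through a generator, which is one way to limit the large-memory risk noted above.

```python
import json
from typing import Callable, Iterable, Iterator, Optional, TextIO

Sample = dict
# Hypothetical page fetcher: (offset, limit) -> list of sample dicts.
FetchPage = Callable[[int, int], list]


def iter_samples(fetch_page: FetchPage, page_size: int = 100) -> Iterator[Sample]:
    """Yield samples page by page until the source is exhausted."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # a short page means there is nothing left to fetch
        offset += len(page)


def keep_sample(
    sample: Sample,
    min_reward: Optional[float] = None,
    include_failed: bool = False,
) -> bool:
    """Drop failed rollouts and samples below the reward threshold."""
    # Assumed convention: a truthy "error" field marks a failed rollout.
    if sample.get("error") and not include_failed:
        return False
    reward = sample.get("reward")
    if min_reward is not None and (reward is None or reward < min_reward):
        return False
    return True


def write_jsonl(samples: Iterable[Sample], out: TextIO) -> int:
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    for sample in samples:
        out.write(json.dumps(sample) + "\n")
        count += 1
    return count
```

Composing the three pieces as `write_jsonl((s for s in iter_samples(fetch_page) if keep_sample(s, min_reward=0.5)), fh)` keeps only one page in memory at a time.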