ENG-3243 Add hosted eval export command #647

Open: d42me wants to merge 1 commit into main from feature/add-export-for-hosted

@d42me (Contributor) commented May 13, 2026

Adds prime eval export for verifiers JSONL and Inspect .eval exports with reward filtering and docs.


Note

Medium Risk
Adds a new prime eval export command that fetches and serializes hosted eval samples to disk, including pagination and filtering; mistakes could lead to incomplete/incorrect exports or large-memory runs when exporting big evaluations.

Overview
Adds prime eval export <run-id> to download all hosted evaluation samples, filter them (failed rollouts and reward thresholds), and write them out as either verifiers JSONL or a zipped Inspect .eval (with log.json).
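The filter-then-write step described above could look roughly like the sketch below. This is not the code in this PR: `keep_sample`, `write_jsonl`, and the `reward` / `error` sample keys are hypothetical names chosen for illustration, and the real verifiers row shape may differ.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Optional


def keep_sample(
    sample: dict[str, Any],
    min_reward: Optional[float] = None,
    max_reward: Optional[float] = None,
    include_failed: bool = False,
) -> bool:
    """Return True if a sample passes the (hypothetical) export filters."""
    # Assumption: a failed rollout carries a truthy "error" field.
    if not include_failed and sample.get("error"):
        return False
    reward = sample.get("reward")
    # A sample without a reward cannot satisfy a reward threshold.
    if min_reward is not None and (reward is None or reward < min_reward):
        return False
    if max_reward is not None and (reward is None or reward > max_reward):
        return False
    return True


def write_jsonl(
    samples: Iterable[dict[str, Any]],
    path: str,
    min_reward: Optional[float] = None,
    max_reward: Optional[float] = None,
    include_failed: bool = False,
) -> int:
    """Write the samples that pass the filters as one JSON object per line."""
    written = 0
    with Path(path).open("w", encoding="utf-8") as fh:
        for sample in samples:
            if keep_sample(sample, min_reward, max_reward, include_failed):
                fh.write(json.dumps(sample, ensure_ascii=False) + "\n")
                written += 1
    return written
```

Writing row by row keeps only one sample serialized at a time, which matters for the large-evaluation case called out in the risk note.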

Includes new eval_export utilities for normalizing sample/message shapes and deriving metadata, plus CLI logic to resolve a run ID to an evaluation ID, prevent exporting active runs, and page through samples. Updates the README with usage examples and adds tests covering verifiers row shape, Inspect output creation, and reward filtering.
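For reviewers who want a mental model of the "page through samples" step, here is a minimal sketch. It assumes a hypothetical `fetch_page(page, page_size)` callable that returns a dict with a `samples` list; the actual client call and response shape in this PR are not shown here and may differ.

```python
from typing import Any, Callable, Iterator

# Assumed response shape: {"samples": [...]} for each page.
Page = dict[str, Any]


def iter_samples(
    fetch_page: Callable[[int, int], Page],
    page_size: int = 100,
) -> Iterator[dict[str, Any]]:
    """Yield samples page by page until the source is exhausted."""
    page = 1
    while True:
        payload = fetch_page(page, page_size)
        samples = payload.get("samples", [])
        if not samples:
            return
        yield from samples
        # A short page means there is nothing left to fetch.
        if len(samples) < page_size:
            return
        page += 1
```

A generator like this lets the writer consume samples as they arrive instead of collecting the whole evaluation in a list first.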

Reviewed by Cursor Bugbot for commit dac3210. Bugbot is set up for automated code reviews on this repo.

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dac3210220

Comment on lines +1035 to +1036
if split is not None and split != 1:
    console.print("[red]Error:[/red] split exports are not available for this run")
P2: Reject unsupported split selection consistently

The --split flag is documented as selecting a specific env config set, but this guard only errors for values other than 1, so --split 1 is silently accepted and then ignored during export. In runs that actually contain multiple config sets, users will think they exported a subset while the command writes all rollouts, which can pollute downstream training/eval datasets. Until split-aware filtering is implemented, any non-None split value should fail explicitly.
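One way to express the stricter guard suggested here, as a sketch only: `validate_split` and `UnsupportedSplitError` are hypothetical names, and the real command would presumably print through `console` and exit the CLI rather than raise a plain exception.

```python
from typing import Optional


class UnsupportedSplitError(ValueError):
    """Raised when --split is passed before split-aware filtering exists."""


def validate_split(split: Optional[int]) -> None:
    """Reject every explicit split value, including 1."""
    # The original guard let split == 1 through and then ignored it;
    # failing on any non-None value avoids a silent full export.
    if split is not None:
        raise UnsupportedSplitError(
            "split exports are not available for this run"
        )
```

With this shape, omitting `--split` still exports everything, while `--split 1` fails the same way as `--split 2`.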
