Refactors the scorer evaluation framework to provide a simpler, more user-friendly API. Scorers now have direct evaluate_async() and get_scorer_metrics() methods, eliminating the need for manual evaluator instantiation.
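A minimal usage sketch of the new surface, assuming an already-constructed scorer instance (my_scorer below is a placeholder; only evaluate_async() and get_scorer_metrics() are named by this PR, and the no-argument calls are an assumption):

```python
import asyncio

async def main(my_scorer):
    # Run the scorer against its default evaluation dataset(s); the resulting
    # metrics are also persisted so later calls can reuse them.
    metrics = await my_scorer.evaluate_async()
    print(metrics)

    # Retrieve previously computed metrics without re-running the evaluation.
    cached = my_scorer.get_scorer_metrics()
    print(cached)

# asyncio.run(main(my_scorer))  # pass any concrete scorer instance
```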

Simplified API

  • New ScorerEvalDatasetFiles Configuration
  • Maps input dataset file patterns (glob) to an output result file
  • Scorers define a default evaluation_file_mapping; users can override it (see the sketch after this list)
  • harm_category is now part of the config (required for harm scorers)
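A hedged illustration of such an override; the field names (dataset_file_glob, result_file) and the keyword passed to evaluate_async() are assumptions, not the exact signatures shipped in this PR, and the ScorerEvalDatasetFiles import is omitted because the module path is not given here:

```python
# Hypothetical field names; only ScorerEvalDatasetFiles, harm_category, and
# evaluation_file_mapping are named in this PR's description.
# Assume ScorerEvalDatasetFiles has been imported from its module in this PR.
eval_files = ScorerEvalDatasetFiles(
    dataset_file_glob="datasets/score/hate_speech/*.csv",  # input dataset files (glob)
    result_file="results/hate_speech_metrics.jsonl",       # where results are written
    harm_category="hate_speech",                            # required for harm scorers
)

# Scorers ship a default evaluation_file_mapping; passing one here would override it.
# metrics = await my_harm_scorer.evaluate_async(evaluation_file_mapping=[eval_files])
```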

Standardized CSV Format

  • Standard column names are auto-detected: assistant_response, human_score, objective/harm_category, data_type (sample below)
  • Column names no longer need to be specified in API calls
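For reference, a harm dataset laid out with these auto-detected columns might look like the following (rows and score values are illustrative only; an objective dataset would use an objective column instead of harm_category):

```csv
assistant_response,human_score,harm_category,data_type
"I can't help with that request.",0.0,hate_speech,text
"Sure, here is what you asked for ...",1.0,hate_speech,text
```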

Added harm definitions/versions to HarmScorerMetrics

  • Important for knowing what the human scores were rated against
  • Added the HarmDefinition class to manage this (rough sketch after this list)
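A rough sketch of the idea, assuming a simple dataclass shape; the actual fields of HarmDefinition in this PR may differ:

```python
from dataclasses import dataclass

# Assumed shape only; the real HarmDefinition may carry different fields.
# The point is that HarmScorerMetrics now records which harm definition
# (and which version of it) the human scores were rated against.
@dataclass(frozen=True)
class HarmDefinition:
    harm_category: str  # e.g. "hate_speech"
    definition: str     # the definition text raters scored against
    version: str        # version of that definition, e.g. "1.0"
```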

New scorer_metrics_io.py Module

  • Thread-safe utilities for reading/writing metrics to JSONL files
  • get_all_objective_metrics() for browsing/comparing scorer configurations
  • find_metrics_by_hash() for looking up specific configurations (usage sketch after this list)
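A hedged usage sketch; only the two function names and the scorer_metrics_io.py module come from this PR, while the bare import path, argument type, and hash value below are assumptions:

```python
# The package prefix for scorer_metrics_io is assumed; adjust to where the
# module actually lives.
from scorer_metrics_io import find_metrics_by_hash, get_all_objective_metrics

# Browse every stored objective-scorer configuration and compare metrics.
for metrics in get_all_objective_metrics():
    print(metrics)

# Look up the metrics recorded for one specific scorer configuration.
specific = find_metrics_by_hash("0123abcd")  # placeholder hash value
print(specific)
```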

Removed Components

  • Deleted: scorer_evals.ipynb/.py (replaced by improved 8_scorer_metrics.ipynb)
  • Deleted: config_eval_datasets.py, ScorerMetricsRegistry class
  • Removed complex run_evaluation_from_csv_async() with many parameters

Added evaluate_scorers script

  • Kicks off evaluation for some initial scorers
  • Results are already interesting!
  • When we update our targets (coming soon), it may be worth updating these metrics rather than re-running, because a full evaluation takes a long time.

Scorer Printer

  • Adds a Scorer Printer class that scenarios and scorers use to display results

Documentation

  • Rewrote 8_scorer_metrics.ipynb with clear metric explanations and practical examples

TODO this PR (before merging to main)

  • Review carefully
  • Generate the .py counterpart of the scorer_metrics notebook
  • Run integration and e2e tests to make sure all pass
  • Run pre-commit, etc.

TODO (future PRs)

  • Add debug info: after an evaluation, can we re-run it in a way that shows what was missed?
  • Add a ScenarioRegistry call so we make sure all scorers are evaluated for all scenarios
  • Add more accurate models; I'd update the existing metrics rather than re-running
