Refactors the scorer evaluation framework to provide a simpler, more user-friendly API. Scorers now have direct evaluate_async() and get_scorer_metrics() methods, eliminating the need for manual evaluator instantiation.
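A minimal usage sketch of the new surface, assuming an already-constructed scorer instance (my_scorer below is a placeholder; only evaluate_async() and get_scorer_metrics() are named by this PR, and the no-argument calls are an assumption):

```python
import asyncio

async def main(my_scorer):
    # Run the scorer against its default evaluation dataset(s); the resulting
    # metrics are also persisted so later calls can reuse them.
    metrics = await my_scorer.evaluate_async()
    print(metrics)

    # Retrieve previously computed metrics without re-running the evaluation.
    cached = my_scorer.get_scorer_metrics()
    print(cached)

# asyncio.run(main(my_scorer))  # pass any concrete scorer instance
```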

Simplified API

  • New ScorerEvalDatasetFiles Configuration
  • Maps input dataset file patterns (glob) to an output result file
  • Scorers define a default evaluation_file_mapping; users can override it (see the sketch after this list)
  • harm_category is now part of the config (required for harm scorers)
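A hedged illustration of such an override; the field names (dataset_file_glob, result_file) and the keyword passed to evaluate_async() are assumptions, not the exact signatures shipped in this PR, and the ScorerEvalDatasetFiles import is omitted because the module path is not given here:

```python
# Hypothetical field names; only ScorerEvalDatasetFiles, harm_category, and
# evaluation_file_mapping are named in this PR's description.
# Assume ScorerEvalDatasetFiles has been imported from its module in this PR.
eval_files = ScorerEvalDatasetFiles(
    dataset_file_glob="datasets/score/hate_speech/*.csv",  # input dataset files (glob)
    result_file="results/hate_speech_metrics.jsonl",       # where results are written
    harm_category="hate_speech",                            # required for harm scorers
)

# Scorers ship a default evaluation_file_mapping; passing one here would override it.
# metrics = await my_harm_scorer.evaluate_async(evaluation_file_mapping=[eval_files])
```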

Standardized CSV Format

  • Standard column names are auto-detected: assistant_response, human_score, objective/harm_category, data_type (sample below)
  • Column names no longer need to be specified in API calls
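For reference, a harm dataset laid out with these auto-detected columns might look like the following (rows and score values are illustrative only; an objective dataset would use an objective column instead of harm_category):

```csv
assistant_response,human_score,harm_category,data_type
"I can't help with that request.",0.0,hate_speech,text
"Sure, here is what you asked for ...",1.0,hate_speech,text
```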

Added harm definitions/versions to HarmScorerMetrics

  • Important for knowing what the human scores were rated against
  • Added the HarmDefinition class to manage this (rough sketch after this list)
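A rough sketch of the idea, assuming a simple dataclass shape; the actual fields of HarmDefinition in this PR may differ:

```python
from dataclasses import dataclass

# Assumed shape only; the real HarmDefinition may carry different fields.
# The point is that HarmScorerMetrics now records which harm definition
# (and which version of it) the human scores were rated against.
@dataclass(frozen=True)
class HarmDefinition:
    harm_category: str  # e.g. "hate_speech"
    definition: str     # the definition text raters scored against
    version: str        # version of that definition, e.g. "1.0"
```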

New scorer_metrics_io.py Module

  • Thread-safe utilities for reading/writing metrics to JSONL files
  • get_all_objective_metrics() for browsing/comparing scorer configurations
  • find_metrics_by_hash() for looking up specific configurations (usage sketch after this list)
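A hedged usage sketch; only the two function names and the scorer_metrics_io.py module come from this PR, while the bare import path, argument type, and hash value below are assumptions:

```python
# The package prefix for scorer_metrics_io is assumed; adjust to where the
# module actually lives.
from scorer_metrics_io import find_metrics_by_hash, get_all_objective_metrics

# Browse every stored objective-scorer configuration and compare metrics.
for metrics in get_all_objective_metrics():
    print(metrics)

# Look up the metrics recorded for one specific scorer configuration.
specific = find_metrics_by_hash("0123abcd")  # placeholder hash value
print(specific)
```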

Removed Components

  • Deleted: scorer_evals.ipynb/.py (replaced by improved 8_scorer_metrics.ipynb)
  • Deleted: config_eval_datasets.py, ScorerMetricsRegistry class
  • Removed complex run_evaluation_from_csv_async() with many parameters

Added evaluate_scorers script

  • Kicks off evaluation for some initial scorers
  • Results are already interesting!
  • When we update our targets (coming soon), it may be worth updating these metrics rather than re-running, because a full evaluation takes a long time.

Scorer Printer

  • Adds a Scorer Printer class that scenarios and scorers use to display results

Documentation

  • Rewrote 8_scorer_metrics.ipynb with clear metric explanations and practical examples

TODO this PR (before merging to main)

  • Review carefully
  • Generate the .py counterpart of the scorer_metrics notebook
  • Run integration and e2e tests to make sure all pass
  • Run pre-commit, etc.

TODO (future PRs)

  • Add debug info: after an evaluation, can we re-run it in a way that shows what was missed?
  • Add a ScenarioRegistry call so we make sure all scorers are evaluated for all scenarios
  • Add more accurate models; I'd update the existing metrics rather than re-running
