A0 benchmarking framework #1063
Open
TerminallyLazy wants to merge 2 commits into agent0ai:development
- Introduced project.json for project configuration.
- Added extensions for memory override, benchmark token tracking, and stats collection.
- Implemented task evaluation scripts for simple math, stock prices, and ETH balance tasks.
- Created instructions and expected output formats for each task.
- Added a task set JSON for running multiple tasks with score multipliers.

- Added helper functions for safe score conversion and retrieving keyword arguments with defaults.
- Improved task name handling to support both single tasks and task sets.
- Streamlined model configuration building using the new helper functions.
- Introduced a method to scan directories for execution numbers, reducing code duplication.
- Enhanced parameter construction for task execution, making it more modular and maintainable.
Summary
Architecture
The framework follows an orchestrator-subordinate pattern:
- `start_task` creates an isolated `run/<N>/` working directory, copies task assets, and spawns a subordinate agent that shares the same context
- `code_execution_tool` override forces the subordinate's CWD into the run directory, sandboxing all file I/O
- 4 metric extensions track LLM calls, input tokens, output tokens, and reasoning tokens in real time via delta-cursor accumulation on the shared `benchmark_project_state`
- 5 memory extensions suppress all recall and memorize operations on the subordinate to prevent cross-run contamination and memory pollution
- `start_task` dynamically loads and runs the task's `evaluate.py` to produce a scored result
- `results/task-<N>.json` or `results/set-<N>.json` with checkpoint saves for fault tolerance

Evaluation Tasks
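The architecture list above ends with `start_task` running a task's `evaluate.py` to produce a scored result. The PR text does not show what such a file looks like, so the following is only a minimal sketch under stated assumptions: the function name `evaluate`, its `run_dir` argument, the `answer.txt` file name, the expected value, and the shape of the returned dictionary are all hypothetical.

```python
from pathlib import Path

# Hypothetical expected value for a simple math task; not taken from the PR.
EXPECTED_ANSWER = "42"


def evaluate(run_dir: str) -> dict:
    """Score one run by inspecting files the subordinate left in run_dir."""
    answer_file = Path(run_dir) / "answer.txt"  # assumed output file name
    scorecard = []

    # Criterion 1: the subordinate wrote an answer file at all.
    exists = answer_file.is_file()
    scorecard.append(
        {"criterion": "answer file written", "points": 40 if exists else 0, "max": 40}
    )

    # Criterion 2: the file content matches the expected value.
    correct = exists and answer_file.read_text().strip() == EXPECTED_ANSWER
    scorecard.append(
        {"criterion": "answer correct", "points": 60 if correct else 0, "max": 60}
    )

    total = sum(item["points"] for item in scorecard)
    return {"score": total, "scorecard": scorecard}
```

Splitting the score into named criteria is one way to get the "scorecard breakdown" the description mentions: a partially correct run still shows which checks passed.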
Each task defines `instructions.md` (given to the subordinate) and `evaluate.py` (a scoring function with a scorecard breakdown). Tasks may optionally include `initialize.py` (pre-run setup) and an `assets/` directory (files copied into the run directory).

Key Components
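Since each task ships its own `evaluate.py`, the orchestrator has to import that file by path at run time. The sketch below shows one common way to do this with the standard library's `importlib.util`. It is not the PR's actual code: the helper name `load_evaluator`, the module-name prefix, and the assumption that the file exposes a callable named `evaluate` are all hypothetical.

```python
import importlib.util
from pathlib import Path


def load_evaluator(task_dir: str):
    """Import <task_dir>/evaluate.py by file path and return its evaluate()."""
    path = Path(task_dir) / "evaluate.py"
    # Hypothetical naming scheme: one module name per task directory.
    module_name = f"benchmark_eval_{Path(task_dir).name}"

    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load evaluator from {path}")

    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the file's top-level code

    # Assumed contract: the file defines a callable named `evaluate`.
    if not callable(getattr(module, "evaluate", None)):
        raise AttributeError(f"{path} does not define evaluate()")
    return module.evaluate
```

Loading by path rather than by package name keeps task folders self-contained: a new task can be added without touching `sys.path` or any package `__init__.py`.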
`start_task.py` (460 lines), the core orchestration tool:
- […] `_kwarg_or`)
- `importlib` dynamic loading

Task sets (`task_sets/`):
- `basic.json`: single `simple_math` task for quick iteration
- `full.json`: all 4 tasks with difficulty-weighted multipliers (max weighted score = 600)

Extensions (9 files across 7 hook points):
- `benchmark_stats.py`: LLM call counter and timestamps
- `benchmark_tokens.py`: input token accumulation
- `benchmark_output_tokens.py`: output token delta tracking
- `benchmark_reasoning_tokens.py`: reasoning token delta tracking
- `memory_override.py`: sets the `_benchmark_skip_memory` flag on subordinate init
- `_50_recall_memories.py`, `_91_recall_wait.py`: suppress memory recall
- `_50_memorize_fragments.py`, `_51_memorize_solutions.py`: suppress memory writes

Design Decisions
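The token extensions listed above are described as using delta tracking against a shared state. The PR text does not show the mechanism, so this is one plausible reading sketched under assumptions: the function name `accumulate_delta`, the `_cursors` key, and the reset rule are hypothetical. The idea is that a hook may report a cumulative count many times, so only the growth since the previous report is added to the running total.

```python
def accumulate_delta(state: dict, key: str, observed_total: int) -> int:
    """Add only the growth since the last observation to state[key]."""
    cursors = state.setdefault("_cursors", {})  # last value seen per metric
    last_seen = cursors.get(key, 0)

    if observed_total >= last_seen:
        delta = observed_total - last_seen
    else:
        # Assumed reset rule: a smaller value means a new call started from zero.
        delta = observed_total

    state[key] = state.get(key, 0) + delta
    cursors[key] = observed_total
    return delta


# Usage: repeated reports of a growing counter are not double counted.
shared_state = {}
accumulate_delta(shared_state, "output_tokens", 10)
accumulate_delta(shared_state, "output_tokens", 25)  # adds 15, not 25
accumulate_delta(shared_state, "output_tokens", 5)   # counter reset, adds 5
```

Keeping the cursor next to the total in the same dictionary means several extensions can share one state object without coordinating beyond the key names.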
- […] `benchmark_project_state`) but its CWD is forced to an isolated run directory.
- Scores are clamped to `[0, 100]` and support weighted multipliers for difficulty scaling.

Files Changed
All files are new additions under `usr/projects/project_b/`.
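The description states that scores are clamped to `[0, 100]`, that task sets apply multipliers, and that `full.json` has a maximum weighted score of 600 across 4 tasks. A minimal sketch of that arithmetic follows. The function names, the task names, and the individual multipliers are hypothetical; the multipliers are chosen only so that they sum to 6 and reproduce the stated maximum.

```python
import math


def safe_score(value) -> float:
    """Convert an evaluator's raw score to a float clamped to [0, 100]."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        return 0.0  # missing or non-numeric scores count as zero
    if math.isnan(score):
        return 0.0
    return max(0.0, min(100.0, score))


def weighted_total(raw_scores: dict, multipliers: dict) -> float:
    """Sum clamped scores, each scaled by its task's multiplier."""
    return sum(
        safe_score(raw_scores.get(task)) * mult
        for task, mult in multipliers.items()
    )


# Hypothetical multipliers for four tasks; they sum to 6, so the maximum is 600.
MULTIPLIERS = {"task_a": 1, "task_b": 1, "task_c": 2, "task_d": 2}
perfect_run = {name: 100 for name in MULTIPLIERS}
max_weighted = weighted_total(perfect_run, MULTIPLIERS)
```

Clamping before weighting means a misbehaving evaluator that returns, say, 150 or a non-numeric value cannot push a task above its intended share of the total.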