
A0 benchmarking framework#1063

Open
TerminallyLazy wants to merge 2 commits into agent0ai:development from TerminallyLazy:A0-tasks

Conversation

@TerminallyLazy
Contributor

Summary

  • Adds a complete benchmarking framework ("Project B") that orchestrates subordinate agents through configurable tasks, tracks LLM usage metrics, and scores results via per-task evaluation scripts
  • Ships with 4 evaluation tasks spanning math, web APIs, file I/O, and git/package management — plus weighted task set definitions for running them as suites
  • Includes 9 extension hooks for metrics tracking and memory isolation to ensure reproducible, uncontaminated benchmark runs

Architecture

The framework follows an orchestrator-subordinate pattern:

  1. Agent 0 receives a user request and calls the start_task tool exclusively
  2. start_task creates an isolated run/<N>/ working directory, copies task assets, and spawns a subordinate agent that shares the same context
  3. A code_execution_tool override forces the subordinate's CWD into the run directory, sandboxing all file I/O
  4. 9 extension hooks on the subordinate agent:
    - 4 metric extensions track LLM calls, input tokens, output tokens, and reasoning tokens in real-time via delta-cursor accumulation on the shared benchmark_project_state
    - 5 memory extensions suppress all recall and memorize operations on the subordinate to prevent cross-run contamination and memory pollution
  5. After the subordinate completes, start_task dynamically loads and runs the task's evaluate.py to produce a scored result
  6. Results are saved as structured JSON to results/task-<N>.json or results/set-<N>.json with checkpoint saves for fault tolerance

Evaluation Tasks

| Task         | Difficulty | Multiplier | Tests                                                               |
|--------------|------------|------------|---------------------------------------------------------------------|
| simple_math  | Easy       | 1.0x       | Arithmetic + file write; partial credit by proximity                |
| eth_balance  | Medium     | 1.5x       | Web API calls, JSON structure, address matching, balance validation |
| stock_prices | Medium     | 1.5x       | Data retrieval, CSV formatting, date/symbol coverage                |
| git_project  | Hard       | 2.0x       | Git clone, dependency install, CLI execution, output parsing        |

Each task defines instructions.md (given to the subordinate) and evaluate.py (a scoring function with a scorecard breakdown). Tasks may optionally include initialize.py (pre-run setup) and an assets/ directory (files copied into the run directory).

Key Components

start_task.py (460 lines) — The core orchestration tool:

  • Resolves tasks from single name, array, or JSON task set file
  • Builds model config with None-safe fallback merging (_kwarg_or)
  • Supports both sync and async task scripts via importlib dynamic loading
  • Formats markdown result reports with per-task scores, weighted averages, and token usage stats

Task sets (task_sets/):

  • basic.json — Single simple_math task for quick iteration
  • full.json — All 4 tasks with difficulty-weighted multipliers (max weighted score = 600)
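A task set like full.json might look as follows; the exact field names are assumptions, but the weights match the table above (100 × (1.0 + 1.5 + 1.5 + 2.0) = 600 max weighted score).

```json
{
  "name": "full",
  "tasks": [
    { "task": "simple_math",  "multiplier": 1.0 },
    { "task": "eth_balance",  "multiplier": 1.5 },
    { "task": "stock_prices", "multiplier": 1.5 },
    { "task": "git_project",  "multiplier": 2.0 }
  ]
}
```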

Extensions (9 files across 7 hook points):

  • benchmark_stats.py — LLM call counter and timestamps
  • benchmark_tokens.py — Input token accumulation
  • benchmark_output_tokens.py — Output token delta tracking
  • benchmark_reasoning_tokens.py — Reasoning token delta tracking
  • memory_override.py — Sets _benchmark_skip_memory flag on subordinate init
  • _50_recall_memories.py, _91_recall_wait.py — Suppress memory recall
  • _50_memorize_fragments.py, _51_memorize_solutions.py — Suppress memory writes
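The delta-cursor accumulation used by the token-tracking extensions can be sketched like this. The helper name and state keys are illustrative, not the actual extension internals: the idea is that each hook sees a running total and folds only the increment into the shared `benchmark_project_state`.

```python
def accumulate_tokens(state: dict, current_total: int, key: str = "output_tokens") -> None:
    """Fold a monotonically increasing counter into shared state (sketch).

    The extension keeps a cursor (the last total it observed) and adds
    only the delta, so repeated hook invocations never double-count.
    """
    cursor_key = f"_{key}_cursor"
    last = state.get(cursor_key, 0)
    delta = max(0, current_total - last)
    state[key] = state.get(key, 0) + delta
    state[cursor_key] = current_total
```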

Design Decisions

  • Memory isolation over shared memory: Subordinate agents skip all memory operations during benchmarks. This prevents a subordinate from recalling answers from prior runs and avoids polluting long-term storage with benchmark artifacts.
  • Shared context, isolated filesystem: The subordinate shares Agent 0's context (so extensions can read benchmark_project_state), but its CWD is forced to an isolated run directory.
  • Scorecard evaluators: Each evaluator returns a detailed breakdown (not just pass/fail), enabling partial credit and diagnostic feedback. Scores are clamped to [0, 100] and support weighted multipliers for difficulty scaling.
  • Checkpoint saves: Task set results are written after each task completes, so partial results survive crashes.
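The clamping and difficulty weighting described above amount to something like the following; function names are illustrative, not the actual helpers in start_task.py.

```python
def clamp_score(raw) -> float:
    """Coerce a raw evaluator result into [0, 100], treating junk as 0 (sketch)."""
    try:
        return min(100.0, max(0.0, float(raw)))
    except (TypeError, ValueError):
        return 0.0

def weighted_summary(results: list) -> dict:
    """Aggregate [{'score': ..., 'multiplier': ...}, ...] into weighted totals."""
    weighted = sum(clamp_score(r["score"]) * r["multiplier"] for r in results)
    max_weighted = sum(100.0 * r["multiplier"] for r in results)
    return {"weighted_score": weighted, "max_weighted_score": max_weighted}
```

With the four tasks from the table (multipliers 1.0, 1.5, 1.5, 2.0), `max_weighted_score` comes out to 600, matching full.json.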

Files Changed

  • 25 files changed, 1308 insertions(+)

All files are new additions under usr/projects/project_b/.

- Introduced project.json for project configuration.
- Added extensions for memory override, benchmark token tracking, and stats collection.
- Implemented task evaluation scripts for simple math, stock prices, and ETH balance tasks.
- Created instructions and expected output formats for each task.
- Added a task set JSON for running multiple tasks with score multipliers.
- Added helper functions for safe score conversion and retrieving keyword arguments with defaults.
- Improved task name handling to support both single tasks and task sets.
- Streamlined model configuration building using the new helper functions.
- Introduced a method to scan directories for execution numbers, reducing code duplication.
- Enhanced parameter construction for task execution, making it more modular and maintainable.