
A0 benchmarking framework#1063

Open
TerminallyLazy wants to merge 2 commits into agent0ai:development from TerminallyLazy:A0-tasks

Conversation

@TerminallyLazy
Contributor

Summary

  • Adds a complete benchmarking framework ("Project B") that orchestrates subordinate agents through configurable tasks, tracks LLM usage metrics, and scores results via per-task evaluation scripts
  • Ships with 4 evaluation tasks spanning math, web APIs, file I/O, and git/package management — plus weighted task set definitions for running them as suites
  • Includes 9 extension hooks for metrics tracking and memory isolation to ensure reproducible, uncontaminated benchmark runs

Architecture

The framework follows an orchestrator-subordinate pattern:

  1. Agent 0 receives a user request and calls the start_task tool exclusively
  2. start_task creates an isolated run/<N>/ working directory, copies task assets, and spawns a subordinate agent that shares the same context
  3. A code_execution_tool override forces the subordinate's CWD into the run directory, sandboxing all file I/O
  4. 9 extension hooks on the subordinate agent:
    - 4 metric extensions track LLM calls, input tokens, output tokens, and reasoning tokens in real-time via delta-cursor accumulation on the shared benchmark_project_state
    - 5 memory extensions suppress all recall and memorize operations on the subordinate to prevent cross-run contamination and memory pollution
  5. After the subordinate completes, start_task dynamically loads and runs the task's evaluate.py to produce a scored result
  6. Results are saved as structured JSON to results/task-<N>.json or results/set-<N>.json with checkpoint saves for fault tolerance

Evaluation Tasks

| Task         | Difficulty | Multiplier | Tests                                                               |
|--------------|------------|------------|---------------------------------------------------------------------|
| simple_math  | Easy       | 1.0x       | Arithmetic + file write; partial credit by proximity                |
| eth_balance  | Medium     | 1.5x       | Web API calls, JSON structure, address matching, balance validation |
| stock_prices | Medium     | 1.5x       | Data retrieval, CSV formatting, date/symbol coverage                |
| git_project  | Hard       | 2.0x       | Git clone, dependency install, CLI execution, output parsing        |

Each task defines instructions.md (given to the subordinate) and evaluate.py (a scoring function with a scorecard breakdown). Tasks may optionally include initialize.py (pre-run setup) and an assets/ directory (files copied into the run directory).

Key Components

start_task.py (460 lines) — The core orchestration tool:

  • Resolves tasks from single name, array, or JSON task set file
  • Builds model config with None-safe fallback merging (_kwarg_or)
  • Supports both sync and async task scripts via importlib dynamic loading
  • Formats markdown result reports with per-task scores, weighted averages, and token usage stats

Task sets (task_sets/):

  • basic.json — Single simple_math task for quick iteration
  • full.json — All 4 tasks with difficulty-weighted multipliers (max weighted score = 600)
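A task set like full.json might look as follows; the exact field names are assumptions, but the weights match the table above (100 × (1.0 + 1.5 + 1.5 + 2.0) = 600 max weighted score).

```json
{
  "name": "full",
  "tasks": [
    { "task": "simple_math",  "multiplier": 1.0 },
    { "task": "eth_balance",  "multiplier": 1.5 },
    { "task": "stock_prices", "multiplier": 1.5 },
    { "task": "git_project",  "multiplier": 2.0 }
  ]
}
```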

Extensions (9 files across 7 hook points):

  • benchmark_stats.py — LLM call counter and timestamps
  • benchmark_tokens.py — Input token accumulation
  • benchmark_output_tokens.py — Output token delta tracking
  • benchmark_reasoning_tokens.py — Reasoning token delta tracking
  • memory_override.py — Sets _benchmark_skip_memory flag on subordinate init
  • _50_recall_memories.py, _91_recall_wait.py — Suppress memory recall
  • _50_memorize_fragments.py, _51_memorize_solutions.py — Suppress memory writes
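The delta-cursor accumulation used by the token-tracking extensions can be sketched like this. The helper name and state keys are illustrative, not the actual extension internals: the idea is that each hook sees a running total and folds only the increment into the shared `benchmark_project_state`.

```python
def accumulate_tokens(state: dict, current_total: int, key: str = "output_tokens") -> None:
    """Fold a monotonically increasing counter into shared state (sketch).

    The extension keeps a cursor (the last total it observed) and adds
    only the delta, so repeated hook invocations never double-count.
    """
    cursor_key = f"_{key}_cursor"
    last = state.get(cursor_key, 0)
    delta = max(0, current_total - last)
    state[key] = state.get(key, 0) + delta
    state[cursor_key] = current_total
```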

Design Decisions

  • Memory isolation over shared memory: Subordinate agents skip all memory operations during benchmarks. This prevents a subordinate from recalling answers from prior runs and avoids polluting long-term storage with benchmark artifacts.
  • Shared context, isolated filesystem: The subordinate shares Agent 0's context (so extensions can read benchmark_project_state), but its CWD is forced to an isolated run directory.
  • Scorecard evaluators: Each evaluator returns a detailed breakdown (not just pass/fail), enabling partial credit and diagnostic feedback. Scores are clamped to [0, 100] and support weighted multipliers for difficulty scaling.
  • Checkpoint saves: Task set results are written after each task completes, so partial results survive crashes.
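The clamping and difficulty weighting described above amount to something like the following; function names are illustrative, not the actual helpers in start_task.py.

```python
def clamp_score(raw) -> float:
    """Coerce a raw evaluator result into [0, 100], treating junk as 0 (sketch)."""
    try:
        return min(100.0, max(0.0, float(raw)))
    except (TypeError, ValueError):
        return 0.0

def weighted_summary(results: list) -> dict:
    """Aggregate [{'score': ..., 'multiplier': ...}, ...] into weighted totals."""
    weighted = sum(clamp_score(r["score"]) * r["multiplier"] for r in results)
    max_weighted = sum(100.0 * r["multiplier"] for r in results)
    return {"weighted_score": weighted, "max_weighted_score": max_weighted}
```

With the four tasks from the table (multipliers 1.0, 1.5, 1.5, 2.0), `max_weighted_score` comes out to 600, matching full.json.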

Files Changed

  • 25 files changed, 1308 insertions(+)

All files are new additions under usr/projects/project_b/.

- Introduced project.json for project configuration.
- Added extensions for memory override, benchmark token tracking, and stats collection.
- Implemented task evaluation scripts for simple math, stock prices, and ETH balance tasks.
- Created instructions and expected output formats for each task.
- Added a task set JSON for running multiple tasks with score multipliers.
- Added helper functions for safe score conversion and retrieving keyword arguments with defaults.
- Improved task name handling to support both single tasks and task sets.
- Streamlined model configuration building using the new helper functions.
- Introduced a method to scan directories for execution numbers, reducing code duplication.
- Enhanced parameter construction for task execution, making it more modular and maintainable.