This guide documents the end-to-end workflow for measuring how agents improve from execution experience. Atlas captures execution traces with reward signals, distills successful patterns into playbook entries, and measures transfer learning across tasks.
The runtime automatically summarizes execution metadata to fit within model context limits. This prevents oversized prompts from failing with providers that have smaller context windows while preserving full telemetry on disk.
- The OpenAI-compatible adapter exposes a new `metadata_digest` block. Defaults trim large sections (reward audits, session trajectories, validation blobs) to a high-signal summary capped at roughly 10% of each provider's context window (≈20k characters for Anthropic).
- Each digest produced for an LLM includes `digest_stats` (budget used, omitted metadata keys, and any sections dropped to stay under budget). These diagnostics live in the system message payload for troubleshooting.
- Override defaults per workflow:
```yaml
agent:
  adapter:
    type: litellm
    metadata_digest:
      char_budget: 24000            # Optional hard cap for every provider
      provider_char_budgets:
        anthropic: 18000            # Override Claude/Sonnet to stay safely below 200k tokens
      max_plan_steps: 6             # Control how many plan steps appear in the digest
      max_learning_history_entries: 2
      include_session_keys: [source, execution_mode, token_usage, reward_stats]
```

- Set `enabled: false` to revert to the legacy behaviour (not recommended for Claude/Bison-sized windows).
- If the digest cannot fit under the configured budget after trimming optional sections, Atlas raises a descriptive error instead of attempting to send the oversized payload.
Gemini continues to receive the same or smaller prompts, while Anthropic and other providers now stay well within their context limits during benchmarking runs.
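The trimming behaviour can be pictured with a small sketch. This is an illustrative assumption, not the adapter's actual implementation: sections are dropped lowest-priority-first until the serialized digest fits the character budget, and the drops are recorded for the `digest_stats` diagnostics.

```python
import json

def build_digest(metadata: dict, char_budget: int, priority: list) -> tuple[dict, dict]:
    """Drop lowest-priority sections until the digest fits char_budget.

    Sketch only: section names and the priority ordering are assumptions.
    """
    digest = dict(metadata)
    dropped = []
    # `priority` lists sections from most to least important; drop from the end.
    for key in reversed(priority):
        if len(json.dumps(digest)) <= char_budget:
            break
        if key in digest:
            digest.pop(key)
            dropped.append(key)
    if len(json.dumps(digest)) > char_budget:
        raise ValueError("digest exceeds budget after trimming optional sections")
    stats = {"budget": char_budget, "dropped_sections": dropped}
    return digest, stats

metadata = {
    "plan": ["step"] * 3,
    "reward_audit": "x" * 500,        # large, low-priority blob
    "session_trajectory": "y" * 500,  # large, low-priority blob
}
digest, stats = build_digest(metadata, 200, ["plan", "reward_audit", "session_trajectory"])
```

The error branch mirrors the documented behaviour above: an oversized payload fails loudly rather than being sent to the provider.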
Learning updates now revolve around structured playbook entries. Each playbook entry captures:
- `cue` – regex/keyword trigger that can be machine-detected.
- `action` – imperative phrasing plus the runtime handle/tool mapping.
- `expected_effect` – why the action matters.
- `scope` – whether the playbook entry reinforces an existing behaviour or introduces differentiation, including any constraints.
- `provenance` – session id, teacher intervention digest, rubric scores, and lifecycle (`active`, `deprecated`, `rejected`).
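As a concrete illustration, an entry with these fields might look like the following. The values are invented for this example and do not come from a real Atlas session:

```python
import re

# Hypothetical playbook entry; field values are illustrative only.
entry = {
    "cue": r"(?i)timeout|connection refused",  # machine-detectable regex trigger
    "action": "Retry the query via logs.search with exponential backoff.",
    "expected_effect": "Avoids hard failures on transient network errors.",
    "scope": {"category": "reinforcement", "constraints": ["network tasks only"]},
    "provenance": {
        "session_id": "sess-001",
        "teacher_digest": "retry guidance after repeated timeouts",
        "rubric_scores": {"actionability": 0.9, "generality": 0.8},
        "lifecycle": "active",
    },
}

# The cue is machine-detectable: it can be applied directly to runtime output.
assert re.search(entry["cue"], "Connection refused by host") is not None
```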
Three rubric gates run on every synthesis:
- Actionability – the handle must map to a real tool and the imperative cannot be empty.
- Cue presence – cues must be machine-detectable (valid regex/keyword/predicate).
- Generality – no incident IDs/dates or overfit proper nouns; playbook entries must respect a length budget.
Scores for actionability, generality, hookability, and concision (weights: 0.4 / 0.3 / 0.2 / 0.1) are computed even when gates fail. If any gate fails, the entry is rejected and recorded for auditing.
The default weights prioritize machine-actionability and transfer potential:
- Actionability (0.4) – Can the agent execute this? Is the runtime handle valid and mapped to a real tool?
- Generality (0.3) – Will this transfer to new tasks? Are there hardcoded IDs, timestamps, or overfit patterns?
- Hookability (0.2) – Can we detect when this applies? Is the cue machine-readable (valid regex/keyword/predicate)?
- Concision (0.1) – Is the guidance succinct? Does it respect character limits without verbosity?
Tune these in `learning.rubric_weights` if your workload demands different priorities. For example, set `generality: 0.5` if cross-task transfer is most critical, or raise `concision` to 0.2 if prompt token budgets are tight.
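The scoring described above can be sketched in a few lines. This is a minimal illustration of the documented behaviour (scores computed even when gates fail, rejection on any gate failure); function and gate names are assumptions, not the actual Atlas API:

```python
# Default weights from the rubric section above.
WEIGHTS = {"actionability": 0.4, "generality": 0.3, "hookability": 0.2, "concision": 0.1}

def weighted_score(scores: dict) -> float:
    """Combine per-dimension rubric scores (0.0-1.0) into one weighted number."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

def evaluate(scores: dict, gates: dict) -> tuple[float, bool]:
    # Scores are computed even when gates fail; the entry is still rejected.
    score = weighted_score(scores)
    accepted = all(gates.values())
    return score, accepted

score, accepted = evaluate(
    {"actionability": 1.0, "generality": 0.5, "hookability": 1.0, "concision": 1.0},
    {"actionability": True, "cue": True, "generality": False},
)
# score ≈ 0.85, but the entry is rejected because the generality gate failed
```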
Atlas reads these rails from the existing `learning` block in your agent config (for example `configs/<project>.yaml`). If you omit the block, Atlas instantiates the default `LearningConfig`. To enable stricter constraints or adjust weights, add a section like:
```yaml
learning:
  enabled: true
  update_enabled: true
  schema:
    allowed_runtime_handles:
      - logs.search
      - data.query*
    cue_types: [regex, keyword]
    default_scope_category: reinforcement
  gates:
    enforce_actionability: true
    enforce_cue: true
    enforce_generality: true
    max_text_length: 420
    allowed_proper_nouns: [SQL, HTTP, JSON, Atlas]
  rubric_weights:
    actionability: 0.4
    generality: 0.3
    hookability: 0.2
    concision: 0.1
  usage_tracking:
    enabled: true
    capture_examples: true
    max_examples_per_entry: 3
```

- `schema` constrains what the LLM can emit (permitted runtime handles/prefixes, cue types, default scope category).
- `gates` toggles the rubric guards and tunes generalisation heuristics (length budget, banned tokens, allowlists).
- `rubric_weights` rebiases the weighted playbook entry score if you want concision or hookability to matter more or less.
- `usage_tracking` enables cue/adoption logging and limits how many example snippets are stored per playbook entry.
All other learning options (`llm`, `prompts`, `history_limit`, `session_note_enabled`, `apply_to_prompts`) behave as before. Once configured, every synthesis run honours these settings automatically.
- Discovery loop – `atlas env init` records discovery telemetry per task in `discovery_runs` (Postgres) and `.atlas/discover.json`. The `persist_discovery_run` helper mirrors the payload in Postgres for correlation.
- Runtime sessions – `atlas run` (or `atlas.core.run`) stores every session in Postgres with:
  - `sessions.metadata.learning_key` identifying the learning thread.
  - `sessions.metadata.adaptive_summary` detailing execution mode decisions.
  - `sessions.reward_stats`, `sessions.reward_audit`, and `trajectory_events` capturing reward and behavioural traces.
- Learning registry – `learning_registry` keeps the latest playbook per `learning_key` when the learning synthesizer is enabled.
Tip: Set `STORAGE__DATABASE_URL` before running Atlas so the runtime connects to Postgres automatically.
When you need portable traces for offline analysis or external training pipelines:
```bash
python -m atlas.cli.export \
  --database-url postgresql://atlas:atlas@localhost:5433/atlas \
  --output results/learning/sessions.jsonl \
  --limit 200 \
  --include-status pending --include-status approved \
  --trajectory-event-limit 400
```

Each JSONL record surfaces:

- `execution_mode` (top-level + `session_metadata.adaptive_summary`)
- `learning_key`, `reward_stats`, `reward_audit`, and `session_reward`
- `trajectory_events` with `event_type` and `actor`
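For offline analysis, records like these are easy to aggregate. A minimal sketch, assuming each line carries the top-level `execution_mode` and a numeric `session_reward` as described above:

```python
import json
from collections import defaultdict

def reward_by_mode(jsonl_path: str) -> dict:
    """Average session_reward per execution_mode from an Atlas JSONL export.

    Sketch only; assumes the record layout described in the export section.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    with open(jsonl_path) as fh:
        for line in fh:
            record = json.loads(line)
            mode = record.get("execution_mode", "unknown")
            totals[mode] += record.get("session_reward", 0.0)
            counts[mode] += 1
    return {mode: totals[mode] / counts[mode] for mode in totals}
```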
Runtime instrumentation now enriches every session with an impact snapshot so we can reason about how each playbook entry contributes to adaptive efficiency (faster wins on known tasks) and cross-incident transfer (reusing guidance when the incident changes). The tracker captures reward/token deltas, incident identifiers, retry counts, and failure summaries per entry. The evaluation harness aggregates these into the playbook metadata under `playbook_entries[].impact` and exposes a dedicated Playbook Entry Impact section in both JSON and Markdown.
- Adoption rate – successful adoptions ÷ cue hits. A hit without adoption indicates guidance being seen but not followed; sustained adoption above 60% is a good reinforcement signal.
- Reward delta – average reward for sessions where the entry fired minus the average reward when it did not. Positive deltas demonstrate adaptive efficiency (more wins when guidance triggers); negative deltas suggest the entry may be stale or misleading.
- Token delta – average tokens with the entry firing minus tokens without it. Negative numbers imply efficiency gains (doing the job in fewer tokens); positive spikes highlight regressions in runtime cost.
- Transfer success – marked true when the entry triggers across at least two distinct task identifiers, demonstrating cross-task reuse.
- Failure avoidance stats – rolling average retries and recorded failure events when the entry fires. Falling retry counts or zero failure events indicate the entry is preventing repeat mistakes.
- Impact score – `adoption_rate × reward_delta`. This composite favours entries that are both frequently adopted and deliver positive reward deltas. Treat it as a prioritisation heuristic when curating the playbook: entries with negative scores should be audited first.
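The metric definitions above can be sketched directly. Parameter names are illustrative assumptions, not the exact keys Atlas stores:

```python
def _mean(xs: list) -> float:
    return sum(xs) / len(xs) if xs else 0.0

def impact_metrics(cue_hits: int, adoptions: int,
                   rewards_with: list, rewards_without: list,
                   tokens_with: list, tokens_without: list) -> dict:
    """Per-entry impact metrics as defined above (sketch)."""
    adoption_rate = adoptions / cue_hits if cue_hits else 0.0
    reward_delta = _mean(rewards_with) - _mean(rewards_without)
    token_delta = _mean(tokens_with) - _mean(tokens_without)
    return {
        "adoption_rate": adoption_rate,
        "reward_delta": reward_delta,
        "token_delta": token_delta,  # negative = efficiency gain
        "impact_score": adoption_rate * reward_delta,
    }

m = impact_metrics(cue_hits=5, adoptions=3,
                   rewards_with=[0.9, 0.7], rewards_without=[0.5, 0.7],
                   tokens_with=[900, 1100], tokens_without=[1200, 1400])
# adoption_rate ≈ 0.6, reward_delta ≈ 0.2, token_delta = -300.0, impact_score ≈ 0.12
```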
The same signals are stored session-by-session under `metadata.learning_usage.session` so you can audit individual runs or recompute experiment-specific aggregates.
Use the new `scripts/report_learning.py` helper to assemble structured summaries per learning key. The script queries Postgres directly (no JSONL export required) and produces JSON + Markdown reports under `results/learning/`.
```bash
python scripts/report_learning.py \
  --database-url postgresql://atlas:atlas@localhost:5433/atlas \
  --recent-window 10 \
  --baseline-window 50 \
  --limit 5 \
  --prompt-variant schema_v2 \
  --synthesis-model gpt-4o-mini --synthesis-model claude-3-sonnet \
  --pamphlet-injection toggle
```

Playbook verification: Leave `learning.apply_to_prompts` enabled (default: `true`) when running the script so generated summaries reflect the same guidance injected into runtime prompts. The resulting `results/learning/*_summary.json` entries include the current `teacher_playbook` payload for verification.
- `--summary-only` – skips per-session trajectory fetches and relies on SQL event counts; use this for large sweeps or CI.
- `--batch-size` – max number of learning keys evaluated concurrently (default: 4).
- `--filter-project`, `--filter-task`, `--filter-tag` – narrow the run to a specific codebase, task name, or session tags.
- `--learning-key` – analyze explicit keys instead of querying Postgres for the top-N recent keys.
- `--compare-to results/learning/index.json` – diff the current run against a previous harness export; the manifest stores per-key deltas and Markdown files append a comparison section.
- `--no-markdown` – emit only machine-readable JSON for automation scenarios.
- `--prompt-variant` – label the prompt/meta-prompt variant under test.
- `--synthesis-model` – record the LLM(s) used for pamphlet generation (repeatable; feeds model benchmarking comparisons).
- `--pamphlet-injection` – annotate whether pamphlet injection was on/off/toggled for transfer tests.
- `--playbook-entry-labels` – reference a JSON file with manual playbook entry category overrides (stored in the manifest for downstream tooling).
Summary mode is ideal for nightly or CI jobs where you just need reward deltas and model trends. Run the full-detail mode (default) when you want trajectory event counts sampled per session and are comfortable with additional database reads.
Outputs:
- `results/learning/<slug>_summary.json` – machine-readable payload (sessions, reward windows, discovery references).
- `results/learning/<slug>_summary.md` – human-friendly digest highlighting reward deltas, adaptive behaviour, and model breakdowns.
- `results/learning/index.json` – manifest listing every generated artifact, plus the comparison/aggregate tables when `--compare-to` is provided.
- `playbook_impact` (in both JSON + Markdown summaries) – per-entry adoption, reward/token delta, transfer, failure avoidance, and composite `impact_score` metrics for adaptive-efficiency tracking.
- `run_metadata` (in `index.json`) – captures prompt variant, synthesis models, pamphlet toggle mode, and optional playbook entry label overrides supplied via CLI flags.
Pass `--learning-key ...` to target specific keys, or `--no-markdown` when you only need JSON.
Each summary provides:
- Reward momentum – recent mean, baseline mean, and delta so you can spot positive/negative drift.
- Window context – recent/baseline window sizes so you can reason about sample counts.
- Execution modes – distribution of `execution_mode` values (auto, paired, coach, escalate) for the evaluated window.
- Review state – counts per `review_status` to ensure you compare approved vs pending runs intentionally.
- Model performance – per-role model breakdowns (session counts, reward averages, latest score) extracted from adapter telemetry so you can see which students/teachers are learning fastest.
- Discovery context – pointers to matching discovery/runtime telemetry (`discovery_runs`) for the same task so you can replay the original traces.
- Latest sessions – compact view of recent runs with reward/uncertainty snapshots and trajectory event counts.
- Playbook Entry Quality – aggregates the rubric outputs: candidate counts, gate failures, weighted score averages, and the weighting used.
- Playbook Entry Lifecycle – reinforcement vs differentiation counts split by `active`/`deprecated`, plus rejected candidates from the latest run.
- Playbook Entry Impact – adoption rate, reward/token deltas, transfer success, failure avoidance signals, and the composite `impact_score` for each entry so you can prioritise curation according to adaptive-efficiency gains.
- Runtime Usage – cue trigger totals, adoption counts, success rates, and trigger/adoption rates across sessions.
- Efficiency Snapshot – comparison of reward/tokens in sessions with cue hits versus those without, including deltas.
- Efficiency Snapshot – comparison of reward/tokens in sessions with cue hits versus those without, including deltas.
Because everything keys off `learning_key`, you can join the summary back to:

- `sessions` (runtime telemetry + reward signals)
- `discovery_runs` (discovery/runtime captures recorded via the CLI)
- `learning_registry` (current pamphlet state, when enabled)
When you pass `--compare-to`, the harness looks up the previous `index.json`, loads each saved summary, and computes deltas for:
- Reward trends (recent mean, latest score)
- Session counts per learning key
- Model-level utilisation and reward mean changes
- Cue hit/adoption changes (derived from the usage metrics section)
The new manifest includes comparisons and aggregate leaderboards (best/worst deltas), while each Markdown report gains a “Comparison vs previous run” section.
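The per-key delta computation is conceptually simple. A sketch under the assumption that each saved summary exposes a recent reward mean and a session count per learning key (the field names here are illustrative, not the manifest's actual schema):

```python
def summary_deltas(current: dict, previous: dict) -> dict:
    """Per-learning-key deltas between two harness runs (sketch)."""
    deltas = {}
    for key, cur in current.items():
        prev = previous.get(key)
        if prev is None:
            continue  # new learning key; nothing to compare against
        deltas[key] = {
            "recent_reward_mean": cur["recent_reward_mean"] - prev["recent_reward_mean"],
            "session_count": cur["session_count"] - prev["session_count"],
        }
    return deltas

deltas = summary_deltas(
    {"task-a": {"recent_reward_mean": 0.75, "session_count": 40}},
    {"task-a": {"recent_reward_mean": 0.60, "session_count": 30}},
)
# positive reward delta + more sessions: reward momentum is improving for task-a
```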
To stress-test the learning synthesizer and meta-prompt variants, run targeted sweeps with the configs under `configs/eval/learning/`:
- `baseline_openai.yaml` – baseline Gemini 2.5 Flash synthesizer and reinforcement-focused prompt.
- `scope_shift_openai.yaml` – emphasises differentiation and transfer hypotheses; default scope category set to `differentiation`.
- `baseline_claude.yaml` – Claude Haiku/Sonnet stack for student/teacher/synthesizer evaluation.
Generate fresh telemetry for each config (same dataset, different `learning_key`s), then compare `playbook_impact` sections across runs. Prioritise variants that increase adoption rate without regressing token deltas, and flag any entries with negative impact scores for remediation.
- The evaluation script ships with unit tests that stub database access and external LLM calls, so `pytest` covers the new entry points without touching live services.
- To keep the workflow reproducible, commit the generated summaries or re-run the script as part of your evaluation pipeline once fresh telemetry lands.
- When experimenting with prompt variants or different synthesis models, record the configuration via the new CLI flags so the manifest preserves the experimental context.
This evaluation workflow lets you measure learning progress, identify high-impact playbook entries, and validate transfer learning across tasks.