Prathuvj/TokenOptEnv

title: TokenOptEnv Environment Server
emoji: 😼
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /
tags: openenv, reinforcement-learning, benchmark, grpo

TokenOptEnv

TokenOptEnv is an OpenEnv environment for CAMRE: a cost-aware meta-reasoning benchmark for code and log tasks. Agents do not just solve tasks. They also learn how to manage context, retrieval, memory, checkpoints, compression, and model-routing strategy under explicit token and cost budgets.

This repository captures the full hackathon story across both V1 and V2 of CAMRE.

  • V1 establishes the core benchmark idea: explicit cache, compression, routing, and budget-aware decision-making instead of hidden middleware.
  • V2 extends that foundation with bounded working memory, checkpoints, selective hydration, milestone shaping, and escalation-aware routing for longer-horizon tasks.

The current deployed runtime is the V2 environment, but the repo intentionally preserves the V1 notebook, earlier benchmark framing, and the evolution path between the two versions.

Version Overview

V1 Core Contributions

  • explicit action space for reading artifacts, querying cache, compressing context, routing subtasks, and submitting answers
  • deterministic graders and seeded task scenarios
  • a real-world model catalog with simulated cost/quality behavior
  • composite reward shaping around correctness, budget use, and routing decisions

V2 Extensions

  • Bounded working memory so the agent cannot keep everything loaded forever.
  • Explicit checkpoints so the agent must decide what facts to preserve before purging memory.
  • Artifact handles and selective hydration so large/noisy artifacts are often previewed first and only the relevant segments are loaded into working memory.
  • Cascade routing with confidence-based escalation so a weak route can be escalated to a stronger model only when it is justified.
  • Milestone-based shaping so long-horizon tasks emit dense process rewards rather than only terminal outcomes.

Together, V1 and V2 bring CAMRE closer to the real systems problem the benchmark measures: solving a task efficiently, strategically, and durably over long trajectories, not just solving it.
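The bounded working memory and checkpoint ideas from the V2 extensions can be illustrated with a small standalone sketch. It is not the code in memory_manager.py: the class name, fields, and token counts below are hypothetical and only show why an agent has to checkpoint facts before it purges.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Toy bounded working memory; every name here is hypothetical."""

    capacity_tokens: int
    entries: dict[str, int] = field(default_factory=dict)  # segment id -> tokens
    checkpoints: dict[str, list[str]] = field(default_factory=dict)  # id -> facts

    @property
    def used_tokens(self) -> int:
        return sum(self.entries.values())

    def hydrate(self, segment_id: str, tokens: int) -> bool:
        """Load a segment only if it fits in the remaining budget."""
        already_loaded = self.entries.get(segment_id, 0)
        if self.used_tokens - already_loaded + tokens > self.capacity_tokens:
            return False  # the agent must purge or compress first
        self.entries[segment_id] = tokens
        return True

    def create_checkpoint(self, checkpoint_id: str, facts: list[str]) -> None:
        """Persist a few distilled facts outside working memory."""
        self.checkpoints[checkpoint_id] = list(facts)

    def purge(self) -> None:
        """Drop all loaded segments; checkpoints survive the purge."""
        self.entries.clear()


memory = WorkingMemory(capacity_tokens=100)
first_load = memory.hydrate("seg-a", 80)
second_load = memory.hydrate("seg-b", 40)  # rejected: 80 + 40 exceeds 100
memory.create_checkpoint("cp-1", ["seg-a shows consumer lag rising"])
memory.purge()
third_load = memory.hydrate("seg-b", 40)  # fits once seg-a is gone
```

In this toy version the only way to make room is a full purge, so whatever the agent wants to keep from seg-a has to be written to a checkpoint first.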

Benchmark Goals

CAMRE is designed to answer questions like:

  • When should an agent read directly versus hydrate targeted evidence?
  • When should it checkpoint knowledge before purging working memory?
  • When should it query cache instead of recomputing?
  • When should it compress context?
  • When should it escalate from a cheap model to a stronger one?

The benchmark is deterministic enough for repeatable RL training and structured enough for decomposed reward analysis.
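The escalation question can be made concrete with a minimal sketch of confidence-based cascade routing. It assumes each model returns an answer together with a confidence score and a cost; the function names, the 0.8 threshold, and the two stand-in models are hypothetical and are not taken from the environment's simulators.

```python
from typing import Callable, NamedTuple


class RouteResult(NamedTuple):
    answer: str
    confidence: float  # assumed to lie in [0, 1]
    cost: float


def cascade_route(
    task: str,
    tiers: list[Callable[[str], RouteResult]],
    min_confidence: float = 0.8,
) -> tuple[RouteResult, float]:
    """Try tiers from cheapest to strongest; stop at the first confident answer.

    Returns the last result obtained and the total cost paid across attempts.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    total_cost = 0.0
    result = None
    for model in tiers:
        result = model(task)
        total_cost += result.cost
        if result.confidence >= min_confidence:
            break  # confident enough, no escalation needed
    return result, total_cost


# Hypothetical stand-ins for a small and a large model.
def cheap_model(task: str) -> RouteResult:
    confident = "lookup" in task
    return RouteResult("cheap answer", 0.9 if confident else 0.4, cost=1.0)


def strong_model(task: str) -> RouteResult:
    return RouteResult("strong answer", 0.95, cost=10.0)


easy_result, easy_cost = cascade_route("cache lookup", [cheap_model, strong_model])
hard_result, hard_cost = cascade_route("root-cause analysis", [cheap_model, strong_model])
```

Escalation is not free in this sketch: the hard task pays for both the failed cheap attempt and the strong model, which is the trade-off an escalation-aware reward would need to price.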

Result Snapshots

The current public training artifact is the step-100 export from the first stable CAMRE V2 GRPO run. It is the checkpoint linked from the Spaces and the model repo; the broader project and benchmark narrative spans both the original V1 environment design and the later V2 long-horizon extension.

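Reward Progress

[Figure: CAMRE V2 reward progress (assets/plots/camre_v2_reward_progress.png)]

Training Diagnostics

[Figure: CAMRE V2 training diagnostics (assets/plots/camre_v2_training_diagnostics.png)]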

Task Families

CAMRE currently ships three code/log-focused task families:

  • incident_cache_lookup
  • noisy_log_triage_summary
  • bug_triage_and_repair_decision

Each scenario includes the V1 benchmark core, with V2 runtime extensions layered on top where appropriate:

  • structured artifacts with segment-level salience
  • seeded cache opportunities
  • compression targets
  • routing hints
  • deterministic structured-answer ground truth
  • V2 working-memory constraints
  • V2 checkpoint policy
  • V2 milestone graph
  • V2 cascade-routing policy

Action Space

V1 Core Actions

  • read_artifact
  • query_cache
  • compress_context
  • route_subtask
  • submit_answer
  • terminate_episode

V2 Extended Actions

The current deployed environment adds these long-horizon control actions on top of the V1 core:

  • hydrate_segments
  • create_checkpoint
  • hydrate_checkpoint
  • purge_working_memory

Across both versions, the design goal is the same: make the control problem explicit instead of hiding retrieval, memory, and routing choices inside middleware.
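As a rough illustration of how the two action sets relate, the names above can be modelled as a single enum with a V2-only subset. This is a sketch, not the definition in models.py: `ActionKind`, `V2_ONLY`, and `allowed_actions` are hypothetical names, and only the string values come from the lists above.

```python
from enum import Enum


class ActionKind(str, Enum):
    """Hypothetical mirror of the action names listed above."""

    # V1 core actions
    READ_ARTIFACT = "read_artifact"
    QUERY_CACHE = "query_cache"
    COMPRESS_CONTEXT = "compress_context"
    ROUTE_SUBTASK = "route_subtask"
    SUBMIT_ANSWER = "submit_answer"
    TERMINATE_EPISODE = "terminate_episode"
    # V2 extended actions
    HYDRATE_SEGMENTS = "hydrate_segments"
    CREATE_CHECKPOINT = "create_checkpoint"
    HYDRATE_CHECKPOINT = "hydrate_checkpoint"
    PURGE_WORKING_MEMORY = "purge_working_memory"


V2_ONLY = frozenset(
    {
        ActionKind.HYDRATE_SEGMENTS,
        ActionKind.CREATE_CHECKPOINT,
        ActionKind.HYDRATE_CHECKPOINT,
        ActionKind.PURGE_WORKING_MEMORY,
    }
)


def allowed_actions(version: int) -> set[ActionKind]:
    """Return the actions available in benchmark version 1 or 2."""
    if version not in (1, 2):
        raise ValueError(f"unknown benchmark version: {version}")
    everything = set(ActionKind)
    return everything if version == 2 else everything - V2_ONLY
```

Keeping V2 as a strict superset of V1 means a policy trained on the V1 action set remains valid in the V2 runtime, even if it never uses the memory-control actions.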

Observation Surface

V1 exposes the core task, budget, cache, compression, and routing telemetry. The current V2 deployment extends that surface so the agent can additionally observe:

  • task family, scenario title, and answer schema
  • artifact manifest with access mode and preview handles
  • latest tool result
  • token and cost budgets
  • cache statistics
  • compression statistics
  • routing history and confidence signals
  • working-memory usage and loaded entries
  • checkpoint manifest
  • milestone progress counters
  • reward components and guardrail events

Hidden state remains private inside the environment so the agent cannot directly inspect oracle truth, milestone conditions, or routing internals.

Reward Design

CAMRE uses a composite reward design across both versions.

V1 Reward Foundations

  • terminal correctness
  • partial progress
  • action validity
  • cache decision quality
  • compression decision quality
  • routing decision quality
  • budget efficiency

V2 Additions

V2 keeps the V1 components and adds longer-horizon process supervision:

  • milestone progress
  • memory-management quality
  • hydration efficiency
  • routing-escalation quality

Guardrails penalize failure modes such as:

  • invalid actions
  • repeated loops
  • unsafe purge behavior
  • working-memory overflow
  • checkpoint spam
  • repeated routing without need
  • critical-context loss during compression
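One simple way to combine components and guardrails like these is a weighted sum minus penalties, returned with its breakdown so each term can be inspected. The sketch below assumes that form; the weights, penalty sizes, and function name are hypothetical, and the actual combination rule in rewards.py may differ.

```python
# Hypothetical weights and penalties, chosen only for illustration.
WEIGHTS = {
    "terminal_correctness": 1.0,
    "milestone_progress": 0.3,
    "budget_efficiency": 0.2,
}
PENALTIES = {
    "invalid_action": 0.1,
    "checkpoint_spam": 0.05,
}


def composite_reward(components: dict[str, float], events: list[str]) -> dict:
    """Weighted sum of reward components minus guardrail penalties.

    Components without a configured weight contribute nothing, and
    guardrail events without a configured penalty are ignored.
    """
    weighted = {
        name: WEIGHTS.get(name, 0.0) * value for name, value in components.items()
    }
    penalty = sum(PENALTIES.get(event, 0.0) for event in events)
    total = sum(weighted.values()) - penalty
    return {"components": weighted, "penalty": penalty, "total": total}


breakdown = composite_reward(
    {"terminal_correctness": 1.0, "milestone_progress": 0.5, "budget_efficiency": 0.5},
    ["invalid_action"],
)
```

Returning the per-component values alongside the total is what makes decomposed reward analysis possible: a drop in the total can be traced to a specific component or guardrail event.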

Model Catalog

The frozen catalog lives in catalog.py and includes:

  • open-source SLMs
  • open-source LLMs
  • closed-source SLMs
  • closed-source LLMs

The catalog is based on real-world model identities and metadata, but runtime behavior is simulated so training and evaluation remain reproducible.
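A common way to get reproducible simulated behavior is to derive the random stream from a stable hash of the model, task, and seed. The sketch below assumes that approach; `CatalogEntry`, `simulate_call`, and the example values are hypothetical and do not describe the contents of catalog.py or simulators.py.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record with made-up fields."""

    model_id: str
    cost_per_1k_tokens: float
    base_quality: float  # chance of a correct answer, in [0, 1]


def simulate_call(
    entry: CatalogEntry, task_id: str, seed: int, tokens: int
) -> tuple[bool, float]:
    """Return (correct, cost) deterministically for a given model, task, and seed."""
    # hashlib is used instead of hash() because str hashes are salted per process.
    key = f"{entry.model_id}|{task_id}|{seed}".encode()
    digest = hashlib.sha256(key).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    correct = rng.random() < entry.base_quality
    cost = entry.cost_per_1k_tokens * tokens / 1000
    return correct, cost


small = CatalogEntry("example-small-model", cost_per_1k_tokens=0.5, base_quality=0.6)
first = simulate_call(small, "incident_cache_lookup", seed=7, tokens=2000)
second = simulate_call(small, "incident_cache_lookup", seed=7, tokens=2000)
```

Because the outcome depends only on the inputs, repeating a call with the same seed gives the same result across runs and machines, while changing the seed gives a different but equally repeatable draw.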

Runtime Architecture

The current deployed runtime is the V2 architecture, split into focused modules:

  • server/TokenOptEnv_environment.py - OpenEnv adapter and episode lifecycle
  • episode_runtime.py - internal per-episode mutable runtime state
  • memory_manager.py - working memory, hydration, purge, and checkpoints
  • milestone_engine.py - hidden milestone tracking and milestone rewards
  • observation_builder.py - public observation/state construction
  • simulators.py - cache, compression, and routing simulation
  • rewards.py - composite reward calculation and anti-cheat guardrails
  • scenario_store.py - seeded benchmark scenarios and graders
  • task_defs.py - scenario schemas and V2 config objects
  • models.py - OpenEnv-facing action, observation, and state models

Quick Start

from TokenOptEnv import ActionType, TokenOptEnvAction
from TokenOptEnv.server.TokenOptEnv_environment import TokenOptEnvEnvironment

# Start a seeded episode so the run is reproducible.
env = TokenOptEnvEnvironment()
obs = env.reset(scenario_id="incident-medium-kafka-backpressure", seed=7)

# Read an artifact; the tool result reports its access mode and preview segments.
obs = env.step(
    TokenOptEnvAction(
        action_type=ActionType.READ_ARTIFACT,
        artifact_id="log-reconciler-5521",
    )
)

print(obs.last_tool_result.payload["access_mode"])
print(obs.last_tool_result.payload["preview_segment_ids"])

# Hydrate only the relevant segment into working memory.
obs = env.step(
    TokenOptEnvAction(
        action_type=ActionType.HYDRATE_SEGMENTS,
        context_segment_ids=["log-reconciler-5521-s1"],
    )
)

# Inspect working-memory usage and the milestones achieved so far.
print(obs.working_memory.token_budget_used)
print(obs.milestones.achieved_ids)

Training Notebooks

The repo includes both versions of the training workflow:

  • notebooks/CAMRE_GRPO_Training.ipynb - original V1 GRPO notebook
  • notebooks/CAMRE_GRPO_Training_V2.ipynb - operational V2 notebook aligned to the public step-100 export flow
  • notebooks/CAMRE_GRPO_Training_V2_Clean.ipynb - polished V2 notebook with explanatory markdown for reproducible reruns

In other words: V1 documents the initial benchmark/training path, and V2 documents the longer-horizon upgrade path.

Running the Server Locally

uvicorn server.app:app --reload --host 0.0.0.0 --port 8000

Repo Structure

TokenOptEnv/
|-- assets/
|   `-- plots/
|       |-- camre_v2_reward_progress.png
|       `-- camre_v2_training_diagnostics.png
|-- Dockerfile
|-- __init__.py
|-- catalog.py
|-- episode_runtime.py
|-- memory_manager.py
|-- milestone_engine.py
|-- models.py
|-- notebooks/
|   |-- CAMRE_GRPO_Training.ipynb
|   |-- CAMRE_GRPO_Training_V2.ipynb
|   `-- CAMRE_GRPO_Training_V2_Clean.ipynb
|-- observation_builder.py
|-- openenv.yaml
|-- pyproject.toml
|-- README.md
|-- rewards.py
|-- scenario_store.py
|-- simulators.py
|-- task_defs.py
|-- uv.lock
`-- server/
    |-- __init__.py
    |-- TokenOptEnv_environment.py
    |-- app.py
    `-- Dockerfile

Current Positioning

CAMRE as a project is positioned as a research benchmark first:

  • deterministic enough for repeatable RL training
  • realistic enough to study cost-aware inference control from V1 through V2
  • explicit enough to expose routing, memory, checkpoint, and escalation decisions as learnable actions
  • structured enough to support detailed reward decomposition and debugging

Public Artifact Policy

  • Environment runtime: public and actively deployed in the Environment Space (currently the V2 runtime)
  • Training workspace: hosted separately in the Training Space
  • Model artifact: public adapter repo built around the exported step-100 CAMRE V2 checkpoint

This keeps the environment, training workflow, and model artifact separately reusable while still linking them together as one benchmark story that includes both the original V1 benchmark and the extended V2 runtime.

Notes

  • The public runtime repo is intentionally lean. Some local-only utilities used during development may not be tracked here.
  • The benchmark is designed to evolve while preserving comparability across scenario IDs and task families.
