The gap
AgentOps tracks what agents do per session — tool calls, costs, errors, LLM calls. What it doesn't cover: whether a long-running agent is still doing the same thing session over session after its context window is compressed or rotated.
When an agent's context fills and gets compressed or rotated, three measurable changes can occur silently:
- Behavioral footprint shift — the tool-call patterns (which tools, in what sequence, at what frequency) change after a context boundary
- Ghost lexicon decay — precision vocabulary the agent was using reliably stops appearing post-boundary
- Semantic drift — the distributional signature of responses shifts
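As an illustration of the first signal, the tool-call footprint comparison could be as simple as a cosine distance between per-session frequency vectors. A minimal sketch (the tool names, sessions, and threshold below are all hypothetical, not AgentOps data):

```python
from collections import Counter
from math import sqrt

def footprint_vector(tool_calls):
    """Frequency vector of tool names observed in one session."""
    return Counter(tool_calls)

def cosine_distance(a, b):
    """1 - cosine similarity between two Counter frequency vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical sessions before and after a context boundary
pre = footprint_vector(["search", "search", "read_file", "write_file", "search"])
post = footprint_vector(["search", "summarize", "summarize", "summarize"])

THRESHOLD = 0.3  # assumed value; would need tuning per agent
shift = cosine_distance(pre, post)
if shift > THRESHOLD:
    print(f"behavioral footprint shift: {shift:.2f}")  # prints 0.71 here
```

Real footprints would also encode sequence and timing, not just frequency; this is the cheapest version of the idea.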
AgentOps is well-positioned to surface these because it already captures per-session tool call sequences and response text. The cross-session comparison is the missing layer.
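For the third signal, one cheap distributional proxy is a token-level Jensen-Shannon divergence between response windows. This is a sketch under assumptions; a production version would more likely compare embedding distributions:

```python
from collections import Counter
from math import log

def token_distribution(texts):
    """Normalized token frequency distribution over a window of response texts."""
    counts = Counter(w.lower() for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jensen_shannon(p, q):
    """Symmetric divergence between two distributions (base 2, so in [0, 1])."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * log(a[k] / m[k], 2) for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical pre/post-boundary response windows
drift = jensen_shannon(
    token_distribution(["check the schema then write the migration"]),
    token_distribution(["write a summary of recent activity"]),
)
```

Identical windows score 0, fully disjoint vocabularies score 1, so a per-agent alert threshold sits somewhere in between.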
Why this matters for agent operators
The agent can pass all per-run quality checks while having materially changed behavior. The observable failure mode: "it's still running, costs look normal, but it stopped doing the nuanced thing it was doing three weeks ago." Without cross-session behavioral comparison, this degrades silently.
What a native integration could look like
Since AgentOps already stores tool call sequences and session metadata:
- Cross-session tool-call diff: compare tool-call frequency vectors across a configurable session window and flag when the behavioral footprint shifts beyond a threshold
- Vocabulary consistency check: lightweight keyword overlap across sessions, specifically targeting low-frequency high-precision terms
- Boundary marker: let agents explicitly tag context-rotation events so session windows can be aligned correctly
All computable from data AgentOps already has.
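A minimal sketch of the vocabulary consistency check, assuming session response texts are available as plain strings. The frequency and length heuristics here are placeholders for whatever actually identifies "precision vocabulary", not AgentOps behavior:

```python
from collections import Counter

def precision_terms(sessions, max_freq=3, min_len=8):
    """Low-frequency, longer terms across a window of session texts.
    A crude proxy for precision vocabulary; both cutoffs are assumptions."""
    counts = Counter(
        w.lower()
        for text in sessions
        for w in text.split()
        if len(w) >= min_len
    )
    return {w for w, c in counts.items() if c <= max_freq}

def lexicon_retention(pre_sessions, post_sessions):
    """Fraction of pre-boundary precision terms still appearing post-boundary."""
    pre = precision_terms(pre_sessions)
    if not pre:
        return 1.0
    post_words = {w.lower() for t in post_sessions for w in t.split()}
    return len(pre & post_words) / len(pre)

# Hypothetical pre/post-boundary session texts
pre = ["the idempotency token prevents duplicate submissions"]
post = ["the token stops repeats"]
print(f"retention: {lexicon_retention(pre, post):.2f}")  # retention: 0.00
```

A retention score trending toward zero across session windows is exactly the "ghost lexicon decay" signal described above.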
Reference implementation
A standalone toolkit, compression-monitor, implements the same three instruments:
- ghost_lexicon.py, behavioral_footprint.py, semantic_drift.py — one module per signal
- preregister.py — pre-commit behavioral predictions before a context boundary, then evaluate them after it
- monitor.py — unified CLI (python3 monitor.py demo)
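The preregistration workflow could look roughly like this (a sketch of the idea, not the actual preregister.py; the metric names and ranges are invented):

```python
import json
import os
import tempfile

def preregister(path, predictions):
    """Record predicted metric ranges before a context boundary.
    `predictions` maps a metric name to its expected (lo, hi) range."""
    with open(path, "w") as f:
        json.dump({"predictions": predictions}, f)

def evaluate(path, observed):
    """After the boundary, check each observed metric against its range."""
    with open(path) as f:
        predictions = json.load(f)["predictions"]
    return {
        name: lo <= observed.get(name, float("nan")) <= hi
        for name, (lo, hi) in predictions.items()
    }

# Hypothetical workflow: commit a prediction, then evaluate post-boundary
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
preregister(path, {"tool_calls_per_task": (3, 8)})
results = evaluate(path, {"tool_calls_per_task": 12})
print(results)  # {'tool_calls_per_task': False}
os.remove(path)
```

The point of committing predictions up front is that it keeps post-boundary evaluation honest: the pass/fail criteria can't be adjusted after the behavior has already shifted.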
Related research:
- arXiv:2601.04170 — "Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems" (directly relevant, Jan 2026)
- arXiv:2602.22769 — AMA-Bench: long-horizon agent memory evaluation
Background on why operators should care: Why Measure Compression
Questions
- Is cross-session behavioral consistency something AgentOps is considering?
- The tool-call sequence data AgentOps stores seems like the right raw input for behavioral_footprint.py — is there a public API or export format I could use to prototype this?
- Better venue for this discussion (Discord, roadmap)?
Happy to share more or prototype against a test dataset if helpful.