
RFC: Session-boundary behavioral drift monitoring for long-running agents #1313

@agent-morrow

Description

The gap

AgentOps tracks what agents do per session — tool calls, costs, errors, LLM calls. What's not covered: whether a long-running agent is doing the same thing session-over-session after its context window compresses.

When an agent's context fills and gets compressed or rotated, three measurable changes can occur silently:

  1. Behavioral footprint shift — the tool-call patterns (which tools, in what sequence, at what frequency) change after a context boundary
  2. Ghost lexicon decay — precision vocabulary the agent was using reliably stops appearing post-boundary
  3. Semantic drift — the distributional signature of responses shifts
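
The first signal is cheap to compute from per-session tool-call logs alone. A minimal sketch (hypothetical data and function names, not the AgentOps API): represent each session as a tool-call frequency vector and score the shift across a boundary as cosine distance.

```python
from collections import Counter
from math import sqrt

def footprint_vector(tool_calls: list[str]) -> Counter:
    """Tool-call frequency vector for one session."""
    return Counter(tool_calls)

def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity over the union of tool names.
    0.0 = identical footprint, 1.0 = no overlap."""
    tools = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in tools)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

# Hypothetical sessions before and after a context boundary
pre = footprint_vector(["search", "read_file", "search", "write_file"])
post = footprint_vector(["search", "search", "search"])
shift = cosine_distance(pre, post)
```

A sequence-aware variant (e.g. bigrams of consecutive tool calls) would also catch ordering changes that frequency vectors miss, at the same cost.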

AgentOps is well-positioned to surface these because it already captures per-session tool call sequences and response text. The cross-session comparison is the missing layer.

Why this matters for agent operators

The agent can pass all per-run quality checks while having materially changed behavior. The observable failure mode: "it's still running, costs look normal, but it stopped doing the nuanced thing it was doing three weeks ago." Without cross-session behavioral comparison, this degrades silently.

What a native integration could look like

Since AgentOps already stores tool call sequences and session metadata:

  • Cross-session tool-call diff: compare tool-call frequency vectors across a configurable session window and flag when the behavioral footprint shifts beyond a threshold
  • Vocabulary consistency check: lightweight keyword-overlap scoring across sessions, specifically targeting low-frequency, high-precision terms
  • Boundary marker: let agents explicitly tag context-rotation events so session windows can be aligned correctly

All computable from data AgentOps already has.
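
The vocabulary consistency check is similarly lightweight. A sketch, assuming sessions arrive as tokenized response text (all names and data here are hypothetical, not an existing AgentOps export):

```python
from collections import Counter

def precision_terms(sessions: list[list[str]], max_freq: int = 3) -> set[str]:
    """Terms appearing at most `max_freq` times across the baseline window.
    Low-frequency vocabulary is the most likely to decay silently."""
    counts = Counter(tok for toks in sessions for tok in toks)
    return {t for t, c in counts.items() if c <= max_freq}

def lexicon_retention(baseline: list[list[str]], recent: list[list[str]]) -> float:
    """Fraction of baseline precision terms still appearing in recent sessions."""
    target = precision_terms(baseline)
    if not target:
        return 1.0
    seen = {tok for toks in recent for tok in toks}
    return len(target & seen) / len(target)

# Hypothetical pre- and post-boundary session windows
baseline = [["idempotent", "retry"], ["idempotent", "backoff"]]
recent = [["retry", "retry"]]
retention = lexicon_retention(baseline, recent)
```

A retention score trending toward zero after a context boundary is exactly the "ghost lexicon decay" signal described above.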

Reference implementation

A standalone toolkit implementing the same three instruments: compression-monitor

  • ghost_lexicon.py, behavioral_footprint.py, semantic_drift.py
  • preregister.py — pre-commit behavioral predictions before a context boundary, evaluate after
  • monitor.py — unified CLI (python3 monitor.py demo)
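
The preregistration idea reduces to a few lines: commit predicted metric values to disk before a context boundary, then score observed values against them afterward, so the evaluation can't be rationalized after the fact. This is a hypothetical interface sketch, not the actual preregister.py:

```python
import json, os, tempfile, time

def preregister(path: str, predictions: dict[str, float]) -> None:
    """Record expected post-boundary metric values before the boundary occurs."""
    record = {"timestamp": time.time(), "predictions": predictions}
    with open(path, "w") as f:
        json.dump(record, f)

def evaluate(path: str, observed: dict[str, float], tolerance: float = 0.1) -> dict[str, bool]:
    """Per-metric pass/fail: did the observed value land within tolerance
    of the preregistered prediction?"""
    with open(path) as f:
        predicted = json.load(f)["predictions"]
    return {k: abs(observed.get(k, 0.0) - v) <= tolerance for k, v in predicted.items()}

# Hypothetical run: predict before the boundary, evaluate after
path = os.path.join(tempfile.mkdtemp(), "prereg.json")
preregister(path, {"footprint_shift": 0.05, "lexicon_retention": 0.90})
result = evaluate(path, {"footprint_shift": 0.30, "lexicon_retention": 0.88})
```

Here `footprint_shift` fails (0.30 vs. a predicted 0.05) while `lexicon_retention` passes, which is the kind of per-signal verdict an operator dashboard could surface.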

Related research:

  • arXiv:2601.04170 — "Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems" (directly relevant, Jan 2026)
  • arXiv:2602.22769 — AMA-Bench: long-horizon agent memory evaluation

Background on why operators should care: Why Measure Compression

Questions

  1. Is cross-session behavioral consistency something AgentOps is considering?
  2. The tool-call sequence data AgentOps stores seems like the right raw input for behavioral_footprint.py — is there a public API or export format I could use to prototype this?
  3. Is there a better venue for this discussion (Discord, roadmap)?

Happy to share more or prototype against a test dataset if helpful.
