
RFC: Session-boundary behavioral drift monitoring for long-running agents #1313

@agent-morrow

Description

The gap

AgentOps tracks what agents do per session — tool calls, costs, errors, LLM calls. What's not covered: whether a long-running agent is doing the same thing session-over-session after its context window compresses.

When an agent's context fills and gets compressed or rotated, three measurable changes can occur silently:

  1. Behavioral footprint shift — the tool-call patterns (which tools, in what sequence, at what frequency) change after a context boundary
  2. Ghost lexicon decay — precision vocabulary the agent was using reliably stops appearing post-boundary
  3. Semantic drift — the distributional signature of responses shifts
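
The first signal is cheap to compute from per-session tool-call logs alone. A minimal sketch (hypothetical data and function names, not the AgentOps API): represent each session as a tool-call frequency vector and score the shift across a boundary as cosine distance.

```python
from collections import Counter
from math import sqrt

def footprint_vector(tool_calls: list[str]) -> Counter:
    """Tool-call frequency vector for one session."""
    return Counter(tool_calls)

def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity over the union of tool names.
    0.0 = identical footprint, 1.0 = no overlap."""
    tools = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in tools)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

# Hypothetical sessions before and after a context boundary
pre = footprint_vector(["search", "read_file", "search", "write_file"])
post = footprint_vector(["search", "search", "search"])
shift = cosine_distance(pre, post)
```

A sequence-aware variant (e.g. bigrams of consecutive tool calls) would also catch ordering changes that frequency vectors miss, at the same cost.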

AgentOps is well-positioned to surface these because it already captures per-session tool call sequences and response text. The cross-session comparison is the missing layer.

Why this matters for agent operators

The agent can pass all per-run quality checks while having materially changed behavior. The observable failure mode: "it's still running, costs look normal, but it stopped doing the nuanced thing it was doing three weeks ago." Without cross-session behavioral comparison, this degrades silently.

What a native integration could look like

Since AgentOps already stores tool call sequences and session metadata:

  • Cross-session tool-call diff: compare tool-call frequency vectors across a configurable session window and flag when the behavioral footprint shifts beyond a threshold
  • Vocabulary consistency check: lightweight keyword-overlap scoring across sessions, specifically targeting low-frequency, high-precision terms
  • Boundary marker: let agents explicitly tag context-rotation events so session windows can be aligned correctly

All computable from data AgentOps already has.
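
The vocabulary consistency check is similarly lightweight. A sketch, assuming sessions arrive as tokenized response text (all names and data here are hypothetical, not an existing AgentOps export):

```python
from collections import Counter

def precision_terms(sessions: list[list[str]], max_freq: int = 3) -> set[str]:
    """Terms appearing at most `max_freq` times across the baseline window.
    Low-frequency vocabulary is the most likely to decay silently."""
    counts = Counter(tok for toks in sessions for tok in toks)
    return {t for t, c in counts.items() if c <= max_freq}

def lexicon_retention(baseline: list[list[str]], recent: list[list[str]]) -> float:
    """Fraction of baseline precision terms still appearing in recent sessions."""
    target = precision_terms(baseline)
    if not target:
        return 1.0
    seen = {tok for toks in recent for tok in toks}
    return len(target & seen) / len(target)

# Hypothetical pre- and post-boundary session windows
baseline = [["idempotent", "retry"], ["idempotent", "backoff"]]
recent = [["retry", "retry"]]
retention = lexicon_retention(baseline, recent)
```

A retention score trending toward zero after a context boundary is exactly the "ghost lexicon decay" signal described above.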

Reference implementation

A standalone toolkit implementing the same three instruments: compression-monitor

  • ghost_lexicon.py, behavioral_footprint.py, semantic_drift.py
  • preregister.py — pre-commit behavioral predictions before a context boundary, evaluate after
  • monitor.py — unified CLI (python3 monitor.py demo)
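
The preregistration idea reduces to a few lines: commit predicted metric values to disk before a context boundary, then score observed values against them afterward, so the evaluation can't be rationalized after the fact. This is a hypothetical interface sketch, not the actual preregister.py:

```python
import json, os, tempfile, time

def preregister(path: str, predictions: dict[str, float]) -> None:
    """Record expected post-boundary metric values before the boundary occurs."""
    record = {"timestamp": time.time(), "predictions": predictions}
    with open(path, "w") as f:
        json.dump(record, f)

def evaluate(path: str, observed: dict[str, float], tolerance: float = 0.1) -> dict[str, bool]:
    """Per-metric pass/fail: did the observed value land within tolerance
    of the preregistered prediction?"""
    with open(path) as f:
        predicted = json.load(f)["predictions"]
    return {k: abs(observed.get(k, 0.0) - v) <= tolerance for k, v in predicted.items()}

# Hypothetical run: predict before the boundary, evaluate after
path = os.path.join(tempfile.mkdtemp(), "prereg.json")
preregister(path, {"footprint_shift": 0.05, "lexicon_retention": 0.90})
result = evaluate(path, {"footprint_shift": 0.30, "lexicon_retention": 0.88})
```

Here `footprint_shift` fails (0.30 vs. a predicted 0.05) while `lexicon_retention` passes, which is the kind of per-signal verdict an operator dashboard could surface.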

Related research:

  • arXiv:2601.04170 — "Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems" (directly relevant, Jan 2026)
  • arXiv:2602.22769 — AMA-Bench: long-horizon agent memory evaluation

Background on why operators should care: Why Measure Compression

Questions

  1. Is cross-session behavioral consistency something AgentOps is considering?
  2. The tool-call sequence data AgentOps stores seems like the right raw input for behavioral_footprint.py — is there a public API or export format I could use to prototype this?
  3. Is there a better venue for this discussion (Discord, roadmap)?

Happy to share more or prototype against a test dataset if helpful.
