PulseBot Agent Observability: Detailed Design
Comprehensive Event Streaming via pulsebot.events
Version: 1.0 | Author: Timeplus Engineering | Date: 2026-03-20
1. Motivation
PulseBot already logs LLM calls to pulsebot.llm_logs and tool executions to pulsebot.tool_logs. However, the rest of the agent's behavior (lifecycle transitions, state changes, session management, multi-agent coordination, memory operations, hook verdicts, skill hot-reloading, channel connectivity, and scheduled task execution) remains invisible.
The pulsebot.events stream exists today but is barely used. This design proposes a unified, structured event taxonomy that turns pulsebot.events into the single source of truth for everything the agent does, enabling real-time dashboards, anomaly detection via streaming SQL, compliance audits, and post-mortem debugging.
Design Principles
- Every state transition emits an event. If something changes, we know about it.
- Events are cheap. The `payload` field carries JSON; producers just call `events_writer.write()`.
- Streaming SQL is the query layer. All dashboards and alerts are streaming SQL over `pulsebot.events`.
- Correlation via `session_id` + `agent_id`. Every event can be traced back to a user session and a specific agent instance.
- Severity is meaningful. `debug` for verbose tracing, `info` for normal operations, `warning` for degraded states, `error` for failures, `critical` for unrecoverable situations.
2. Stream Schema
The existing pulsebot.events stream schema is sufficient and requires no DDL changes:
CREATE STREAM IF NOT EXISTS pulsebot.events (
id string DEFAULT uuid(),
timestamp datetime64(3) DEFAULT now64(3),
event_type string, -- Hierarchical: 'agent.started', 'session.opened', etc.
source string, -- Who emitted: 'agent:main', 'channel:telegram', 'skill:shell'
severity string, -- 'debug', 'info', 'warning', 'error', 'critical'
payload string, -- JSON: event-specific data
tags array(string) -- Filterable labels: ['lifecycle', 'agent:main', 'session:abc123']
)
SETTINGS event_time_column='timestamp';
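To make the column layout concrete, here is a minimal Python sketch of how a producer might assemble one row for this stream. The helper name `build_event` and the sample payload keys are assumptions for illustration, not part of the existing codebase. Because `id` and `timestamp` have defaults in the DDL, a real producer could omit them; they are filled in here only so the row is complete on the client side.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(event_type, source, severity="info", payload=None, tags=None):
    """Return a dict whose keys match the pulsebot.events columns.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return {
        "id": str(uuid.uuid4()),
        # Millisecond precision, mirroring datetime64(3).
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "event_type": event_type,
        "source": source,
        "severity": severity,
        # The payload column is a string, so the dict is serialized to JSON.
        "payload": json.dumps(payload or {}, separators=(",", ":")),
        "tags": list(tags or []),
    }


row = build_event(
    "session.opened",
    source="agent:main",
    payload={"agent_id": "main", "session_id": "abc123"},
    tags=["session", "agent:main", "session:abc123"],
)
```

Carrying `session_id` and `agent_id` in both the payload and the tags is one way to satisfy the correlation principle while keeping tag-based filtering cheap.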
Conventions
| Field | Convention | Example |
|---|---|---|
| event_type | Dot-separated hierarchy: {category}.{action} | agent.started, tool.hook_denied |
| source | {component_type}:{instance_id} | agent:main, channel:telegram, skill:shell |
| severity | syslog-style 5-level | info |
| payload | Flat or shallow JSON, ≤ 4KB recommended | {"agent_id":"main","model":"claude-sonnet-4-20250514"} |
| tags | Include category + identifiers for fast filtering | ["lifecycle","agent:main"] |
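The conventions above could be enforced with a small checker in unit tests or in the writer itself. The following is a sketch under stated assumptions: the function name `check_event`, the exact regular expressions, and the choice to treat the 4KB recommendation as a reportable problem are all illustrative, not taken from the design.

```python
import json
import re

SEVERITIES = {"debug", "info", "warning", "error", "critical"}
# {category}.{action}; deeper hierarchies such as agent.state.<x> are allowed.
EVENT_TYPE_RE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
# {component_type}:{instance_id}
SOURCE_RE = re.compile(r"^[a-z][a-z0-9_]*:[A-Za-z0-9_.-]+$")
MAX_PAYLOAD_BYTES = 4 * 1024  # the table recommends <= 4KB


def check_event(event_type, source, severity, payload, tags):
    """Return a list of convention violations (empty if the event conforms)."""
    problems = []
    if not EVENT_TYPE_RE.match(event_type):
        problems.append(f"event_type is not a dot-separated hierarchy: {event_type!r}")
    if not SOURCE_RE.match(source):
        problems.append(f"source is not component_type:instance_id: {source!r}")
    if severity not in SEVERITIES:
        problems.append(f"unknown severity: {severity!r}")
    try:
        json.loads(payload)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        problems.append("payload is not valid JSON")
    if len(payload.encode("utf-8")) > MAX_PAYLOAD_BYTES:
        problems.append("payload exceeds the 4KB recommendation")
    if not tags:
        problems.append("tags should include a category and identifiers")
    return problems


ok = check_event("agent.started", "agent:main", "info",
                 '{"agent_id":"main"}', ["lifecycle", "agent:main"])
bad = check_event("AgentStarted", "main", "fatal", "{oops", [])
```

Here `ok` is an empty list, while `bad` reports one problem for each of the five fields.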
Total: 118 event types across 13 categories.
11. Implementation Roadmap
Phase 1: Foundation (Week 1)
- Implement `EventWriter` utility class with severity filtering
- Add `observability.events` config section to `config.yaml`
- Integrate `EventWriter` into `Agent.__init__`
- Emit agent lifecycle events: `agent.starting`, `agent.ready`, `agent.stopped`, `agent.crash`
- Emit agent state events: `agent.state.*`
- Emit session events: `session.opened`, `session.response_sent`, `session.error`
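As a rough illustration of the first Phase 1 item, an `EventWriter` with severity filtering might look like the sketch below. The constructor arguments, the `sink` callable, and the boolean return value are assumptions made for this example; the actual class may differ.

```python
import json

# Ordered from least to most severe, matching the five levels in the schema.
SEVERITY_ORDER = ["debug", "info", "warning", "error", "critical"]


class EventWriter:
    """Hypothetical sketch: drops events below min_severity, forwards the rest."""

    def __init__(self, source, sink, min_severity="info"):
        self.source = source
        self.sink = sink  # assumed interface: any callable taking one row dict
        self.min_level = SEVERITY_ORDER.index(min_severity)

    def write(self, event_type, payload=None, severity="info", tags=None):
        if SEVERITY_ORDER.index(severity) < self.min_level:
            return False  # filtered out before reaching the stream
        self.sink({
            "event_type": event_type,
            "source": self.source,
            "severity": severity,
            "payload": json.dumps(payload or {}),
            "tags": list(tags or []),
        })
        return True


rows = []  # stands in for the pulsebot.events stream
writer = EventWriter("agent:main", sink=rows.append, min_severity="info")
writer.write("agent.starting", {"agent_id": "main"}, tags=["lifecycle"])
writer.write("session.response_sent", {"session_id": "abc123"}, severity="debug")  # dropped
writer.write("agent.ready", {"agent_id": "main"}, tags=["lifecycle"])
```

Injecting the sink keeps the filter logic testable without a running Timeplus instance; in production the sink would be whatever client call inserts into `pulsebot.events`.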
Phase 2: Tool & Hook Observability (Week 2)
- Pass `EventWriter` into `ToolExecutor`
- Emit tool events: `tool.call_started`, `tool.call_completed`, `tool.call_failed`, `tool.not_found`
- Emit hook events: `tool.hook_denied`, `tool.hook_modified`, all `hook.*` events
- Emit LLM high-level events: `llm.call_started`, `llm.call_completed`, `llm.call_failed`
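One possible way to guarantee that every `tool.call_started` is paired with either `tool.call_completed` or `tool.call_failed` is a context manager around the tool invocation. This is a sketch only: `traced_tool_call`, the `emit` callback signature, and the payload keys (`tool`, `duration_ms`, `error`) are assumptions, not the `ToolExecutor` API.

```python
import time
from contextlib import contextmanager


@contextmanager
def traced_tool_call(emit, tool_name, session_id):
    """Emit tool.call_started, then tool.call_completed or tool.call_failed."""
    base = {"tool": tool_name, "session_id": session_id}
    emit("tool.call_started", "info", dict(base))
    start = time.monotonic()
    try:
        yield
    except Exception as exc:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        emit("tool.call_failed", "error",
             {**base, "duration_ms": elapsed_ms, "error": str(exc)})
        raise  # the caller still sees the original exception
    else:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        emit("tool.call_completed", "info", {**base, "duration_ms": elapsed_ms})


events = []


def emit(event_type, severity, payload):
    # Stand-in for an EventWriter; records what would be written.
    events.append((event_type, severity, payload))


with traced_tool_call(emit, "shell", "abc123"):
    pass  # a successful tool run

try:
    with traced_tool_call(emit, "shell", "abc123"):
        raise RuntimeError("exit code 1")
except RuntimeError:
    pass
```

The same pattern would apply to the `llm.call_*` events, since they follow the same started/completed/failed shape.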
Phase 3: Memory, Skills & Tasks (Week 3)
- Add `EventWriter` to `MemoryManager`
- Emit memory events: `memory.search_*`, `memory.extraction_*`, `memory.stored`
- Emit skill events from `SkillLoader`: `skill.loaded`, `skill.hot_reloaded`
- Emit skill manager events: `skill.installed`, `skill.removed`
- Emit task events from scheduler skill and `TaskScheduler`
- Emit context building events from `ContextBuilder`
Phase 4: Multi-Agent & Channels (Week 4)
- Add `EventWriter` to `SubAgent`, `ManagerAgent`, `ProjectManager`
- Emit project events: `project.created`, `project.completed`, `project.failed`
- Emit channel events from `TelegramChannel` (and future channels)
- Emit system events: `system.startup`, `system.shutdown`, `system.heartbeat`
Phase 5: Dashboards & Alerts (Week 5)
- Create streaming SQL dashboard views for Timeplus UI / Grafana
- Implement stuck-agent detection alert as a materialized view
- Implement error-rate spike alert
- Document all streaming SQL recipes
- End-to-end integration test: emit events → query via SQL → verify
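The production alerts are meant to be streaming SQL, but the end-to-end test needs an independent reference to compare against. The Python sketch below illustrates one plausible definition of an error-rate spike using tumbling windows. The window size, the threshold, the minimum event count, and the function name are all assumptions chosen for illustration; the design does not specify them.

```python
from collections import defaultdict

FAILURE_SEVERITIES = {"error", "critical"}


def error_rate_spikes(events, window_seconds=60, threshold=0.2, min_events=5):
    """Return the start times of tumbling windows whose failure rate is too high.

    events: iterable of (epoch_seconds, severity) pairs.
    A window counts as a spike when it holds at least min_events events and
    the share of error/critical events exceeds threshold (assumed rule).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for ts, severity in events:
        window_start = int(ts // window_seconds) * window_seconds
        totals[window_start] += 1
        if severity in FAILURE_SEVERITIES:
            failures[window_start] += 1
    return sorted(
        w for w, n in totals.items()
        if n >= min_events and failures[w] / n > threshold
    )


sample = (
    [(i, "info") for i in range(9)] + [(9, "error")]      # window 0: 1 of 10 failed
    + [(60 + i, "error") for i in range(3)]               # window 60: 3 of 5 failed
    + [(63, "info"), (64, "info")]
)
spikes = error_rate_spikes(sample)
```

With the sample data only the window starting at 60 seconds is flagged. The `min_events` floor is there so that a single failure in an otherwise idle window does not trigger an alert.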