Event based observability #86

@gangtao

PulseBot Agent Observability: Detailed Design

Comprehensive Event Streaming via pulsebot.events

Version: 1.0
Author: Timeplus Engineering
Date: 2026-03-20


1. Motivation

PulseBot already logs LLM calls to pulsebot.llm_logs and tool executions to pulsebot.tool_logs. However, the agent's full behavior — lifecycle transitions, state changes, session management, multi-agent coordination, memory operations, hook verdicts, skill hot-reloading, channel connectivity, and scheduled task execution — remains invisible.

The pulsebot.events stream exists today but is barely used. This design proposes a unified, structured event taxonomy that turns pulsebot.events into the single source of truth for everything the agent does, enabling real-time dashboards, anomaly detection via streaming SQL, compliance audits, and post-mortem debugging.

Design Principles

  1. Every state transition emits an event. If something changes, we know about it.
  2. Events are cheap. The payload field carries JSON; producers just call events_writer.write().
  3. Streaming SQL is the query layer. All dashboards and alerts are streaming SQL over pulsebot.events.
  4. Correlation via session_id + agent_id. Every event can be traced back to a user session and a specific agent instance.
  5. Severity is meaningful. debug for verbose tracing, info for normal operations, warning for degraded states, error for failures, critical for unrecoverable situations.

2. Stream Schema

The existing pulsebot.events stream schema is sufficient and requires no DDL changes:

CREATE STREAM IF NOT EXISTS pulsebot.events (
    id          string DEFAULT uuid(),
    timestamp   datetime64(3) DEFAULT now64(3),
    event_type  string,         -- Hierarchical: 'agent.started', 'session.opened', etc.
    source      string,         -- Who emitted: 'agent:main', 'channel:telegram', 'skill:shell'
    severity    string,         -- 'debug', 'info', 'warning', 'error', 'critical'
    payload     string,         -- JSON: event-specific data
    tags        array(string)   -- Filterable labels: ['lifecycle', 'agent:main', 'session:abc123']
)
SETTINGS event_time_column='timestamp';

Conventions

| Field | Convention | Example |
|-------|------------|---------|
| event_type | Dot-separated hierarchy: {category}.{action} | agent.started, tool.hook_denied |
| source | {component_type}:{instance_id} | agent:main, channel:telegram, skill:shell |
| severity | Syslog-style, five levels | info |
| payload | Flat or shallow JSON, ≤ 4KB recommended | {"agent_id":"main","model":"claude-sonnet-4-20250514"} |
| tags | Include category + identifiers for fast filtering | ["lifecycle","agent:main"] |
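The conventions above can be checked mechanically before a row is written. The following is a minimal sketch, not part of the existing codebase: `build_event`, its parameters, and the regular expressions are hypothetical, and treating the recommended 4KB payload size as a hard limit is an assumption made only for illustration. The `id` and `timestamp` columns are omitted because the stream fills them by default.

```python
import json
import re

SEVERITIES = ("debug", "info", "warning", "error", "critical")
# Assumed patterns: lowercase dot-separated event types, "type:instance" sources.
EVENT_TYPE_RE = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+)+$")
SOURCE_RE = re.compile(r"^[a-z0-9_]+:\S+$")
MAX_PAYLOAD_BYTES = 4096  # the "≤ 4KB recommended" guideline, enforced here by assumption


def build_event(event_type, source, severity, payload, category, session_id=None):
    """Return a dict shaped like a pulsebot.events row, or raise ValueError."""
    if not EVENT_TYPE_RE.match(event_type):
        raise ValueError(f"event_type is not a dot-separated hierarchy: {event_type!r}")
    if not SOURCE_RE.match(source):
        raise ValueError(f"source is not component_type:instance_id: {source!r}")
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")

    payload_json = json.dumps(payload, separators=(",", ":"))
    if len(payload_json.encode("utf-8")) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds the recommended 4KB")

    # Tags carry the category plus identifiers so filters can avoid parsing JSON.
    tags = [category, source]
    if session_id is not None:
        tags.append(f"session:{session_id}")

    return {
        "event_type": event_type,
        "source": source,
        "severity": severity,
        "payload": payload_json,
        "tags": tags,
    }
```

Called as `build_event("agent.started", "agent:main", "info", {"agent_id": "main"}, category="lifecycle", session_id="abc123")`, the helper produces tags in the same shape as the schema comment: category first, then the emitting component, then the session.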

Total: 118 event types across 13 categories.


11. Implementation Roadmap

Phase 1: Foundation (Week 1)

  • Implement EventWriter utility class with severity filtering
  • Add observability.events config section to config.yaml
  • Integrate EventWriter into Agent.__init__
  • Emit agent lifecycle events: agent.starting, agent.ready, agent.stopped, agent.crash
  • Emit agent state events: agent.state.*
  • Emit session events: session.opened, session.response_sent, session.error
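As a rough illustration of the first Phase 1 item, an `EventWriter` with severity filtering might look like the sketch below. The constructor arguments, the `sink` callable (a stand-in for whatever actually inserts rows into pulsebot.events), and the boolean return value are all assumptions; the real class may differ.

```python
import json

# Ordering of the five severities used by the stream, lowest to highest.
SEVERITY_ORDER = {"debug": 0, "info": 1, "warning": 2, "error": 3, "critical": 4}


class EventWriter:
    """Hypothetical writer that drops events below a configured minimum severity."""

    def __init__(self, sink, source, min_severity="info"):
        if min_severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {min_severity!r}")
        self._sink = sink              # callable taking one row dict (assumed interface)
        self._source = source          # e.g. "agent:main"
        self._min_level = SEVERITY_ORDER[min_severity]

    def write(self, event_type, payload=None, severity="info", tags=None):
        """Emit one event. Returns True if written, False if filtered out."""
        if severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {severity!r}")
        if SEVERITY_ORDER[severity] < self._min_level:
            return False
        self._sink({
            "event_type": event_type,
            "source": self._source,
            "severity": severity,
            "payload": json.dumps(payload or {}),
            "tags": list(tags or []),
        })
        return True


# Usage: collect rows in a list instead of a real stream.
rows = []
writer = EventWriter(rows.append, source="agent:main", min_severity="info")
writer.write("agent.starting", {"agent_id": "main"}, tags=["lifecycle"])
writer.write("agent.ready", severity="debug")  # below the threshold, so not written
```

Filtering in the producer keeps debug-level tracing out of the stream entirely when it is not wanted, rather than relying on every query to exclude it.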

Phase 2: Tool & Hook Observability (Week 2)

  • Pass EventWriter into ToolExecutor
  • Emit tool events: tool.call_started, tool.call_completed, tool.call_failed, tool.not_found
  • Emit hook events: tool.hook_denied, tool.hook_modified, all hook.* events
  • Emit LLM high-level events: llm.call_started, llm.call_completed, llm.call_failed
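One way the tool events could be wired in is to wrap each tool invocation so that exactly one terminal event follows every `tool.call_started`. This is a sketch under stated assumptions: `run_tool_with_events`, the `emit(event_type, payload, severity)` callable, the `tools` dict, the payload fields, and the severity chosen for each event are all hypothetical and do not describe the actual `ToolExecutor` interface.

```python
import time


def run_tool_with_events(emit, tools, name, arguments):
    """Run tools[name](**arguments) and emit tool.* events around the call."""
    tool = tools.get(name)
    if tool is None:
        emit("tool.not_found", {"tool": name}, "warning")
        return None

    emit("tool.call_started", {"tool": name}, "info")
    started = time.monotonic()
    try:
        result = tool(**arguments)
    except Exception as exc:
        duration_ms = int((time.monotonic() - started) * 1000)
        emit(
            "tool.call_failed",
            {"tool": name, "error": str(exc), "duration_ms": duration_ms},
            "error",
        )
        raise  # the caller still sees the original exception

    duration_ms = int((time.monotonic() - started) * 1000)
    emit("tool.call_completed", {"tool": name, "duration_ms": duration_ms}, "info")
    return result
```

Re-raising after `tool.call_failed` keeps the event stream a side channel: error handling in the agent is unchanged, and the event only records what happened.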

Phase 3: Memory, Skills & Tasks (Week 3)

  • Add EventWriter to MemoryManager
  • Emit memory events: memory.search_*, memory.extraction_*, memory.stored
  • Emit skill events from SkillLoader: skill.loaded, skill.hot_reloaded
  • Emit skill manager events: skill.installed, skill.removed
  • Emit task events from scheduler skill and TaskScheduler
  • Emit context building events from ContextBuilder

Phase 4: Multi-Agent & Channels (Week 4)

  • Add EventWriter to SubAgent, ManagerAgent, ProjectManager
  • Emit project events: project.created, project.completed, project.failed
  • Emit channel events from TelegramChannel (and future channels)
  • Emit system events: system.startup, system.shutdown, system.heartbeat

Phase 5: Dashboards & Alerts (Week 5)

  • Create streaming SQL dashboard views for Timeplus UI / Grafana
  • Implement stuck-agent detection alert as a materialized view
  • Implement error-rate spike alert
  • Document all streaming SQL recipes
  • End-to-end integration test: emit events → query via SQL → verify
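For the end-to-end test, it may help to have a plain-Python reference computation to compare the streaming SQL result against. The sketch below computes a per-window error rate over in-memory event rows. The function names, the one-minute tumbling window, the 0.5 threshold, and counting both `error` and `critical` as errors are illustrative assumptions, not values taken from this design.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def error_rate_by_window(events, window=timedelta(minutes=1)):
    """Bucket events into fixed windows; return {window_start: error_fraction}."""
    totals, errors = Counter(), Counter()
    for ev in events:
        # timedelta // timedelta yields the integer index of the window.
        index = (ev["timestamp"] - EPOCH) // window
        start = EPOCH + index * window
        totals[start] += 1
        if ev["severity"] in ("error", "critical"):
            errors[start] += 1
    return {start: errors[start] / totals[start] for start in totals}


def spiking_windows(events, threshold=0.5, window=timedelta(minutes=1)):
    """Return the sorted window starts whose error fraction exceeds threshold."""
    rates = error_rate_by_window(events, window)
    return sorted(start for start, rate in rates.items() if rate > threshold)
```

A test could emit a known set of events, run the alert query, and assert that the windows it flags match `spiking_windows` applied to the same rows.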
