Event based observability

<html><head></head><body><h1>PulseBot Agent Observability: Detailed Design</h1>
<h2>Comprehensive Event Streaming via <code>pulsebot.events</code></h2>
<p><strong>Version:</strong> 1.0<br>
<strong>Author:</strong> Timeplus Engineering<br>
<strong>Date:</strong> 2026-03-20</p>
<hr>
<h2>1. Motivation</h2>
<p>PulseBot already logs LLM calls to <code>pulsebot.llm_logs</code> and tool executions to <code>pulsebot.tool_logs</code>. The rest of the agent's behavior remains invisible: lifecycle transitions, state changes, session management, multi-agent coordination, memory operations, hook verdicts, skill hot-reloading, channel connectivity, and scheduled task execution.</p>
<p>The <code>pulsebot.events</code> stream exists today but is barely used. This design proposes a unified, structured event taxonomy that turns <code>pulsebot.events</code> into the <strong>single source of truth</strong> for everything the agent does, enabling real-time dashboards, anomaly detection via streaming SQL, compliance audits, and post-mortem debugging.</p>
<h3>Design Principles</h3>
<ol>
<li><strong>Every state transition emits an event.</strong> If something changes, we know about it.</li>
<li><strong>Events are cheap.</strong> The <code>payload</code> field carries JSON; producers just call <code>events_writer.write()</code>.</li>
<li><strong>Streaming SQL is the query layer.</strong> All dashboards and alerts are streaming SQL over <code>pulsebot.events</code>.</li>
<li><strong>Correlation via <code>session_id</code> + <code>agent_id</code>.</strong> Every event can be traced back to a user session and a specific agent instance.</li>
<li><strong>Severity is meaningful.</strong> <code>debug</code> for verbose tracing, <code>info</code> for normal operations, <code>warning</code> for degraded states, <code>error</code> for failures, <code>critical</code> for unrecoverable situations.</li>
</ol>
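<p>As a rough illustration of principles 2, 4, and 5, the sketch below builds one row in the shape of the <code>pulsebot.events</code> schema. The function name <code>build_event</code>, its parameters, and the choice to copy <code>session_id</code> and <code>agent_id</code> into both the payload and the tags are assumptions made for this example, not part of the existing codebase.</p>
<pre><code class="language-python">import json
import uuid
from datetime import datetime, timezone

# The five severity levels named in the design principles.
SEVERITIES = ("debug", "info", "warning", "error", "critical")


def build_event(event_type, source, severity, payload,
                session_id, agent_id, category_tag):
    """Return one dict shaped like a pulsebot.events row (hypothetical helper)."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    # Assumption: correlation identifiers are stored in the payload and
    # repeated as tags so they can be filtered without parsing JSON.
    body = dict(payload, session_id=session_id, agent_id=agent_id)
    return {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "event_type": event_type,
        "source": source,
        "severity": severity,
        "payload": json.dumps(body, separators=(",", ":")),
        "tags": [category_tag, f"agent:{agent_id}", f"session:{session_id}"],
    }
</code></pre>
<p>A producer would pass the resulting dict to whatever writer inserts into the stream. In practice <code>id</code> and <code>timestamp</code> could also be omitted and left to the stream defaults.</p>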
<hr>
<h2>2. Stream Schema</h2>
<p>The existing <code>pulsebot.events</code> stream schema is sufficient and requires no DDL changes:</p>
<pre><code class="language-sql">CREATE STREAM IF NOT EXISTS pulsebot.events (
    id          string DEFAULT uuid(),
    timestamp   datetime64(3) DEFAULT now64(3),
    event_type  string,         -- Hierarchical: 'agent.started', 'session.opened', etc.
    source      string,         -- Who emitted: 'agent:main', 'channel:telegram', 'skill:shell'
    severity    string,         -- 'debug', 'info', 'warning', 'error', 'critical'
    payload     string,         -- JSON: event-specific data
    tags        array(string)   -- Filterable labels: ['lifecycle', 'agent:main', 'session:abc123']
)
SETTINGS event_time_column='timestamp';
</code></pre>
<h3>Conventions</h3>

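<table>
<thead>
<tr><th>Field</th><th>Convention</th><th>Example</th></tr>
</thead>
<tbody>
<tr><td><code>event_type</code></td><td>Dot-separated hierarchy: <code>{category}.{action}</code></td><td><code>agent.started</code>, <code>tool.hook_denied</code></td></tr>
<tr><td><code>source</code></td><td><code>{component_type}:{instance_id}</code></td><td><code>agent:main</code>, <code>channel:telegram</code>, <code>skill:shell</code></td></tr>
<tr><td><code>severity</code></td><td>syslog-style 5-level</td><td><code>info</code></td></tr>
<tr><td><code>payload</code></td><td>Flat or shallow JSON, ≤ 4KB recommended</td><td><code>{"agent_id":"main","model":"claude-sonnet-4-20250514"}</code></td></tr>
<tr><td><code>tags</code></td><td>Include category + identifiers for fast filtering</td><td><code>["lifecycle","agent:main"]</code></td></tr>
</tbody>
</table>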
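<p>The conventions above could be checked mechanically before an event is written. The following is a minimal sketch under stated assumptions: the helper name <code>convention_problems</code>, the exact regular expressions, and the reading of "4KB" as 4096 UTF-8 bytes are illustrative choices, not part of the design.</p>
<pre><code class="language-python">import json
import re

# Assumed patterns: lowercase dot-separated segments for event_type,
# and "{component_type}:{instance_id}" for source.
EVENT_TYPE_RE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
SOURCE_RE = re.compile(r"^[a-z][a-z0-9_]*:[A-Za-z0-9_.-]+$")
SEVERITIES = {"debug", "info", "warning", "error", "critical"}
MAX_PAYLOAD_BYTES = 4 * 1024  # assumption: "4KB" means 4096 bytes


def convention_problems(event):
    """Return a list of human-readable convention violations (empty if none)."""
    problems = []
    if not EVENT_TYPE_RE.match(event["event_type"]):
        problems.append("event_type is not a dot-separated hierarchy")
    if not SOURCE_RE.match(event["source"]):
        problems.append("source is not component_type:instance_id")
    if event["severity"] not in SEVERITIES:
        problems.append("severity is not one of the five levels")
    if len(event["payload"].encode("utf-8")) > MAX_PAYLOAD_BYTES:
        problems.append("payload exceeds the recommended 4KB")
    try:
        json.loads(event["payload"])
    except ValueError:
        problems.append("payload is not valid JSON")
    return problems
</code></pre>
<p>Because the size limit is only a recommendation, a caller might log these problems at <code>debug</code> severity rather than reject the event.</p>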


<p><strong>Total: 118 event types across 13 categories.</strong></p>
<hr>
<h2>11. Implementation Roadmap</h2>
<h3>Phase 1: Foundation (Week 1)</h3>
<ul>
<li>[ ] Implement <code>EventWriter</code> utility class with severity filtering</li>
<li>[ ] Add <code>observability.events</code> config section to <code>config.yaml</code></li>
<li>[ ] Integrate <code>EventWriter</code> into <code>Agent.__init__</code></li>
<li>[ ] Emit agent lifecycle events: <code>agent.starting</code>, <code>agent.ready</code>, <code>agent.stopped</code>, <code>agent.crash</code></li>
<li>[ ] Emit agent state events: <code>agent.state.*</code></li>
<li>[ ] Emit session events: <code>session.opened</code>, <code>session.response_sent</code>, <code>session.error</code></li>
</ul>
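<p>One possible shape for the <code>EventWriter</code> with severity filtering is sketched below. The constructor arguments, the <code>sink</code> callable (standing in for whatever client actually inserts rows into <code>pulsebot.events</code>), and the boolean return value are assumptions made for illustration.</p>
<pre><code class="language-python">import json

# Ordering of the five severity levels, lowest to highest.
SEVERITY_RANK = {"debug": 0, "info": 1, "warning": 2, "error": 3, "critical": 4}


class EventWriter:
    """Hypothetical writer that drops events below a configured severity."""

    def __init__(self, sink, source, min_severity="info"):
        self._sink = sink          # callable receiving one row dict
        self._source = source      # e.g. "agent:main"
        self._min_rank = SEVERITY_RANK[min_severity]

    def write(self, event_type, payload=None, severity="info", tags=()):
        """Emit the event if it passes the filter; return True when emitted."""
        rank = SEVERITY_RANK[severity]  # KeyError on an unknown level
        if self._min_rank > rank:
            return False
        # id and timestamp are omitted so the stream defaults apply.
        self._sink({
            "event_type": event_type,
            "source": self._source,
            "severity": severity,
            "payload": json.dumps(payload or {}),
            "tags": list(tags),
        })
        return True
</code></pre>
<p>With <code>min_severity="warning"</code>, a call such as <code>writer.write("agent.ready")</code> would be dropped, while <code>writer.write("agent.crash", severity="critical")</code> would reach the sink.</p>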
<h3>Phase 2: Tool &amp; Hook Observability (Week 2)</h3>
<ul>
<li>[ ] Pass <code>EventWriter</code> into <code>ToolExecutor</code></li>
<li>[ ] Emit tool events: <code>tool.call_started</code>, <code>tool.call_completed</code>, <code>tool.call_failed</code>, <code>tool.not_found</code></li>
<li>[ ] Emit hook events: <code>tool.hook_denied</code>, <code>tool.hook_modified</code>, all <code>hook.*</code> events</li>
<li>[ ] Emit LLM high-level events: <code>llm.call_started</code>, <code>llm.call_completed</code>, <code>llm.call_failed</code></li>
</ul>
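<p>The tool events in this phase could be emitted by a thin wrapper around each tool call, roughly as follows. The function <code>run_tool_with_events</code>, the <code>emit(event_type, severity, payload)</code> callback, the payload fields, and the severities chosen for each event are all assumptions for this sketch. The real <code>ToolExecutor</code> interface is not described in this document.</p>
<pre><code class="language-python">import time


def run_tool_with_events(emit, tools, name, arguments):
    """Call tools[name](**arguments) and report the outcome through emit()."""
    tool = tools.get(name)
    if tool is None:
        emit("tool.not_found", "warning", {"tool": name})
        return None

    emit("tool.call_started", "info", {"tool": name})
    started = time.monotonic()
    try:
        result = tool(**arguments)
    except Exception as exc:
        duration_ms = int((time.monotonic() - started) * 1000)
        emit("tool.call_failed", "error",
             {"tool": name, "error": str(exc), "duration_ms": duration_ms})
        raise  # the caller still sees the original failure
    duration_ms = int((time.monotonic() - started) * 1000)
    emit("tool.call_completed", "info",
         {"tool": name, "duration_ms": duration_ms})
    return result
</code></pre>
<p>Every started call ends in exactly one of <code>tool.call_completed</code> or <code>tool.call_failed</code>, which keeps per-tool success and latency queries simple.</p>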
<h3>Phase 3: Memory, Skills &amp; Tasks (Week 3)</h3>
<ul>
<li>[ ] Add <code>EventWriter</code> to <code>MemoryManager</code></li>
<li>[ ] Emit memory events: <code>memory.search_*</code>, <code>memory.extraction_*</code>, <code>memory.stored</code></li>
<li>[ ] Emit skill events from <code>SkillLoader</code>: <code>skill.loaded</code>, <code>skill.hot_reloaded</code></li>
<li>[ ] Emit skill manager events: <code>skill.installed</code>, <code>skill.removed</code></li>
<li>[ ] Emit task events from scheduler skill and <code>TaskScheduler</code></li>
<li>[ ] Emit context building events from <code>ContextBuilder</code></li>
</ul>
<h3>Phase 4: Multi-Agent &amp; Channels (Week 4)</h3>
<ul>
<li>[ ] Add <code>EventWriter</code> to <code>SubAgent</code>, <code>ManagerAgent</code>, <code>ProjectManager</code></li>
<li>[ ] Emit project events: <code>project.created</code>, <code>project.completed</code>, <code>project.failed</code></li>
<li>[ ] Emit channel events from <code>TelegramChannel</code> (and future channels)</li>
<li>[ ] Emit system events: <code>system.startup</code>, <code>system.shutdown</code>, <code>system.heartbeat</code></li>
</ul>
<h3>Phase 5: Dashboards &amp; Alerts (Week 5)</h3>
<ul>
<li>[ ] Create streaming SQL dashboard views for Timeplus UI / Grafana</li>
<li>[ ] Implement stuck-agent detection alert as a materialized view</li>
<li>[ ] Implement error-rate spike alert</li>
<li>[ ] Document all streaming SQL recipes</li>
<li>[ ] End-to-end integration test: emit events → query via SQL → verify</li>
</ul></body></html>

Event based observability #86
