
feat(observability): add Grafana dashboards, Prometheus scrape config + AlertManager rules#12

Open
Kures wants to merge 2 commits into nullclaw:main from Kures:feat/grafana-dashboards


Kures commented May 12, 2026

PR2 — feat(observability): add Grafana dashboards, Prometheus scrape config + AlertManager rules

Branch: feat/grafana-dashboards
Partially closes: P1-03 from reference/todo.md (Structured observability: request IDs, metrics endpoint, OTEL spans)
Side benefit: visually surfaces Gap 3 from
nullclaw/docs/integration-analysis.md
(see "Diagnosing integration gaps" in dashboards/README.md).

Why

/metrics already ships in src/api.zig:70 and exposes 11 Prometheus
counters from src/metrics.zig. But there is currently no consumer
for those numbers — they are emitted into the void. This PR is the
operator side: drop-in dashboards and scrape config so that any
NullBoiler deployment is observable out of the box.

What this PR ships

dashboards/
├── README.md                            quick-start + panel index
├── alerts/
│   └── nullboiler.rules.yml             8 AlertManager rules
├── grafana/
│   ├── nullboiler-overview.json         high-level operations view
│   └── nullboiler-workers.json          per-fleet worker view
└── prometheus/
    └── prometheus.yml                   minimal scrape config

Dashboard 1 — NullBoiler — Overview

Open this first when investigating "is something wrong?".

| Panel | Question it answers |
| --- | --- |
| HTTP requests/sec | Is anyone talking to us? |
| Runs created/sec | Is work flowing in? |
| Worker dispatch failure ratio (5m) | Are dispatches failing? |
| Callback failures/sec | Are run-lifecycle webhooks reaching consumers? |
| Run & step throughput | Created vs replayed vs claimed vs retried |
| Worker dispatch (success vs failure) | Stacked-area dispatch outcomes |
| Callbacks (sent vs failed) | Webhook delivery reliability |
| Reliability ratios | Idempotent replay ratio + step retry ratio with thresholds |
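The ratio panels boil down to plain rate() arithmetic. As a sketch (the metric names here are illustrative placeholders, not necessarily the exact names from src/metrics.zig), the dispatch failure ratio looks roughly like:

```promql
# Worker dispatch failure ratio over 5m. clamp_min keeps the
# denominator nonzero so idle clusters read 0 instead of NaN.
# Metric names are illustrative placeholders.
  rate(nullboiler_worker_dispatch_failure_total[5m])
/
  clamp_min(
      rate(nullboiler_worker_dispatch_success_total[5m])
    + rate(nullboiler_worker_dispatch_failure_total[5m]),
    1e-9
  )
```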

Dashboard 2 — NullBoiler — Workers

Open this when the Overview shows elevated dispatch failure ratio.

| Panel | Question it answers |
| --- | --- |
| Health checks/sec | Are health probes running? |
| Health-check failure ratio (5m) | Are workers responding to probes? |
| Dispatch success/failure /sec | Per-second outcomes |
| Health-check rate (probe vs failure) | Probe timeline |
| Dispatch outcomes (stacked bars) | Discrete dispatch outcomes |
| Failure ratios over time | The signal the circuit breaker reacts to |

Alert rules

dashboards/alerts/nullboiler.rules.yml ships 8 AlertManager rules under
nullboiler.health and nullboiler.flow groups. Thresholds line up
1:1 with the colour bands on the panels — tune one, mirror the other.
Validated with promtool check rules (SUCCESS: 8 rules found).

| Severity | Rules |
| --- | --- |
| critical (page) | NullBoilerInstanceDown, NullBoilerDispatchFailureRatioCritical |
| warning (ticket) | NullBoilerDispatchFailureRatioHigh, NullBoilerWorkerHealthDegraded, NullBoilerCallbackDeliveryDegraded |
| info | NullBoilerStepRetryRateElevated, NullBoilerNoTrafficForExtendedPeriod, NullBoilerIdempotentReplayRatioVeryHigh |
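A single rule from the file looks roughly like the sketch below. The expression, threshold, and durations are assumptions that mirror the table above, not the shipped file:

```yaml
groups:
  - name: nullboiler.health
    rules:
      - alert: NullBoilerDispatchFailureRatioCritical
        # Expression and 50% threshold are illustrative, not the shipped values.
        expr: |
          rate(nullboiler_worker_dispatch_failure_total[5m])
            / clamp_min(
                rate(nullboiler_worker_dispatch_success_total[5m])
                  + rate(nullboiler_worker_dispatch_failure_total[5m]),
                1e-9) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Worker dispatch failure ratio above 50% for 5m"
```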

Compatibility

  • Targets Grafana 10.x and 11.x (schemaVersion: 39).
  • Uses the standard ${DS_PROMETHEUS} template variable so dashboards
    import cleanly into any existing Grafana with a Prometheus
    datasource.
  • PromQL is plain rate() over counters with clamp_min to avoid
    divide-by-zero on idle clusters.
  • Alert rules tested with promtool check rules.

How it was authored

  • Metric names taken from src/metrics.zig (the
    renderPrometheus block).
  • Threshold colors chosen to be conservative — green by default,
    yellow at first sign of degradation, red at clear-incident levels.
    Operators should tune thresholds for their own fleet.
  • Stacked-bar panels (success+failure on the workers dashboard) chosen
    over line charts to make the ratio visible at a glance.

How to verify locally

# 1. Bring up nullboiler (docker-compose or zig build run).
# 2. Bring up Prometheus + Grafana sidecars.
docker run -d --name prom -p 9090:9090 \
  -v "$(pwd)/dashboards/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
  prom/prometheus

docker run -d --name grafana -p 3030:3000 \
  -e GF_AUTH_ANONYMOUS_ENABLED=true \
  -e GF_AUTH_ANONYMOUS_ORG_ROLE=Admin \
  grafana/grafana

# 3. Add a Prometheus datasource pointing at http://host.docker.internal:9090
# 4. Import each JSON via Dashboards -> Import -> Upload JSON
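Once the stack is up, you can also sanity-check /metrics without Grafana. Here is a minimal Python sketch that parses the Prometheus text exposition format for unlabelled counters; the sample metric names are illustrative, not the exact names exported by src/metrics.zig:

```python
# Minimal sketch: parse Prometheus text exposition output so counters
# can be inspected directly, e.g. from `curl http://localhost:PORT/metrics`.
# Handles only simple, unlabelled samples; metric names are illustrative.

def parse_exposition(text: str) -> dict[str, float]:
    """Return {metric_name: value} for simple (unlabelled) samples."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name, _, value = line.rpartition(" ")  # value is the last field
        samples[name] = float(value)
    return samples

sample = """\
# HELP nullboiler_http_requests_total Total HTTP requests
# TYPE nullboiler_http_requests_total counter
nullboiler_http_requests_total 42
nullboiler_worker_dispatch_failures_total 3
"""

counters = parse_exposition(sample)
print(counters["nullboiler_http_requests_total"])  # -> 42.0
```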

Bonus: visualises a real ecosystem gap

The dashboards are not just pretty throughput pictures. The Workers
view's Failure ratios over time panel cleanly separates two failure
modes that look identical in logs:

  • Health-check failure ratio — worker unreachable
  • Dispatch failure ratio — worker reachable but breaks the
    response contract

We brought the demo stack up end-to-end (NullBoiler + a stock
ghcr.io/nullclaw/nullclaw:latest worker pointed at a local Ollama
running llama3.2:1b) and the panel immediately diagnosed
Gap 3 from
nullclaw/docs/integration-analysis.md:
the worker's /webhook returns the documented async ack
{"status":"received"} instead of the synchronous
{"status":"ok","response":"..."} shape NullBoiler expects (see
docs/single-nullclaw-integration.md §5 "Required NullClaw response
contract"). On the dashboard this shows up as dispatch failure
ratio at 100% (red) while health-check failure ratio stays at 0%
(green)
— a one-glance triage answer.

That gap is in NullClaw's roadmap, not this PR; the value here is that
the dashboards make the gap observable in real deployments instead
of buried in step error_text.

Why this PR is low-risk to review

  • Pure JSON/YAML/Markdown. No Zig touched.
  • New top-level dashboards/ directory; no existing files modified.
  • The ${DS_PROMETHEUS} variable means dashboards do not hardcode any
    particular datasource UID and import cleanly into any Grafana with a
    Prometheus configured.

Follow-ups (not in this PR)

  • Histogram metrics for HTTP latency and worker dispatch duration in
    src/metrics.zig (would unlock percentile panels).
  • Per-worker labels on the dispatch counters (would unlock per-worker
    breakdown panels — currently the workers dashboard shows fleet-wide
    aggregates).
  • AlertManager rule files for common SLO breaches.

These are natural next steps for full P1-03 closure but are out of
scope for this PR.

Kures added 2 commits May 8, 2026 13:22
… + AlertManager rules

Partially addresses P1-03 from reference/todo.md ("structured
observability: request IDs, metrics endpoint, OTEL spans"). The
metrics endpoint already ships in src/metrics.zig and src/api.zig:70;
this contributes the operator side as a coherent stack.

Contents:
- dashboards/grafana/nullboiler-overview.json   - high-level ops view (8 panels)
- dashboards/grafana/nullboiler-workers.json    - per-fleet worker view (7 panels)
- dashboards/prometheus/prometheus.yml          - minimal scrape config
- dashboards/alerts/nullboiler.rules.yml        - 8 AlertManager rules
- dashboards/README.md                          - quick-start, panel index, alert table

Targets Grafana 10.x and 11.x (schemaVersion 39). PromQL is plain rate()
over the 11 counters exposed by /metrics, with clamp_min to avoid
divide-by-zero on idle clusters.

Both dashboards prompt for the Prometheus datasource via the
${DS_PROMETHEUS} template variable so they import cleanly into existing setups.

Alert rules pair 1:1 with the dashboards: thresholds match the colour
bands on the panels so dashboard and pager tell the same story.
Validates clean with `promtool check rules`.

Side benefit: the Workers dashboard's "Failure ratios over time" panel
visualises Gap 3 from nullclaw/docs/integration-analysis.md
(HIGH PRIORITY) when a stock nullclaw worker is wired up via /webhook.
See "Diagnosing integration gaps" in dashboards/README.md.

Future work (not part of this PR):
- histogram metrics for HTTP and dispatch latency
- per-worker labels on dispatch counters
- recording rules + Grafana alerting integration

Companion to the upcoming feat/observability-gauges PR which adds the
runs_in_flight, steps_in_flight, workers_healthy, and drain_mode
gauges to /metrics. These panels light up once that PR lands; until
then they show 'No data' against any image that does not yet expose
the gauges.

Layout: 4 stat panels in a new bottom row at gridPos y=20. Workers
healthy uses a red-on-0 threshold so a dead worker pool is visible at
a glance.

Note: this commit also touches a few previously-inline arrays
(e.g. "calcs": ["lastNotNull"]) that the JSON formatter expanded to
multi-line format. No semantic change; reviewers can collapse those
hunks visually.
