
feat(observability): add Grafana dashboards, Prometheus scrape config + AlertManager rules#12

Open
Kures wants to merge 2 commits into nullclaw:main from Kures:feat/grafana-dashboards


Kures commented May 12, 2026

PR2 — feat(observability): add Grafana dashboards, Prometheus scrape config + AlertManager rules

Branch: feat/grafana-dashboards
Partially closes: P1-03 from reference/todo.md (Structured observability: request IDs, metrics endpoint, OTEL spans)
Side benefit: visually surfaces Gap 3 from
nullclaw/docs/integration-analysis.md
(see "Diagnosing integration gaps" in dashboards/README.md).

Why

/metrics already ships in src/api.zig:70 and exposes 11 Prometheus
counters from src/metrics.zig. But there is currently no consumer
for those numbers — they are emitted into the void. This PR is the
operator side: drop-in dashboards and scrape config so that any
NullBoiler deployment is observable out of the box.

What this PR ships

dashboards/
├── README.md                            quick-start + panel index
├── alerts/
│   └── nullboiler.rules.yml             8 AlertManager rules
├── grafana/
│   ├── nullboiler-overview.json         high-level operations view
│   └── nullboiler-workers.json          per-fleet worker view
└── prometheus/
    └── prometheus.yml                   minimal scrape config

Dashboard 1 — NullBoiler — Overview

Open this first when investigating "is something wrong?".

| Panel | Question it answers |
| --- | --- |
| HTTP requests/sec | Is anyone talking to us? |
| Runs created/sec | Is work flowing in? |
| Worker dispatch failure ratio (5m) | Are dispatches failing? |
| Callback failures/sec | Are run-lifecycle webhooks reaching consumers? |
| Run & step throughput | Created vs replayed vs claimed vs retried |
| Worker dispatch (success vs failure) | Stacked-area dispatch outcomes |
| Callbacks (sent vs failed) | Webhook delivery reliability |
| Reliability ratios | Idempotent replay ratio + step retry ratio with thresholds |
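The ratio panels boil down to plain rate() arithmetic. As a sketch (the metric names here are illustrative placeholders, not necessarily the exact names from src/metrics.zig), the dispatch failure ratio looks roughly like:

```promql
# Worker dispatch failure ratio over 5m. clamp_min keeps the
# denominator nonzero so idle clusters read 0 instead of NaN.
# Metric names are illustrative placeholders.
  rate(nullboiler_worker_dispatch_failure_total[5m])
/
  clamp_min(
      rate(nullboiler_worker_dispatch_success_total[5m])
    + rate(nullboiler_worker_dispatch_failure_total[5m]),
    1e-9
  )
```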

Dashboard 2 — NullBoiler — Workers

Open this when the Overview shows elevated dispatch failure ratio.

| Panel | Question it answers |
| --- | --- |
| Health checks/sec | Are health probes running? |
| Health-check failure ratio (5m) | Are workers responding to probes? |
| Dispatch success/failure /sec | Per-second outcomes |
| Health-check rate (probe vs failure) | Probe timeline |
| Dispatch outcomes (stacked bars) | Discrete dispatch outcomes |
| Failure ratios over time | The signal the circuit breaker reacts to |

Alert rules

dashboards/alerts/nullboiler.rules.yml ships 8 AlertManager rules under
nullboiler.health and nullboiler.flow groups. Thresholds line up
1:1 with the colour bands on the panels — tune one, mirror the other.
Validated with promtool check rules (SUCCESS: 8 rules found).

| Severity | Rules |
| --- | --- |
| critical (page) | NullBoilerInstanceDown, NullBoilerDispatchFailureRatioCritical |
| warning (ticket) | NullBoilerDispatchFailureRatioHigh, NullBoilerWorkerHealthDegraded, NullBoilerCallbackDeliveryDegraded |
| info | NullBoilerStepRetryRateElevated, NullBoilerNoTrafficForExtendedPeriod, NullBoilerIdempotentReplayRatioVeryHigh |
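A single rule from the file looks roughly like the sketch below. The expression, threshold, and durations are assumptions that mirror the table above, not the shipped file:

```yaml
groups:
  - name: nullboiler.health
    rules:
      - alert: NullBoilerDispatchFailureRatioCritical
        # Expression and 50% threshold are illustrative, not the shipped values.
        expr: |
          rate(nullboiler_worker_dispatch_failure_total[5m])
            / clamp_min(
                rate(nullboiler_worker_dispatch_success_total[5m])
                  + rate(nullboiler_worker_dispatch_failure_total[5m]),
                1e-9) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Worker dispatch failure ratio above 50% for 5m"
```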

Compatibility

  • Targets Grafana 10.x and 11.x (schemaVersion: 39).
  • Uses the standard ${DS_PROMETHEUS} template variable so dashboards
    import cleanly into any existing Grafana with a Prometheus
    datasource.
  • PromQL is plain rate() over counters with clamp_min to avoid
    divide-by-zero on idle clusters.
  • Alert rules tested with promtool check rules.

How it was authored

  • Metric names taken from src/metrics.zig (the
    renderPrometheus block).
  • Threshold colors chosen to be conservative — green by default,
    yellow at first sign of degradation, red at clear-incident levels.
    Operators should tune thresholds for their own fleet.
  • Stacked-bar panels (success+failure on the workers dashboard) chosen
    over line charts to make the ratio visible at a glance.

How to verify locally

# 1. Bring up nullboiler (docker-compose or zig build run).
# 2. Bring up Prometheus + Grafana sidecars.
docker run -d --name prom -p 9090:9090 \
  -v "$(pwd)/dashboards/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
  prom/prometheus

docker run -d --name grafana -p 3030:3000 \
  -e GF_AUTH_ANONYMOUS_ENABLED=true \
  -e GF_AUTH_ANONYMOUS_ORG_ROLE=Admin \
  grafana/grafana

# 3. Add a Prometheus datasource pointing at http://host.docker.internal:9090
# 4. Import each JSON via Dashboards -> Import -> Upload JSON
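Once the stack is up, you can also sanity-check /metrics without Grafana. Here is a minimal Python sketch that parses the Prometheus text exposition format for unlabelled counters; the sample metric names are illustrative, not the exact names exported by src/metrics.zig:

```python
# Minimal sketch: parse Prometheus text exposition output so counters
# can be inspected directly, e.g. from `curl http://localhost:PORT/metrics`.
# Handles only simple, unlabelled samples; metric names are illustrative.

def parse_exposition(text: str) -> dict[str, float]:
    """Return {metric_name: value} for simple (unlabelled) samples."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name, _, value = line.rpartition(" ")  # value is the last field
        samples[name] = float(value)
    return samples

sample = """\
# HELP nullboiler_http_requests_total Total HTTP requests
# TYPE nullboiler_http_requests_total counter
nullboiler_http_requests_total 42
nullboiler_worker_dispatch_failures_total 3
"""

counters = parse_exposition(sample)
print(counters["nullboiler_http_requests_total"])  # -> 42.0
```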

Bonus: visualises a real ecosystem gap

The dashboards are not just pretty throughput pictures. The Workers
view's Failure ratios over time panel cleanly separates two failure
modes that look identical in logs:

  • Health-check failure ratio — worker unreachable
  • Dispatch failure ratio — worker reachable but breaks the
    response contract

We brought the demo stack up end-to-end (NullBoiler + a stock
ghcr.io/nullclaw/nullclaw:latest worker pointed at a local Ollama
running llama3.2:1b) and the panel immediately diagnosed
Gap 3 from
nullclaw/docs/integration-analysis.md:
the worker's /webhook returns the documented async ack
{"status":"received"} instead of the synchronous
{"status":"ok","response":"..."} shape NullBoiler expects (see
docs/single-nullclaw-integration.md §5 "Required NullClaw response
contract"). On the dashboard this shows up as dispatch failure
ratio at 100% (red) while health-check failure ratio stays at 0%
(green)
— a one-glance triage answer.

That gap is in NullClaw's roadmap, not this PR; the value here is that
the dashboards make the gap observable in real deployments instead
of buried in step error_text.

Why this PR is low-risk to review

  • Pure JSON/YAML/Markdown. No Zig touched.
  • New top-level dashboards/ directory; no existing files modified.
  • The ${DS_PROMETHEUS} variable means dashboards do not hardcode any
    particular datasource UID and import cleanly into any Grafana with a
    Prometheus configured.

Follow-ups (not in this PR)

  • Histogram metrics for HTTP latency and worker dispatch duration in
    src/metrics.zig (would unlock percentile panels).
  • Per-worker labels on the dispatch counters (would unlock per-worker
    breakdown panels — currently the workers dashboard shows fleet-wide
    aggregates).
  • AlertManager rule files for common SLO breaches.

These are natural next steps for full P1-03 closure but are out of
scope for this PR.

Kures added 2 commits May 8, 2026 13:22
… + AlertManager rules

Partially addresses P1-03 from reference/todo.md ("structured
observability: request IDs, metrics endpoint, OTEL spans"). The
metrics endpoint already ships in src/metrics.zig and src/api.zig:70;
this contributes the operator side as a coherent stack.

Contents:
- dashboards/grafana/nullboiler-overview.json   - high-level ops view (8 panels)
- dashboards/grafana/nullboiler-workers.json    - per-fleet worker view (7 panels)
- dashboards/prometheus/prometheus.yml          - minimal scrape config
- dashboards/alerts/nullboiler.rules.yml        - 8 AlertManager rules
- dashboards/README.md                          - quick-start, panel index, alert table

Targets Grafana 10.x and 11.x (schemaVersion 39). PromQL is plain rate()
over the 11 counters exposed by /metrics, with clamp_min to avoid
divide-by-zero on idle clusters.

Both dashboards prompt for the Prometheus datasource via the
${DS_PROMETHEUS} template variable so they import cleanly into existing setups.

Alert rules pair 1:1 with the dashboards: thresholds match the colour
bands on the panels so dashboard and pager tell the same story.
Validates clean with `promtool check rules`.

Side benefit: the Workers dashboard's "Failure ratios over time" panel
visualises Gap 3 from nullclaw/docs/integration-analysis.md
(HIGH PRIORITY) when a stock nullclaw worker is wired up via /webhook.
See "Diagnosing integration gaps" in dashboards/README.md.

Future work (not part of this PR):
- histogram metrics for HTTP and dispatch latency
- per-worker labels on dispatch counters
- recording rules + Grafana alerting integration

Companion to the upcoming feat/observability-gauges PR which adds the
runs_in_flight, steps_in_flight, workers_healthy, and drain_mode
gauges to /metrics. These panels light up once that PR lands; until
then they show 'No data' against any image that does not yet expose
the gauges.

Layout: 4 stat panels in a new bottom row at gridPos y=20. Workers
healthy uses a red-on-0 threshold so a dead worker pool is visible at
a glance.

Note: this commit also touches a few previously-inline arrays
(e.g. "calcs": ["lastNotNull"]) that the JSON formatter expanded to
multi-line format. No semantic change; reviewers can collapse those
hunks visually.
