feat(observability): add Grafana dashboards, Prometheus scrape config + AlertManager rules #12
Open
Kures wants to merge 2 commits into
… + AlertManager rules
Partially addresses P1-03 from reference/todo.md ("structured
observability: request IDs, metrics endpoint, OTEL spans"). The
metrics endpoint already ships in src/metrics.zig and src/api.zig:70;
this contributes the operator side as a coherent stack.
Contents:
- dashboards/grafana/nullboiler-overview.json - high-level ops view (8 panels)
- dashboards/grafana/nullboiler-workers.json - per-fleet worker view (7 panels)
- dashboards/prometheus/prometheus.yml - minimal scrape config
- dashboards/alerts/nullboiler.rules.yml - 8 AlertManager rules
- dashboards/README.md - quick-start, panel index, alert table
Targets Grafana 10.x and 11.x (schemaVersion 39). PromQL is plain rate()
over the 11 counters exposed by /metrics, with clamp_min to avoid
divide-by-zero on idle clusters.
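As a rough illustration of why the clamp_min guard matters, the sketch below mirrors the PromQL pattern in Python. The function names and the floor value of 1e-9 are assumptions made for this example; they are not taken from the dashboard JSON.

```python
def clamp_min(value: float, floor: float) -> float:
    """Mirror PromQL's clamp_min: never return less than `floor`."""
    return max(value, floor)


def failure_ratio(failed_rate: float, total_rate: float, floor: float = 1e-9) -> float:
    """Failures divided by total, with the denominator clamped away from zero.

    On an idle cluster both rates are 0, so the result is 0.0 rather than a
    division error here (or NaN in PromQL). The floor is a hypothetical value.
    """
    return failed_rate / clamp_min(total_rate, floor)


if __name__ == "__main__":
    print(failure_ratio(0.0, 0.0))  # idle cluster: prints 0.0
    print(failure_ratio(0.5, 2.0))  # a quarter of dispatches failing: prints 0.25
```

A small floor keeps the ratio at zero when there is no traffic while leaving it effectively unchanged once real traffic arrives.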
Both dashboards prompt for the Prometheus datasource via DS_PROMETHEUS
template variable so they import cleanly into existing setups.
Alert rules pair 1:1 with the dashboards: thresholds match the colour
bands on the panels so dashboard and pager tell the same story.
Validates clean with `promtool check rules`.
Side benefit: the Workers dashboard's "Failure ratios over time" panel
visualises Gap 3 from nullclaw/docs/integration-analysis.md
(HIGH PRIORITY) when a stock nullclaw worker is wired up via /webhook.
See "Diagnosing integration gaps" in dashboards/README.md.
Future work (not part of this PR):
- histogram metrics for HTTP and dispatch latency
- per-worker labels on dispatch counters
- recording rules + Grafana alerting integration
Companion to the upcoming feat/observability-gauges PR, which adds the runs_in_flight, steps_in_flight, workers_healthy, and drain_mode gauges to /metrics. These panels light up once that PR lands; until then they show 'No data' against any image that does not yet expose the gauges.
Layout: 4 stat panels in a new bottom row at gridPos y=20. Workers healthy uses a red-on-0 threshold so a dead worker pool is visible at a glance.
Note: this commit also touches a few previously-inline arrays (e.g. "calcs": ["lastNotNull"]) that the JSON formatter expanded to multi-line format. No semantic change; reviewers can collapse those hunks visually.
PR2: feat(observability): add Grafana dashboards, Prometheus scrape config + AlertManager rules
Branch: feat/grafana-dashboards
Partially closes: P1-03 from reference/todo.md (Structured observability: request IDs, metrics endpoint, OTEL spans)
Side benefit: visually surfaces Gap 3 from nullclaw/docs/integration-analysis.md (see "Diagnosing integration gaps" in dashboards/README.md).
Why
/metrics already ships in src/api.zig:70 and exposes 11 Prometheus
counters from src/metrics.zig. But nothing currently consumes those
numbers, so they are emitted into the void. This PR is the
operator side: drop-in dashboards and scrape config so that any
NullBoiler deployment is observable out of the box.
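To make "a consumer for those numbers" concrete, here is a minimal sketch of reading counters out of a Prometheus text-format payload. The metric names in SAMPLE are hypothetical placeholders, not the 11 counters that src/metrics.zig actually emits, and the parser deliberately skips labelled samples to stay short.

```python
def parse_counters(text: str) -> dict:
    """Parse un-labelled samples from Prometheus text exposition format.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. Samples that
    carry labels are ignored to keep this sketch small.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue
        samples[parts[0]] = float(parts[1])
    return samples


# Hypothetical payload; the names below are NOT real NullBoiler metrics.
SAMPLE = """\
# HELP example_requests_total Hypothetical counter for illustration.
# TYPE example_requests_total counter
example_requests_total 42
example_failures_total 3
example_labelled_total{worker="a"} 7
"""

if __name__ == "__main__":
    for name, value in parse_counters(SAMPLE).items():
        print(name, value)
```

In a real deployment Prometheus does this scraping itself; the sketch only shows the shape of the data the dashboards are built on.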
What this PR ships
Dashboard 1 — NullBoiler — Overview
Open this first when investigating "is something wrong?".
Dashboard 2 — NullBoiler — Workers
Open this when the Overview shows elevated dispatch failure ratio.
Alert rules
alerts/nullboiler.rules.yml ships 8 AlertManager rules under the nullboiler.health and nullboiler.flow groups. Thresholds line up 1:1 with the colour bands on the panels: tune one, mirror the other.
Validated with promtool check rules (SUCCESS: 8 rules found).
- NullBoilerInstanceDown, NullBoilerDispatchFailureRatioCritical
- NullBoilerDispatchFailureRatioHigh, NullBoilerWorkerHealthDegraded, NullBoilerCallbackDeliveryDegraded
- NullBoilerStepRetryRateElevated, NullBoilerNoTrafficForExtendedPeriod, NullBoilerIdempotentReplayRatioVeryHigh
Compatibility
- Grafana 10.x and 11.x (schemaVersion: 39).
- ${DS_PROMETHEUS} template variable so dashboards import cleanly into any existing Grafana with a Prometheus datasource.
- PromQL is plain rate() over counters with clamp_min to avoid divide-by-zero on idle clusters.
- Alert rules validate with promtool check rules.
How it was authored
- … src/metrics.zig (the renderPrometheus block).
- … yellow at first sign of degradation, red at clear-incident levels. Operators should tune thresholds for their own fleet.
- … over line charts to make the ratio visible at a glance.
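The yellow/red banding and its 1:1 pairing with alert rules can be sketched as a single shared threshold table. The threshold values (5% and 25%) and the severity names below are assumptions for illustration only; the real values live in the dashboard JSON and the rules file.

```python
from typing import Optional

# Hypothetical thresholds: NOT the values used by the shipped dashboards.
YELLOW_AT = 0.05  # first sign of degradation
RED_AT = 0.25     # clear-incident level


def colour_band(ratio: float) -> str:
    """Map a failure ratio to the panel colour band."""
    if ratio >= RED_AT:
        return "red"
    if ratio >= YELLOW_AT:
        return "yellow"
    return "green"


def alert_severity(ratio: float) -> Optional[str]:
    """Derive the alert severity from the same bands, so panel and pager agree.

    The severity names are assumed for this sketch.
    """
    return {"red": "critical", "yellow": "warning", "green": None}[colour_band(ratio)]


if __name__ == "__main__":
    for r in (0.0, 0.10, 0.50):
        print(r, colour_band(r), alert_severity(r))
```

Keeping both mappings derived from one pair of constants is one way to honour the "tune one, mirror the other" rule when thresholds are adjusted for a given fleet.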
How to verify locally
Bonus: visualises a real ecosystem gap
The dashboards are not just throughput pretty-pictures. The Workers
view's Failure ratios over time panel cleanly separates two failure
modes that look identical in logs:
- … response contract
We brought the demo stack up end-to-end (NullBoiler + a stock
ghcr.io/nullclaw/nullclaw:latest worker pointed at a local Ollama
running llama3.2:1b) and the panel immediately diagnosed Gap 3 from
nullclaw/docs/integration-analysis.md: the worker's /webhook returns
the documented async ack {"status":"received"} instead of the
synchronous {"status":"ok","response":"..."} shape NullBoiler expects
(see docs/single-nullclaw-integration.md §5 "Required NullClaw response
contract"). On the dashboard this shows up as dispatch failure
ratio at 100% (red) while health-check failure ratio stays at 0%
(green), which gives a one-glance triage answer.
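The difference between the two response shapes can be illustrated with a small classifier. This is a sketch based only on the two JSON bodies quoted above; the function name and return labels are hypothetical, and it is not NullBoiler's actual validation logic.

```python
import json


def classify_webhook_response(body: str) -> str:
    """Return 'sync_ok', 'async_ack', or 'unknown' for a /webhook response body.

    'sync_ok'   -> {"status": "ok", "response": "<text>"}
    'async_ack' -> {"status": "received"}
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"
    status = payload.get("status")
    if status == "ok" and isinstance(payload.get("response"), str):
        return "sync_ok"
    if status == "received":
        return "async_ack"
    return "unknown"


if __name__ == "__main__":
    print(classify_webhook_response('{"status":"received"}'))
    print(classify_webhook_response('{"status":"ok","response":"hello"}'))
```

A dispatcher that accepts only the synchronous shape would count every async ack as a failed dispatch, which matches the 100% dispatch failure ratio described above.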
That gap is in NullClaw's roadmap, not this PR; the value here is that
the dashboards make the gap observable in real deployments instead
of buried in step error_text.
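The triage that the panel supports, separating a healthy worker that fails dispatch from an unhealthy one, can be sketched as a tiny decision function. The 0.5 threshold and the result labels are assumptions chosen for this example.

```python
def triage(dispatch_failure_ratio: float,
           health_failure_ratio: float,
           threshold: float = 0.5) -> str:
    """Separate the two failure modes using the two ratios on the panel.

    `threshold` is a hypothetical cut-off, not a value from the dashboards.
    """
    dispatch_bad = dispatch_failure_ratio >= threshold
    health_bad = health_failure_ratio >= threshold
    if health_bad:
        # Health checks fail: the worker is down or unreachable.
        return "worker_unhealthy"
    if dispatch_bad:
        # Worker answers health checks but dispatches fail: look at the
        # response contract before looking at the network.
        return "contract_mismatch"
    return "ok"


if __name__ == "__main__":
    print(triage(1.0, 0.0))  # the Gap 3 case: prints contract_mismatch
```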
Why review attention is low-risk
- … dashboards/ directory; no existing files modified.
- ${DS_PROMETHEUS} variable means dashboards do not hardcode any particular datasource UID and import cleanly into any Grafana with a Prometheus datasource configured.
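For setups that load dashboards from files rather than through the import dialog, one possible approach is to bind the placeholder to a concrete datasource UID before loading. The sketch below assumes that workflow; the DASHBOARD fragment and the UID "my-prom-uid" are hypothetical and not copied from the PR.

```python
import json

PLACEHOLDER = "${DS_PROMETHEUS}"


def bind_datasource(node, uid: str):
    """Recursively replace the placeholder in every string of a parsed dashboard.

    Returns a new structure; the input is left unmodified.
    """
    if isinstance(node, dict):
        return {key: bind_datasource(value, uid) for key, value in node.items()}
    if isinstance(node, list):
        return [bind_datasource(value, uid) for value in node]
    if isinstance(node, str):
        return node.replace(PLACEHOLDER, uid)
    return node


# Hypothetical fragment shaped like a Grafana panel definition.
DASHBOARD = json.loads(
    '{"panels": [{"title": "Example",'
    ' "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}}]}'
)

if __name__ == "__main__":
    bound = bind_datasource(DASHBOARD, "my-prom-uid")
    print(json.dumps(bound, indent=2))
```

When importing through the Grafana UI this step is unnecessary, since the import dialog prompts for the datasource.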
Follow-ups (not in this PR)
- … src/metrics.zig (would unlock percentile panels).
- … breakdown panels; currently the workers dashboard shows fleet-wide
  aggregates).
These are natural next steps for full P1-03 closure but are out of
scope for this PR.