feat(observability): add live-state gauges to /metrics (#13)
Open
Kures wants to merge 1 commit into
Adds four gauge metrics derived from the source-of-truth state at
Prometheus scrape time, complementing the existing 11 cumulative
counters:
- nullboiler_runs_in_flight: runs in 'running' or 'pending' state
- nullboiler_steps_in_flight: steps currently 'running'
- nullboiler_workers_healthy: workers in 'active' state
- nullboiler_drain_mode: 1 when drain is active, 0 otherwise
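Rendered in Prometheus exposition format, the gauge half of the payload would look roughly like this (values are illustrative; any HELP lines are omitted):

```
# TYPE nullboiler_runs_in_flight gauge
nullboiler_runs_in_flight 3
# TYPE nullboiler_steps_in_flight gauge
nullboiler_steps_in_flight 1
# TYPE nullboiler_workers_healthy gauge
nullboiler_workers_healthy 2
# TYPE nullboiler_drain_mode gauge
nullboiler_drain_mode 0
```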
Implementation samples the SoT — DB COUNT(*) queries and the existing
drain atomic — at exposition time rather than maintaining inc/dec
wiring across the engine and store hot-paths. Gauge values are
accurate by construction and the patch makes zero changes to
business-logic call sites.
Files touched:
src/metrics.zig
- new Metrics.Sample struct with the four gauge values
- new renderPrometheusWithSample(allocator, sample) that emits the
11 counters followed by 4 # TYPE ... gauge lines
- existing renderPrometheus(allocator) preserved as a thin wrapper
so all 340 existing tests continue to pass unchanged
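The thin-wrapper shape described above could look like this sketch; the exact signature, return type, and error set are assumptions, not copied from the repo:

```zig
// Hypothetical sketch: the old entry point delegates to the new variant
// with an all-zero Sample, so existing callers and the 340 tests see
// unchanged counter output. Signature details are assumptions.
pub fn renderPrometheus(allocator: std.mem.Allocator) ![]u8 {
    return renderPrometheusWithSample(allocator, std.mem.zeroes(Metrics.Sample));
}
```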
src/store.zig
- countRunsInFlight(): SELECT COUNT(*) FROM runs WHERE status IN
  ('running', 'pending')
- countStepsRunning(): SELECT COUNT(*) FROM steps WHERE status =
  'running'
- countWorkersByStatus(status): generic, used here with 'active'
matches the existing prepare/finalize idiom used by
countStepsByStatus and countRunningStepsByWorker in the same file
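As a sketch of how countRunsInFlight might follow that prepare/finalize idiom — the wrapper method names (prepare, step, columnInt64, finalize) are assumptions mirroring the description, not the repo's actual API:

```zig
// Hypothetical sketch of the single-row COUNT(*) fetch. Exact method
// names and error handling in store.zig may differ.
pub fn countRunsInFlight(self: *Store) !i64 {
    var stmt = try self.db.prepare(
        "SELECT COUNT(*) FROM runs WHERE status IN ('running', 'pending')",
    );
    defer stmt.finalize();
    _ = try stmt.step(); // COUNT(*) always yields exactly one row
    return stmt.columnInt64(0);
}
```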
src/api.zig
- handleMetrics now calls computeGaugeSample(ctx) and passes the
result into renderPrometheusWithSample
- computeGaugeSample falls back to 0 on any DB error so a flaky
query never poisons the entire /metrics response — the counter
half stays valid
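The fallback behavior could be sketched as follows; the Ctx fields and the atomic load API are assumptions based on the description of the existing drain atomic:

```zig
// Hypothetical sketch: each gauge independently falls back to 0 on a DB
// error, so one flaky query cannot poison the rest of the response.
fn computeGaugeSample(ctx: *Ctx) Metrics.Sample {
    return .{
        .runs_in_flight = ctx.store.countRunsInFlight() catch 0,
        .steps_in_flight = ctx.store.countStepsRunning() catch 0,
        .workers_healthy = ctx.store.countWorkersByStatus("active") catch 0,
        .drain_mode = if (ctx.drain.load(.seq_cst)) 1 else 0,
    };
}
```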
Validation:
zig build test --summary all
Build Summary: 9/9 steps succeeded; 340/340 tests passed
Live smoke test against a running binary built from this branch is
blocked by an unrelated regression on main (filed separately): every
HTTP response is dropped before reaching the client because writer
flush succeeds but bytes never leave userspace. The published
ghcr.io/nullclaw/nullboiler:2026.3.2 image (built before that
regression) does not yet expose these gauges, so end-to-end smoke
will be possible after the upstream HTTP fix lands.
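Once the upstream HTTP fix lands, a smoke check along these lines could verify the new gauges. The port, the grep filter, and the canned payload below are illustrative assumptions, not part of this PR:

```shell
# Against a live binary this would be: curl -s localhost:8080/metrics
# (port 8080 is an assumption). A canned payload stands in here so the
# check itself can be demonstrated offline.
scrape='# TYPE nullboiler_runs_in_flight gauge
nullboiler_runs_in_flight 3
# TYPE nullboiler_steps_in_flight gauge
nullboiler_steps_in_flight 1
# TYPE nullboiler_workers_healthy gauge
nullboiler_workers_healthy 2
# TYPE nullboiler_drain_mode gauge
nullboiler_drain_mode 0'

# Count the gauge TYPE headers; the PR adds exactly four.
printf '%s\n' "$scrape" | grep -c '^# TYPE nullboiler_.* gauge$'   # prints 4
```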
Closes the live-state half of P1-03 alongside the dashboard updates
on feat/grafana-dashboards.
## PR4 — feat(observability): add live-state gauges to /metrics

Branch: `feat/observability-gauges`
Closes: the live-state half of P1-03 from `reference/todo.md` (alongside PR2's dashboards)

## Why

The existing `/metrics` endpoint exposes 11 cumulative counters — useful for "how many things happened total" but blind to what's happening right now. Operators looking at the dashboard during an incident can't answer:

- `runs_created_total` only tells you "since process start"
- there is a `worker_health_failures_total` counter, but no current state

This PR fills that gap with 4 gauge metrics, derived from the source-of-truth state at scrape time — no wiring across hot-paths.
## What this PR ships

`src/metrics.zig`
- `Metrics.Sample` struct (4 fields, `i64` each)
- `renderPrometheusWithSample(allocator, sample)` function that emits the existing 11 counter blocks followed by 4 `# TYPE ... gauge` blocks
- `renderPrometheus(allocator)` preserved as a thin wrapper that calls the new variant with a zero `Sample`. All 340 existing tests still pass.

`src/store.zig`
- 3 new helper methods that match the existing `countStepsByStatus`/`countRunningStepsByWorker` idiom (prepared statement, bind, single-row fetch):
  - `countRunsInFlight()` — `SELECT COUNT(*) FROM runs WHERE status IN ('running', 'pending')`
  - `countStepsRunning()` — `SELECT COUNT(*) FROM steps WHERE status = 'running'`
  - `countWorkersByStatus(status)` — generic, used here with `'active'`

`src/api.zig`
- `handleMetrics(ctx)` now calls a new `computeGaugeSample(ctx)` helper and passes the result into `renderPrometheusWithSample`
- `computeGaugeSample` falls back to `0` on any DB error, so a flaky query never poisons the entire `/metrics` response — the counter half stays valid

## Why "sample at scrape time" instead of inc/dec wiring
Run terminal transitions touch `engine.zig` in 16+ places (lines 419, 535, 563, 622, 643, 652, 698, 713, 721, 858, 953, 969, 979, 993, 1002, 1021, 1031); step transitions live in `store.zig:769`. Maintaining inc/dec calls at every one of those sites would be invasive and easy to get wrong; sampling at exposition time avoids that entirely. The cost is 4 indexed `SELECT COUNT(*)` queries per scrape (~once every 15 s in typical Prometheus configurations) — negligible against a SQLite DB.

## How to validate
Direct curl against a running binary. Expected: 4 new metric lines + 4 `# TYPE ... gauge` headers, in addition to the 11 existing counter lines.

To exercise `drain_mode`, toggle drain on the running instance and scrape `/metrics` again; the gauge should flip from 0 to 1.

## Why this is low-risk to review

- `/metrics` still responds with `text/plain` Prometheus exposition; the new lines are additive.
- `renderPrometheus(allocator)` keeps its old signature for any caller that didn't migrate; `incr()` is unchanged.

## Companion changes (separate PR)
A companion commit on `feat/grafana-dashboards` adds 4 stat panels to `nullboiler-overview.json` for these gauges. Reviewing this PR alone is fine — the dashboards just show "No data" until this PR merges and a fresh image is built.

## Note on end-to-end smoke
I was unable to perform live smoke against a running binary built from this branch because of an unrelated regression on `main` HEAD (the PR #5 `feat/zig-0.16` merge): every HTTP response succeeds at the application layer, but no bytes ever reach the client. Filed separately as a HIGH-priority bug report. The unit-test suite passes 340/340 against this branch.

## Follow-ups (not in this PR)

- Histogram-style metrics via `observe()` instrumentation or a multi-bucket counter pattern.
- An end-to-end test that asserts on real `/metrics` output, which would have caught the unrelated HTTP regression mentioned above.