feat(observability): add live-state gauges to /metrics by Kures · Pull Request #13 · nullclaw/nullboiler

Kures · 2026-05-12T07:36:11Z

PR4 — `feat(observability): add live-state gauges to /metrics`

Branch: feat/observability-gauges
Closes: the live-state half of P1-03 from reference/todo.md (alongside PR2's dashboards)

Why

The existing /metrics endpoint exposes 11 cumulative counters — useful for "how many things happened total" but blind to what's happening right now. Operators looking at the dashboard during an incident can't answer:

"How many runs are in flight?" — counter runs_created_total only tells you "since process start"
"Are any workers healthy at this moment?" — there's a worker_health_failures_total counter, but no current state
"Did anyone forget the orchestrator is in drain mode?" — no signal exposed

This PR fills that gap with 4 gauge metrics, derived from the source-of-truth state at scrape time — no wiring across hot-paths.

What this PR ships

`src/metrics.zig`

New Metrics.Sample struct (4 fields, i64 each)
New renderPrometheusWithSample(allocator, sample) function that emits the existing 11 counter blocks followed by 4 # TYPE ... gauge blocks
Existing renderPrometheus(allocator) preserved as a thin wrapper that calls the new variant with a zero Sample. All 340 existing tests still pass.

# TYPE nullboiler_runs_in_flight gauge
nullboiler_runs_in_flight 7
# TYPE nullboiler_steps_in_flight gauge
nullboiler_steps_in_flight 12
# TYPE nullboiler_workers_healthy gauge
nullboiler_workers_healthy 3
# TYPE nullboiler_drain_mode gauge
nullboiler_drain_mode 0
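The rendering order described above (counter blocks first, then the four gauge blocks, with the old entry point delegating to the new one) can be sketched as follows. This is a Python illustration of the shape, not the Zig code in this PR: the `counters` argument and the `runs_created_total` key are assumptions made for the example, since the real function takes an allocator and reads its counters from the metrics module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Point-in-time gauge values; all default to zero."""
    runs_in_flight: int = 0
    steps_in_flight: int = 0
    workers_healthy: int = 0
    drain_mode: int = 0


def render_prometheus_with_sample(counters: dict[str, int], sample: Sample) -> str:
    """Emit counter blocks first, then one gauge block per Sample field."""
    lines: list[str] = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    gauges = {
        "nullboiler_runs_in_flight": sample.runs_in_flight,
        "nullboiler_steps_in_flight": sample.steps_in_flight,
        "nullboiler_workers_healthy": sample.workers_healthy,
        "nullboiler_drain_mode": sample.drain_mode,
    }
    for name, value in gauges.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def render_prometheus(counters: dict[str, int]) -> str:
    """Backward-compatible wrapper: same output, with an all-zero Sample."""
    return render_prometheus_with_sample(counters, Sample())
```

Keeping the old entry point as a wrapper is what lets existing callers and tests compile unchanged; in this sketch they would simply see four extra gauge blocks with value 0.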

`src/store.zig`

3 new helper methods that match the existing countStepsByStatus / countRunningStepsByWorker idiom (prepared statement, bind, single-row fetch):

countRunsInFlight() — SELECT COUNT(*) FROM runs WHERE status IN ('running', 'pending')
countStepsRunning() — SELECT COUNT(*) FROM steps WHERE status = 'running'
countWorkersByStatus(status) — generic, used here with 'active'
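As a rough illustration of the three helpers, here is a Python sketch using the standard `sqlite3` module. The two fixed queries are the ones listed above; the `workers` table name and the single `status` column are assumptions for the example, since the PR description does not show the workers query or the schema.

```python
import sqlite3


def _count(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> int:
    """Prepare, bind, and fetch the single COUNT(*) row."""
    row = conn.execute(sql, params).fetchone()
    return int(row[0])


def count_runs_in_flight(conn: sqlite3.Connection) -> int:
    return _count(
        conn,
        "SELECT COUNT(*) FROM runs WHERE status IN ('running', 'pending')",
    )


def count_steps_running(conn: sqlite3.Connection) -> int:
    return _count(conn, "SELECT COUNT(*) FROM steps WHERE status = 'running'")


def count_workers_by_status(conn: sqlite3.Connection, status: str) -> int:
    # Table name `workers` is assumed; the status is bound, not interpolated.
    return _count(conn, "SELECT COUNT(*) FROM workers WHERE status = ?", (status,))
```

With an in-memory database holding one running, one pending, and one completed run, `count_runs_in_flight(conn)` would return 2.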

`src/api.zig`

handleMetrics(ctx) now calls a new computeGaugeSample(ctx) helper and passes the result into renderPrometheusWithSample
computeGaugeSample falls back to 0 on any DB error, so a flaky query never poisons the entire /metrics response — the counter half stays valid
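The fallback behaviour can be sketched in Python as below. This is an illustration under assumptions, not the Zig helper itself: the function names, the `draining` flag standing in for the drain atomic, and the `workers` table name are all hypothetical.

```python
import sqlite3
from typing import Callable


def sample_or_zero(query: Callable[[], int]) -> int:
    """Run one gauge query; on a database error report 0 instead of raising."""
    try:
        return query()
    except sqlite3.Error:
        return 0


def compute_gauge_sample(conn: sqlite3.Connection, draining: bool) -> dict[str, int]:
    """Sample all four gauges; each query fails independently of the others."""
    def count(sql: str, params: tuple = ()) -> Callable[[], int]:
        return lambda: int(conn.execute(sql, params).fetchone()[0])

    return {
        "nullboiler_runs_in_flight": sample_or_zero(
            count("SELECT COUNT(*) FROM runs WHERE status IN ('running', 'pending')")
        ),
        "nullboiler_steps_in_flight": sample_or_zero(
            count("SELECT COUNT(*) FROM steps WHERE status = 'running'")
        ),
        # `workers` is an assumed table name.
        "nullboiler_workers_healthy": sample_or_zero(
            count("SELECT COUNT(*) FROM workers WHERE status = ?", ("active",))
        ),
        # Drain state is read from process memory, so it needs no fallback.
        "nullboiler_drain_mode": 1 if draining else 0,
    }
```

One trade-off of this pattern is that a failed query and a genuine zero look the same on a dashboard, so a reader of the gauge cannot tell them apart from the value alone.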

Why "sample at scrape time" instead of inc/dec wiring

Run terminal transitions touch engine.zig in 17 places (lines 419, 535, 563, 622, 643, 652, 698, 713, 721, 858, 953, 969, 979, 993, 1002, 1021, 1031), and step transitions go through store.zig:769. Maintaining inc/dec calls at every site would be:

A maintenance treadmill — every new terminal transition needs new wiring
Easy to drift — miss a site and the gauge silently leaks
Invasive on hot-paths — any inc/dec runs on every state transition

Sampling at exposition time avoids all three. The cost is 3 indexed SELECT COUNT(*) queries plus one read of the existing drain atomic per scrape (roughly once every 15 s in typical Prometheus configurations), which is negligible against a SQLite DB.
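To make the drift argument concrete, the following Python sketch replays a list of run status transitions twice: once through inc/dec wiring where only some terminal states are wired, and once by counting the stored statuses. The terminal status names (`completed`, `failed`, `cancelled`) and the function name are hypothetical; only `running` and `pending` come from this PR.

```python
IN_FLIGHT = {"running", "pending"}


def replay(transitions, wired_terminals):
    """Replay (run_id, new_status) pairs.

    Returns (incdec_value, sampled_value): the gauge as maintained by inc/dec
    calls, and the gauge as derived from the final stored statuses.
    """
    gauge = 0
    status = {}
    for run_id, new_status in transitions:
        was_in_flight = status.get(run_id) in IN_FLIGHT
        status[run_id] = new_status
        if new_status in IN_FLIGHT and not was_in_flight:
            gauge += 1  # run entered an in-flight state
        elif was_in_flight and new_status not in IN_FLIGHT:
            # The decrement only happens if someone wired this call site.
            if new_status in wired_terminals:
                gauge -= 1
    sampled = sum(1 for s in status.values() if s in IN_FLIGHT)
    return gauge, sampled
```

If two runs start and finish as `completed` and `cancelled`, and the `cancelled` site was never wired, the inc/dec gauge stays at 1 while the sampled value is 0. The sampled value cannot drift this way because it is recomputed from the stored state on every scrape.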

How to validate

zig build test --summary all
# Build Summary: 9/9 steps succeeded; 340/340 tests passed

Direct curl against a running binary:

zig build
./zig-out/bin/nullboiler --port 8080 --db /tmp/test.db &
curl -s http://localhost:8080/metrics | grep -E "(in_flight|workers_healthy|drain_mode)"

Expected: 8 matching lines, made up of 4 # TYPE ... gauge headers and 4 sample lines. The 11 existing counters are filtered out by the grep but still appear in the unfiltered /metrics output.

To exercise drain_mode:

curl -X POST http://localhost:8080/admin/drain
curl -s http://localhost:8080/metrics | grep drain_mode  # → 1
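The same checks can be scripted rather than eyeballed. Below is a minimal Python sketch that parses exposition text and reports which of the four gauges are missing. It assumes unlabeled samples without timestamps, as in the example output earlier in this description; the function names are illustrative.

```python
EXPECTED_GAUGES = (
    "nullboiler_runs_in_flight",
    "nullboiler_steps_in_flight",
    "nullboiler_workers_healthy",
    "nullboiler_drain_mode",
)


def parse_gauges(exposition: str) -> dict[str, float]:
    """Return {name: value} for every sample declared as `# TYPE name gauge`."""
    gauge_names: set[str] = set()
    values: dict[str, float] = {}
    for raw in exposition.splitlines():
        line = raw.strip()
        if line.startswith("# TYPE "):
            parts = line.split()
            if len(parts) == 4 and parts[3] == "gauge":
                gauge_names.add(parts[2])
        elif line and not line.startswith("#"):
            # Unlabeled sample line: "<name> <value>".
            name, _, value = line.rpartition(" ")
            if name in gauge_names:
                values[name] = float(value)
    return values


def missing_gauges(exposition: str) -> list[str]:
    """List expected gauges that have no declared-and-sampled entry."""
    found = parse_gauges(exposition)
    return [name for name in EXPECTED_GAUGES if name not in found]
```

The input text could come from the curl command above or from `urllib.request.urlopen("http://localhost:8080/metrics")`; an empty list from `missing_gauges` means all four gauges were declared and sampled.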

Why this PR is low-risk to review

No business-logic touched. Engine, scheduler, dispatch, store CRUD — all unchanged. The patch is wholly additive in the observability layer.
No HTTP contract change. /metrics still responds with text/plain Prometheus exposition; the new lines are additive.
Existing API preserved. renderPrometheus(allocator) keeps its old signature for any caller that didn't migrate; incr() is unchanged.
Tests cover backward compatibility. All 340 pre-existing tests pass without modification.
Cheap to revert. Three files, ~80 lines added, zero deletions in business code.

Companion changes (separate PR)

A companion commit on feat/grafana-dashboards adds 4 stat panels to nullboiler-overview.json for these gauges. Reviewing this PR alone is fine — the dashboards just show "No data" until this PR merges and a fresh image is built.

Note on end-to-end smoke

I was unable to perform live smoke against a running binary built from this branch because of an unrelated regression on main HEAD (PR #5 feat/zig-0.16 merge): every HTTP response succeeds at the application layer but no bytes ever reach the client. Filed separately as a HIGH-priority bug report. The unit-test suite passes 340/340 against this branch.

Follow-ups (not in this PR)

Histogram metrics for request latency (p50/p95/p99) — separate PR; would require either inline observe() instrumentation or a multi-bucket counter pattern.
A small integration test that spawns the binary and asserts /metrics output, which would have caught the unrelated HTTP regression mentioned above.

Adds four gauge metrics derived from the source-of-truth state at Prometheus scrape time, complementing the existing 11 cumulative counters:

nullboiler_runs_in_flight: runs in 'running' or 'pending' state
nullboiler_steps_in_flight: steps currently 'running'
nullboiler_workers_healthy: workers in 'active' state
nullboiler_drain_mode: 1 when drain is active, 0 otherwise

Implementation samples the SoT (DB COUNT(*) queries and the existing drain atomic) at exposition time rather than maintaining inc/dec wiring across the engine and store hot-paths. Gauge values are accurate by construction and the patch makes zero changes to business-logic call sites.

Files touched:

src/metrics.zig
- new Metrics.Sample struct with the four gauge values
- new renderPrometheusWithSample(allocator, sample) that emits the 11 counters followed by 4 # TYPE ... gauge lines
- existing renderPrometheus(allocator) preserved as a thin wrapper so all 340 existing tests continue to pass unchanged

src/store.zig
- countRunsInFlight(): SELECT COUNT(*) FROM runs WHERE status IN ('running', 'pending')
- countStepsRunning(): SELECT COUNT(*) FROM steps WHERE status = 'running'
- countWorkersByStatus(status): generic, used here with 'active'
These match the existing prepare/finalize idiom used by countStepsByStatus and countRunningStepsByWorker in the same file.

src/api.zig
- handleMetrics now calls computeGaugeSample(ctx) and passes the result into renderPrometheusWithSample
- computeGaugeSample falls back to 0 on any DB error, so a flaky query never poisons the entire /metrics response and the counter half stays valid

Validation:

zig build test --summary all
Build Summary: 9/9 steps succeeded; 340/340 tests passed

Live smoke test against a running binary built from this branch is blocked by an unrelated regression on main (filed separately): every HTTP response is dropped before reaching the client because writer flush succeeds but bytes never leave userspace. The published ghcr.io/nullclaw/nullboiler:2026.3.2 image (built before that regression) does not yet expose these gauges, so end-to-end smoke will be possible after the upstream HTTP fix lands.

Closes the live-state half of P1-03 alongside the dashboard updates on feat/grafana-dashboards.

Kures wants to merge 1 commit into nullclaw:main from Kures:feat/observability-gauges
