feat(observability): add live-state gauges to /metrics #13

Open
Kures wants to merge 1 commit into nullclaw:main from Kures:feat/observability-gauges

Kures commented May 12, 2026

PR4 — feat(observability): add live-state gauges to /metrics

Branch: feat/observability-gauges
Closes: the live-state half of P1-03 from reference/todo.md (alongside PR2's dashboards)

Why

The existing /metrics endpoint exposes 11 cumulative counters — useful for "how many things happened total" but blind to what's happening right now. Operators looking at the dashboard during an incident can't answer:

  • "How many runs are in flight?" — counter runs_created_total only tells you "since process start"
  • "Are any workers healthy at this moment?" — there's a worker_health_failures_total counter, but no current state
  • "Did anyone forget the orchestrator is in drain mode?" — no signal exposed

This PR fills that gap with 4 gauge metrics, derived from the source-of-truth state at scrape time — no wiring across hot-paths.

What this PR ships

src/metrics.zig

  • New Metrics.Sample struct (4 fields, i64 each)
  • New renderPrometheusWithSample(allocator, sample) function that emits the existing 11 counter blocks followed by 4 # TYPE ... gauge blocks
  • Existing renderPrometheus(allocator) preserved as a thin wrapper that calls the new variant with a zero Sample. All 340 existing tests still pass.

Example gauge output:

# TYPE nullboiler_runs_in_flight gauge
nullboiler_runs_in_flight 7
# TYPE nullboiler_steps_in_flight gauge
nullboiler_steps_in_flight 12
# TYPE nullboiler_workers_healthy gauge
nullboiler_workers_healthy 3
# TYPE nullboiler_drain_mode gauge
nullboiler_drain_mode 0
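
As a rough illustration of how the gauge half of the output could be produced, here is a minimal sketch. It is not the code in src/metrics.zig: the field names on Sample and the renderGauges helper are assumptions made for this example, and it writes into a caller-supplied buffer instead of using an allocator so that it stays self-contained.

```zig
const std = @import("std");

// Hypothetical mirror of Metrics.Sample: the PR states four i64 fields,
// but the field names used here are assumptions.
const Sample = struct {
    runs_in_flight: i64 = 0,
    steps_in_flight: i64 = 0,
    workers_healthy: i64 = 0,
    drain_mode: i64 = 0,
};

// Formats the four gauge blocks into `buf` and returns the written slice.
// Returns error.NoSpaceLeft if `buf` is too small.
fn renderGauges(buf: []u8, s: Sample) ![]u8 {
    return std.fmt.bufPrint(buf,
        \\# TYPE nullboiler_runs_in_flight gauge
        \\nullboiler_runs_in_flight {d}
        \\# TYPE nullboiler_steps_in_flight gauge
        \\nullboiler_steps_in_flight {d}
        \\# TYPE nullboiler_workers_healthy gauge
        \\nullboiler_workers_healthy {d}
        \\# TYPE nullboiler_drain_mode gauge
        \\nullboiler_drain_mode {d}
        \\
    , .{ s.runs_in_flight, s.steps_in_flight, s.workers_healthy, s.drain_mode });
}
```

In this sketch a default-initialized Sample renders every gauge as 0, which matches the description of the renderPrometheus(allocator) wrapper passing a zero Sample.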

src/store.zig

3 new helper methods that match the existing countStepsByStatus / countRunningStepsByWorker idiom (prepared statement, bind, single-row fetch):

  • countRunsInFlight(): SELECT COUNT(*) FROM runs WHERE status IN ('running', 'pending')
  • countStepsRunning(): SELECT COUNT(*) FROM steps WHERE status = 'running'
  • countWorkersByStatus(status) — generic, used here with 'active'

src/api.zig

  • handleMetrics(ctx) now calls a new computeGaugeSample(ctx) helper and passes the result into renderPrometheusWithSample
  • computeGaugeSample falls back to 0 on any DB error, so a flaky query never poisons the entire /metrics response — the counter half stays valid
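
The fallback behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the code in src/api.zig: FakeStore stands in for the real store, QueryError is an invented error set, the Sample field names are guesses, and the drain flag is assumed to be a std.atomic.Value(bool).

```zig
const std = @import("std");

// Hypothetical field names; the PR only states "4 fields, i64 each".
const Sample = struct {
    runs_in_flight: i64 = 0,
    steps_in_flight: i64 = 0,
    workers_healthy: i64 = 0,
    drain_mode: i64 = 0,
};

// Invented error set standing in for whatever the real store returns.
const QueryError = error{QueryFailed};

// Stand-in for the store: countStepsRunning can be made to fail on demand.
const FakeStore = struct {
    fail_steps: bool = false,

    fn countRunsInFlight(_: *const FakeStore) QueryError!i64 {
        return 7;
    }
    fn countStepsRunning(self: *const FakeStore) QueryError!i64 {
        if (self.fail_steps) return error.QueryFailed;
        return 12;
    }
    fn countWorkersByStatus(_: *const FakeStore, status: []const u8) QueryError!i64 {
        return if (std.mem.eql(u8, status, "active")) 3 else 0;
    }
};

// Each gauge is sampled independently; a failed query degrades only that
// gauge to 0 instead of failing the whole scrape.
fn computeGaugeSample(store: *const FakeStore, drain: *const std.atomic.Value(bool)) Sample {
    return .{
        .runs_in_flight = store.countRunsInFlight() catch 0,
        .steps_in_flight = store.countStepsRunning() catch 0,
        .workers_healthy = store.countWorkersByStatus("active") catch 0,
        .drain_mode = @intFromBool(drain.load(.acquire)),
    };
}
```

One consequence of this design is that a gauge reading 0 because of a failed query looks the same as a genuine 0, so a dashboard cannot tell the two apart from the gauge alone.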

Why "sample at scrape time" instead of inc/dec wiring

Run terminal transitions touch engine.zig in 17 places (lines 419, 535, 563, 622, 643, 652, 698, 713, 721, 858, 953, 969, 979, 993, 1002, 1021, 1031); step transitions go through store.zig:769. Maintaining inc/dec calls at every site would be:

  • A maintenance treadmill — every new terminal transition needs new wiring
  • Easy to drift — miss a site and the gauge silently leaks
  • Invasive on hot-paths — any inc/dec runs on every state transition

Sampling at exposition time avoids all three. The cost per scrape is 3 indexed SELECT COUNT(*) queries plus one load of the existing drain atomic. At a typical Prometheus scrape interval of about 15 s, that is negligible for a SQLite DB.
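
To make the drift risk concrete, here is a toy sketch contrasting a hand-wired inc/dec gauge with a value sampled from state. Everything in it (Status, WiredGauge, simulateMissedSite) is hypothetical and unrelated to the real engine code; it only models one terminal transition that forgets to decrement.

```zig
const std = @import("std");

const Status = enum { pending, running, done, failed };

// Hand-wired gauge: correct only if every transition site calls inc/dec.
const WiredGauge = struct {
    value: i64 = 0,

    fn inc(self: *WiredGauge) void {
        self.value += 1;
    }
    fn dec(self: *WiredGauge) void {
        self.value -= 1;
    }
};

// Sampled gauge: counts non-terminal runs directly, as a COUNT(*) would.
fn sampleInFlight(runs: []const Status) i64 {
    var n: i64 = 0;
    for (runs) |s| {
        if (s == .pending or s == .running) n += 1;
    }
    return n;
}

const Drift = struct { wired: i64, sampled: i64 };

// Two runs start; one finishes through a wired site, the other through a
// site where the dec() call was forgotten.
fn simulateMissedSite() Drift {
    var runs = [_]Status{ .running, .running };
    var gauge = WiredGauge{};
    gauge.inc();
    gauge.inc();

    runs[0] = .done;
    gauge.dec(); // wired terminal site

    runs[1] = .failed; // terminal site that forgot gauge.dec()

    return .{ .wired = gauge.value, .sampled = sampleInFlight(&runs) };
}
```

In this scenario the wired gauge stays at 1 indefinitely while the sampled value is 0, which is the silent leak described in the list above.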

How to validate

zig build test --summary all
# Build Summary: 9/9 steps succeeded; 340/340 tests passed

Direct curl against a running binary:

zig build
./zig-out/bin/nullboiler --port 8080 --db /tmp/test.db &
curl -s http://localhost:8080/metrics | grep -E "(in_flight|workers_healthy|drain_mode)"

Expected: 4 # TYPE ... gauge headers and 4 sample lines for the new gauges. The 11 existing counter blocks are still present in the unfiltered output.

To exercise drain_mode:

curl -X POST http://localhost:8080/admin/drain
curl -s http://localhost:8080/metrics | grep '^nullboiler_drain_mode'  # → nullboiler_drain_mode 1

Why this is low-risk to review

  • No business logic touched. Engine, scheduler, dispatch, and store CRUD are all unchanged. The patch is purely additive and confined to the observability layer.
  • No HTTP contract change. /metrics still responds with text/plain Prometheus exposition; the new lines are additive.
  • Existing API preserved. renderPrometheus(allocator) keeps its old signature for any caller that didn't migrate; incr() is unchanged.
  • Tests cover backward compatibility. All 340 pre-existing tests pass without modification.
  • Cheap to revert. Three files, ~80 lines added, zero deletions in business code.

Companion changes (separate PR)

A companion commit on feat/grafana-dashboards adds 4 stat panels to nullboiler-overview.json for these gauges. Reviewing this PR alone is fine — the dashboards just show "No data" until this PR merges and a fresh image is built.

Note on end-to-end smoke testing

I was unable to run a live smoke test against a binary built from this branch because of an unrelated regression on main HEAD (PR #5 feat/zig-0.16 merge): every HTTP response succeeds at the application layer but no bytes ever reach the client. Filed separately as a HIGH-priority bug report. The unit-test suite passes 340/340 against this branch.

Follow-ups (not in this PR)

  • Histogram metrics for request latency (p50/p95/p99) — separate PR; would require either inline observe() instrumentation or a multi-bucket counter pattern.
  • A small integration test that spawns the binary and asserts /metrics output, which would have caught the unrelated HTTP regression mentioned above.
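
For the integration-test follow-up, the assertion side could look roughly like the sketch below. It covers only parsing a /metrics body that has already been fetched; spawning the binary and making the HTTP request are left out. The gaugeValue helper is hypothetical, and it handles only unlabelled sample lines such as those emitted for the four gauges.

```zig
const std = @import("std");

// Returns the integer value of the first unlabelled sample line for `name`,
// or null if no such line exists or its value is not an integer.
fn gaugeValue(body: []const u8, name: []const u8) ?i64 {
    var lines = std.mem.splitScalar(u8, body, '\n');
    while (lines.next()) |line| {
        // Skip blank lines and "# TYPE" / "# HELP" comments.
        if (line.len == 0 or line[0] == '#') continue;
        if (!std.mem.startsWith(u8, line, name)) continue;
        const rest = line[name.len..];
        // Require "<name> <value>" so that longer names sharing the same
        // prefix do not match.
        if (rest.len < 2 or rest[0] != ' ') continue;
        return std.fmt.parseInt(i64, rest[1..], 10) catch null;
    }
    return null;
}
```

A test built on this could assert that each of the four gauge names yields a non-null value, and that nullboiler_drain_mode reads 1 after the drain endpoint has been called.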

Adds four gauge metrics derived from the source-of-truth state at
Prometheus scrape time, complementing the existing 11 cumulative
counters:

  nullboiler_runs_in_flight    runs in 'running' or 'pending' state
  nullboiler_steps_in_flight   steps currently 'running'
  nullboiler_workers_healthy   workers in 'active' state
  nullboiler_drain_mode        1 when drain is active, 0 otherwise

Implementation samples the SoT — DB COUNT(*) queries and the existing
drain atomic — at exposition time rather than maintaining inc/dec
wiring across the engine and store hot-paths. Gauge values are
accurate by construction and the patch makes zero changes to
business-logic call sites.

Files touched:

  src/metrics.zig
    - new Metrics.Sample struct with the four gauge values
    - new renderPrometheusWithSample(allocator, sample) that emits the
      11 counters followed by 4 # TYPE ... gauge lines
    - existing renderPrometheus(allocator) preserved as a thin wrapper
      so all 340 existing tests continue to pass unchanged

  src/store.zig
    - countRunsInFlight(): SELECT COUNT(*) FROM runs WHERE status IN
      ('running', 'pending')
    - countStepsRunning(): SELECT COUNT(*) FROM steps WHERE status =
      'running'
    - countWorkersByStatus(status): generic, used here with 'active'
    All three match the existing prepare/finalize idiom used by
    countStepsByStatus and countRunningStepsByWorker in the same file.

  src/api.zig
    - handleMetrics now calls computeGaugeSample(ctx) and passes the
      result into renderPrometheusWithSample
    - computeGaugeSample falls back to 0 on any DB error so a flaky
      query never poisons the entire /metrics response — the counter
      half stays valid

Validation:

  zig build test --summary all
  Build Summary: 9/9 steps succeeded; 340/340 tests passed

Live smoke test against a running binary built from this branch is
blocked by an unrelated regression on main (filed separately): every
HTTP response is dropped before reaching the client because writer
flush succeeds but bytes never leave userspace. The published
ghcr.io/nullclaw/nullboiler:2026.3.2 image (built before that
regression) does not yet expose these gauges, so end-to-end smoke
will be possible after the upstream HTTP fix lands.

Closes the live-state half of P1-03 alongside the dashboard updates
on feat/grafana-dashboards.