fix: pre-register auto top-up Prometheus counters #812

Closed
pjb157 wants to merge 1 commit into main from fix/pre-register-auto-topup-metrics

pjb157 commented Mar 6, 2026

Summary

  • Pre-registers all four dwctl_auto_topup_* counters at the top of process_auto_topups() so they appear in Prometheus as 0 from the first poll cycle
  • Fixes two metrics (credit_failures_total, errors_total) that were invisible to Prometheus because their error paths have never been triggered in production

Problem

The metrics crate lazily registers counters — they only exist in the /metrics endpoint after the first .increment() call. Since dwctl_auto_topup_credit_failures_total (P1: user charged but credits not recorded) and dwctl_auto_topup_errors_total (idempotency/payment method lookup failures) have never fired in production, they don't exist in Prometheus at all.

This causes two issues:

  1. Grafana alerts referencing these metrics evaluate to "no data" instead of 0, so the P1 credit-failure alert and P2 infra-failure alert would never fire even if the error paths were hit for the first time
  2. Dashboard panels that divide by a sum of all auto top-up counters return NaN because part of the denominator is missing

Fix

Obtain Counter handles (without incrementing) at the top of process_auto_topups(). This registers them with the Prometheus exporter as 0 on the first poll cycle. The counters then exist continuously and will correctly increment when error paths are hit.

use metrics::counter;

// Obtaining a handle registers the series without incrementing it,
// so each counter is exported as 0 until its code path first fires.
let _ = counter!("dwctl_auto_topup_success_total");
let _ = counter!("dwctl_auto_topup_charge_failures_total");
let _ = counter!("dwctl_auto_topup_credit_failures_total");
let _ = counter!("dwctl_auto_topup_errors_total", "stage" => "idempotency_check");
let _ = counter!("dwctl_auto_topup_errors_total", "stage" => "payment_method_lookup");
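
To make the difference between lazy registration and pre-registration concrete, here is a minimal sketch using a toy registry. It is a stand-in written for illustration only: `ToyRegistry`, `scrape_lazy`, and `scrape_preregistered` are hypothetical names, the `stage` label is left out for brevity, and none of this is the `metrics` crate API or the actual dwctl code.

```rust
use std::collections::BTreeMap;

/// The four auto top-up metric names from this PR.
const AUTO_TOPUP_COUNTERS: [&str; 4] = [
    "dwctl_auto_topup_success_total",
    "dwctl_auto_topup_charge_failures_total",
    "dwctl_auto_topup_credit_failures_total",
    "dwctl_auto_topup_errors_total",
];

/// Toy stand-in for a metrics registry (hypothetical, not the `metrics` crate).
#[derive(Default)]
struct ToyRegistry {
    counters: BTreeMap<String, u64>,
}

impl ToyRegistry {
    /// Obtain a handle without incrementing: creates the series at 0 if absent.
    fn register(&mut self, name: &str) {
        self.counters.entry(name.to_string()).or_insert(0);
    }

    /// Increment a counter, creating the series lazily on first use.
    fn increment(&mut self, name: &str) {
        *self.counters.entry(name.to_string()).or_insert(0) += 1;
    }

    /// Render a simplified exposition: one `name value` pair per line.
    fn render(&self) -> String {
        self.counters
            .iter()
            .map(|(name, value)| format!("{name} {value}\n"))
            .collect()
    }
}

/// Lazy behaviour: only the counter that has fired shows up in a scrape.
fn scrape_lazy() -> String {
    let mut registry = ToyRegistry::default();
    registry.increment("dwctl_auto_topup_success_total");
    registry.render()
}

/// Pre-registered behaviour: every counter shows up, untouched ones as 0.
fn scrape_preregistered() -> String {
    let mut registry = ToyRegistry::default();
    for name in AUTO_TOPUP_COUNTERS {
        registry.register(name);
    }
    registry.increment("dwctl_auto_topup_success_total");
    registry.render()
}
```

In this model `scrape_lazy()` contains no line at all for the failure and error counters, while `scrape_preregistered()` lists them with value 0, which is the state the alerts and dashboard panels need.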

Test plan

  • cargo check passes cleanly (no warnings)
  • After deploy, verify all four dwctl_auto_topup_* metrics appear in Prometheus with value 0
  • Verify Grafana alerts for auto top-up evaluate to "OK" (not "no data")
  • Verify the Operations Overview dashboard "Auto Top-Up Activity" panel renders without NaN
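
The post-deploy check on the /metrics output could be scripted roughly as follows. This is a sketch under assumptions: the response body has already been fetched by some other means, it uses the Prometheus text exposition format, and `missing_metrics` and `sample_name` are hypothetical helpers that are not part of the PR.

```rust
/// Extract the metric name from one exposition line.
/// Returns None for blank lines and `# HELP` / `# TYPE` comment lines.
fn sample_name(line: &str) -> Option<&str> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // The name ends at the first '{' (start of labels) or whitespace (before the value).
    let end = line
        .find(|c: char| c == '{' || c.is_whitespace())
        .unwrap_or(line.len());
    Some(&line[..end])
}

/// Return the expected metric names that have no sample line in `body`.
fn missing_metrics(body: &str, expected: &[&str]) -> Vec<String> {
    expected
        .iter()
        .filter(|name| !body.lines().any(|line| sample_name(line) == Some(**name)))
        .map(|name| name.to_string())
        .collect()
}
```

Calling `missing_metrics` with the four `dwctl_auto_topup_*` names should return an empty vector once pre-registration is in effect. A non-empty result names the counters that are still absent from the scrape.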

The metrics crate lazily registers counters — they only appear in the
Prometheus /metrics endpoint after the first .increment() call. Two of
the four auto top-up counters (credit_failures and errors) have never
been triggered in production, so they don't exist in Prometheus at all.

This breaks Grafana alerts that reference these metrics (they evaluate
to "no data" instead of 0) and causes NaN in dashboard panels that
divide by a sum including missing metrics.

Fix by obtaining Counter handles at the top of process_auto_topups(),
which registers them with the Prometheus exporter as 0 on the first
poll cycle, well before any error path needs to fire.
Copilot AI review requested due to automatic review settings March 6, 2026 18:02
@cloudflare-workers-and-pages

Deploying control-layer with Cloudflare Pages

Latest commit: ebb0a98
Status: ✅  Deploy successful!
Preview URL: https://75300c5e.control-layer.pages.dev
Branch Preview URL: https://fix-pre-register-auto-topup.control-layer.pages.dev


Copilot AI left a comment

Pull request overview

This PR ensures auto top-up Prometheus counters are visible immediately (as 0) by pre-registering the dwctl_auto_topup_* counter series when process_auto_topups() starts, preventing “no data” alert/panel behavior in Grafana/Prometheus when certain error paths haven’t occurred yet.

Changes:

  • Pre-register dwctl_auto_topup_success_total, dwctl_auto_topup_charge_failures_total, dwctl_auto_topup_credit_failures_total, and dwctl_auto_topup_errors_total{stage=...} counter series at the start of process_auto_topups().
  • Ensures the two previously never-triggered error counters are exported from the first poll cycle onward.

pjb157 closed this Mar 13, 2026