
device-health-oracle: detect link impairment from monitoring data #3678

Draft
nikw9944 wants to merge 2 commits into main from nikw9944/doublezero-2652

Conversation


@nikw9944 nikw9944 commented May 6, 2026

Summary of Changes

  • Add bidirectional LinkHealth transitions to the device-health-oracle. The DHO previously had zero LinkCriterions configured — links auto-advanced from Pending to ReadyForService and stayed there forever, even when ISIS was down or every packet was being dropped. This PR adds the demotion path (and a slow recovery path back).
  • ReadyForService → Impaired triggers on the most recent link_rollup_5m bucket (fast detection, single bucket). Impaired → ReadyForService requires every distinct bucket in the recovery window to be clean (slow recovery, anti-flap).
  • Recovery window reuses the existing --drained-slot-count (~30 min default) — no new burn-in flag introduced. New --link-loss-threshold flag (default 5.0) controls the per-direction loss cutoff.
  • LinkHealth is reported as a signal only; the serviceability program does not gate link.status on it.
  • Fixes device-health-oracle: detect link impairment from monitoring data and update link.health #2652
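The asymmetric transition rule above (fast demotion on a single bucket, recovery only when the whole window is clean) can be sketched as follows. This is a minimal illustration, not the repository's code: the type and function names (`Bucket`, `impaired`, `next`) and the simplified bucket fields are assumptions.

```go
package main

import "fmt"

// LinkHealth models the three states discussed in this PR.
type LinkHealth int

const (
	Pending LinkHealth = iota
	ReadyForService
	Impaired
)

// Bucket is a simplified link_rollup_5m row: ISIS state plus
// per-direction packet loss percentages. Field names are illustrative.
type Bucket struct {
	ISISDown bool
	ALossPct float64
	ZLossPct float64
}

// impaired reports whether one bucket shows ISIS down or loss above
// the --link-loss-threshold cutoff (default 5.0) in either direction.
func impaired(b Bucket, threshold float64) bool {
	return b.ISISDown || b.ALossPct > threshold || b.ZLossPct > threshold
}

// next applies the asymmetric rule: demote on the latest bucket alone,
// recover only if every bucket in the recovery window is clean.
func next(state LinkHealth, latest Bucket, window []Bucket, threshold float64) LinkHealth {
	switch state {
	case ReadyForService:
		if impaired(latest, threshold) {
			return Impaired
		}
	case Impaired:
		for _, b := range window {
			if impaired(b, threshold) {
				return Impaired // one bad bucket blocks recovery
			}
		}
		return ReadyForService
	}
	return state
}

func main() {
	bad := Bucket{ISISDown: true, ALossPct: 100, ZLossPct: 100}
	clean := Bucket{}
	fmt.Println(next(ReadyForService, bad, nil, 5.0) == Impaired)
	fmt.Println(next(Impaired, clean, []Bucket{clean, bad}, 5.0) == Impaired)
	fmt.Println(next(Impaired, clean, []Bucket{clean, clean}, 5.0) == ReadyForService)
}
```

Note the deliberate asymmetry: a single bad latest bucket demotes immediately, while recovery has to survive the entire window, which is what keeps borderline links from flapping.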

Implementation notes

  • Stale-data floor: a "latest" bucket older than 15 min (3× rollup cadence) is treated as no data. Without this, a frozen telemetry pipeline at the moment of an ISIS flap would keep the link Impaired forever even after the link recovered.
  • Recovery query deduplicates late-arriving rows by argMax(..., ingested_at) per bucket, so a corrected re-write doesn't keep a link Impaired after every distinct bucket reads as clean.
  • --link-loss-threshold is validated at startup (must be finite and in [0, 100]) — a misconfiguration would otherwise produce an onchain write storm.
  • MetricCriterionResults is now incremented for link criteria too (parity with device criteria) so impairment/recovery rates are graphable.
  • Backwards compatible: deployments without ClickHouse continue to behave exactly as before — no demotion, no recovery, no impairment criterion is wired up.

Testing Verification

  • New unit tests cover all six transition cases (Pending/RFS/Impaired × pass/fail), the impairment-mode and recovery-mode criterion paths, the stale-bucket floor, threshold boundaries, no-data handling in both modes, ClickHouse error propagation, and the LinkBurnIn helper.
  • New ClickHouse tests assert the actual SQL shape: argMax-per-bucket dedupe, provisioning = false filter, double-quoted database name, and ? bind parameters.
  • Manually validated both queries against mainnet link_rollup_5m data via the DoubleZero MCP — known impaired link DzHDqj3cdi77eMLWKemdhfr6YZJeHHGxuysvAdekniC returns bad=5/total=5 over the last 30 min and the latest bucket reads isis_down=true, a_loss=100, z_loss=100, matching what the issue reports.
  • make go-test, golangci-lint, and go vet all clean.

nikw9944 added 2 commits May 6, 2026 16:41
Add bidirectional LinkHealth transitions backed by the existing
link_rollup_5m ClickHouse table:

- ReadyForService -> Impaired when the most recent rollup bucket has
  ISIS down or per-direction packet loss above --link-loss-threshold
  (default 5%).
- Impaired -> ReadyForService only after every bucket in the recovery
  window (reuses --drained-slot-count) is clean.

The asymmetry — fast demote, slow recover — keeps borderline links from
flapping while still surfacing real impairment quickly. LinkHealth is a
signal only; the serviceability program does not gate link.status on it.

Refs #2652
Architecture review (HIGH):
- Recovery SQL counts duplicate ingested_at rows rather than distinct
  buckets. Wrap in argMax-per-bucket subquery so a corrected late row
  cannot keep an Impaired link stuck even after every distinct bucket
  reads as clean.
- LinkHealthRecent had no recency floor — a stale latest bucket
  (telemetry pipeline broken) could indefinitely demote/keep a link
  impaired. Return bucket_ts and treat anything older than 15 minutes
  (3x rollup cadence) as no data.

Architecture review (MEDIUM):
- Increment MetricCriterionResults symmetrically in checkAllLink so
  link impairment/recovery rates are graphable.

Security review (LOW):
- Validate --link-loss-threshold is finite and within [0, 100] at
  startup; fail fast on misconfiguration.

Plus debuggability: include bucket timestamps in impairment fail
reasons; surface bad/total bucket counts in recovery debug logs and
the failure reason.

Refs #2652