
device-health-oracle: detect link impairment from monitoring data #3678

Draft
nikw9944 wants to merge 2 commits into main from nikw9944/doublezero-2652

Conversation


@nikw9944 nikw9944 commented May 6, 2026

Summary of Changes

  • Add bidirectional LinkHealth transitions to the device-health-oracle. The DHO previously had zero LinkCriterions configured — links auto-advanced from Pending to ReadyForService and stayed there forever, even when ISIS was down or every packet was being dropped. This PR adds the demotion path (and a slow recovery path back).
  • ReadyForService → Impaired triggers on the most recent link_rollup_5m bucket (fast detection, single bucket). Impaired → ReadyForService requires every distinct bucket in the recovery window to be clean (slow recovery, anti-flap).
  • Recovery window reuses the existing --drained-slot-count (~30 min default) — no new burn-in flag introduced. New --link-loss-threshold flag (default 5.0) controls the per-direction loss cutoff.
  • LinkHealth is reported as a signal only; the serviceability program does not gate link.status on it.
  • Fixes device-health-oracle: detect link impairment from monitoring data and update link.health #2652
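The asymmetric transition rule above (fast demotion on a single bucket, recovery only when the whole window is clean) can be sketched as follows. This is a minimal illustration, not the repository's code: the type and function names (`Bucket`, `impaired`, `next`) and the simplified bucket fields are assumptions.

```go
package main

import "fmt"

// LinkHealth models the three states discussed in this PR.
type LinkHealth int

const (
	Pending LinkHealth = iota
	ReadyForService
	Impaired
)

// Bucket is a simplified link_rollup_5m row: ISIS state plus
// per-direction packet loss percentages. Field names are illustrative.
type Bucket struct {
	ISISDown bool
	ALossPct float64
	ZLossPct float64
}

// impaired reports whether one bucket shows ISIS down or loss above
// the --link-loss-threshold cutoff (default 5.0) in either direction.
func impaired(b Bucket, threshold float64) bool {
	return b.ISISDown || b.ALossPct > threshold || b.ZLossPct > threshold
}

// next applies the asymmetric rule: demote on the latest bucket alone,
// recover only if every bucket in the recovery window is clean.
func next(state LinkHealth, latest Bucket, window []Bucket, threshold float64) LinkHealth {
	switch state {
	case ReadyForService:
		if impaired(latest, threshold) {
			return Impaired
		}
	case Impaired:
		for _, b := range window {
			if impaired(b, threshold) {
				return Impaired // one bad bucket blocks recovery
			}
		}
		return ReadyForService
	}
	return state
}

func main() {
	bad := Bucket{ISISDown: true, ALossPct: 100, ZLossPct: 100}
	clean := Bucket{}
	fmt.Println(next(ReadyForService, bad, nil, 5.0) == Impaired)
	fmt.Println(next(Impaired, clean, []Bucket{clean, bad}, 5.0) == Impaired)
	fmt.Println(next(Impaired, clean, []Bucket{clean, clean}, 5.0) == ReadyForService)
}
```

Note the deliberate asymmetry: a single bad latest bucket demotes immediately, while recovery has to survive the entire window, which is what keeps borderline links from flapping.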

Implementation notes

  • Stale-data floor: a "latest" bucket older than 15 min (3× rollup cadence) is treated as no data. Without this, a frozen telemetry pipeline at the moment of an ISIS flap would keep the link Impaired forever even after the link recovered.
  • Recovery query deduplicates late-arriving rows by argMax(..., ingested_at) per bucket, so a corrected re-write doesn't keep a link Impaired after every distinct bucket reads as clean.
  • --link-loss-threshold is validated at startup (must be finite and in [0, 100]) — a misconfiguration would otherwise produce an onchain write storm.
  • MetricCriterionResults is now incremented for link criteria too (parity with device criteria) so impairment/recovery rates are graphable.
  • Backwards compatible: deployments without ClickHouse continue to behave exactly as before — no demotion, no recovery, no impairment criterion is wired up.

Testing Verification

  • New unit tests cover all six transition cases (Pending/RFS/Impaired × pass/fail), the impairment-mode and recovery-mode criterion paths, the stale-bucket floor, threshold boundaries, no-data handling in both modes, ClickHouse error propagation, and the LinkBurnIn helper.
  • New ClickHouse tests assert the actual SQL shape: argMax-per-bucket dedupe, provisioning = false filter, double-quoted database name, and ? bind parameters.
  • Manually validated both queries against mainnet link_rollup_5m data via the DoubleZero MCP — known impaired link DzHDqj3cdi77eMLWKemdhfr6YZJeHHGxuysvAdekniC returns bad=5/total=5 over the last 30 min and the latest bucket reads isis_down=true, a_loss=100, z_loss=100, matching what the issue reports.
  • make go-test, golangci-lint, and go vet all clean.

nikw9944 added 2 commits May 6, 2026 16:41
Add bidirectional LinkHealth transitions backed by the existing
link_rollup_5m ClickHouse table:

- ReadyForService -> Impaired when the most recent rollup bucket has
  ISIS down or per-direction packet loss above --link-loss-threshold
  (default 5%).
- Impaired -> ReadyForService only after every bucket in the recovery
  window (reuses --drained-slot-count) is clean.

The asymmetry — fast demote, slow recover — keeps borderline links from
flapping while still surfacing real impairment quickly. LinkHealth is a
signal only; the serviceability program does not gate link.status on it.

Refs #2652
Architecture review (HIGH):
- Recovery SQL counts duplicate ingested_at rows rather than distinct
  buckets. Wrap in argMax-per-bucket subquery so a corrected late row
  cannot keep an Impaired link stuck even after every distinct bucket
  reads as clean.
- LinkHealthRecent had no recency floor — a stale latest bucket
  (telemetry pipeline broken) could indefinitely demote/keep a link
  impaired. Return bucket_ts and treat anything older than 15 minutes
  (3x rollup cadence) as no data.

Architecture review (MEDIUM):
- Increment MetricCriterionResults symmetrically in checkAllLink so
  link impairment/recovery rates are graphable.

Security review (LOW):
- Validate --link-loss-threshold is finite and within [0, 100] at
  startup; fail fast on misconfiguration.

Plus debuggability: include bucket timestamps in impairment fail
reasons; surface bad/total bucket counts in recovery debug logs and
the failure reason.

Refs #2652