device-health-oracle: detect link impairment from monitoring data#3678
Draft
Add bidirectional LinkHealth transitions backed by the existing link_rollup_5m ClickHouse table:

- ReadyForService -> Impaired when the most recent rollup bucket has ISIS down or per-direction packet loss above --link-loss-threshold (default 5%).
- Impaired -> ReadyForService only after every bucket in the recovery window (reuses --drained-slot-count) is clean.

The asymmetry (fast demote, slow recover) keeps borderline links from flapping while still surfacing real impairment quickly. LinkHealth is a signal only; the serviceability program does not gate link.status on it.

Refs #2652
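The asymmetric rule above can be sketched as a small pure function. This is an illustration only, not the oracle's code: the `Bucket` type, the `NextHealth` name, and the choice to leave the state unchanged on an empty window are all assumptions made for the example.

```go
package main

import "fmt"

// LinkHealth mirrors the two states discussed in this PR. The type and
// constant definitions are illustrative, not the oracle's real ones.
type LinkHealth int

const (
	ReadyForService LinkHealth = iota
	Impaired
)

// Bucket is a hypothetical in-memory view of one link_rollup_5m row.
type Bucket struct {
	ISISDown bool
	ALossPct float64 // packet loss in one direction, percent
	ZLossPct float64 // packet loss in the other direction, percent
}

// bad reports whether the bucket counts as impaired: ISIS down, or loss
// strictly above the threshold in either direction.
func (b Bucket) bad(lossThresholdPct float64) bool {
	return b.ISISDown || b.ALossPct > lossThresholdPct || b.ZLossPct > lossThresholdPct
}

// NextHealth demotes on a single bad latest bucket and recovers only when
// every bucket in the window is clean. window is ordered oldest to newest.
// Assumption: an empty window (no data) leaves the state unchanged.
func NextHealth(cur LinkHealth, window []Bucket, lossThresholdPct float64) LinkHealth {
	if len(window) == 0 {
		return cur
	}
	switch cur {
	case ReadyForService:
		if window[len(window)-1].bad(lossThresholdPct) {
			return Impaired
		}
	case Impaired:
		for _, b := range window {
			if b.bad(lossThresholdPct) {
				return Impaired
			}
		}
		return ReadyForService
	}
	return cur
}

func main() {
	clean := Bucket{}
	lossy := Bucket{ALossPct: 12}
	// Fast demote: only the newest bucket matters.
	fmt.Println(NextHealth(ReadyForService, []Bucket{clean, lossy}, 5) == Impaired)
	// Slow recover: one bad bucket anywhere in the window blocks recovery.
	fmt.Println(NextHealth(Impaired, []Bucket{lossy, clean}, 5) == Impaired)
	fmt.Println(NextHealth(Impaired, []Bucket{clean, clean}, 5) == ReadyForService)
}
```

Keeping the decision as a function of the current state and the window makes both directions easy to unit test without a ClickHouse connection.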
Architecture review (HIGH):

- Recovery SQL counts duplicate ingested_at rows rather than distinct buckets. Wrap in argMax-per-bucket subquery so a corrected late row cannot keep an Impaired link stuck even after every distinct bucket reads as clean.
- LinkHealthRecent had no recency floor: a stale latest bucket (telemetry pipeline broken) could indefinitely demote a link or keep it impaired. Return bucket_ts and treat anything older than 15 minutes (3x rollup cadence) as no data.

Architecture review (MEDIUM):

- Increment MetricCriterionResults symmetrically in checkAllLink so link impairment/recovery rates are graphable.

Security review (LOW):

- Validate --link-loss-threshold is finite and within [0, 100] at startup; fail fast on misconfiguration.

Plus debuggability: include bucket timestamps in impairment fail reasons; surface bad/total bucket counts in recovery debug logs and the failure reason.

Refs #2652
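A minimal sketch of the two guards described above, the startup check on the loss threshold and the 15-minute recency floor. The function names `validateLossThreshold` and `tooStale` and the constant `maxBucketAge` are hypothetical and chosen for this example; the real oracle may structure these checks differently.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// maxBucketAge is 3x the 5-minute rollup cadence (hypothetical constant name).
const maxBucketAge = 15 * time.Minute

// validateLossThreshold rejects NaN, +/-Inf, and values outside [0, 100].
func validateLossThreshold(pct float64) error {
	if math.IsNaN(pct) || math.IsInf(pct, 0) {
		return fmt.Errorf("link loss threshold must be finite, got %v", pct)
	}
	if pct < 0 || pct > 100 {
		return fmt.Errorf("link loss threshold must be within [0, 100], got %v", pct)
	}
	return nil
}

// tooStale reports whether a bucket of the given age should be treated as
// "no data" rather than as evidence of impairment or health.
func tooStale(age time.Duration) bool {
	return age > maxBucketAge
}

func main() {
	// Fail fast at startup on a misconfigured threshold.
	if err := validateLossThreshold(5.0); err != nil {
		panic(err)
	}
	fmt.Println(validateLossThreshold(math.Inf(1)) != nil) // rejected

	// A bucket from 20 minutes ago is ignored; one from 10 minutes ago is used.
	now := time.Now()
	fmt.Println(tooStale(now.Sub(now.Add(-20 * time.Minute))))
	fmt.Println(tooStale(now.Sub(now.Add(-10 * time.Minute))))
}
```

The NaN check matters because every comparison against NaN is false, so a NaN threshold would slip through a plain range test.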
Summary of Changes

- Adds bidirectional `LinkHealth` transitions to the device-health-oracle. The DHO previously had zero `LinkCriterion`s configured: links auto-advanced from Pending to ReadyForService and stayed there forever, even when ISIS was down or every packet was being dropped. This PR adds the demotion path (and a slow recovery path back).
- `ReadyForService → Impaired` triggers on the most recent `link_rollup_5m` bucket (fast detection, single bucket).
- `Impaired → ReadyForService` requires every distinct bucket in the recovery window to be clean (slow recovery, anti-flap).
- The recovery window reuses `--drained-slot-count` (~30 min default), so no new burn-in flag is introduced. The new `--link-loss-threshold` flag (default 5.0) controls the per-direction loss cutoff.
- `LinkHealth` is reported as a signal only; the serviceability program does not gate `link.status` on it.

Implementation notes

- […] `Impaired` forever even after the link recovered.
- […] `argMax(..., ingested_at)` per bucket, so a corrected re-write doesn't keep a link `Impaired` after every distinct bucket reads as clean.
- `--link-loss-threshold` is validated at startup (must be finite and in `[0, 100]`); a misconfiguration would otherwise produce an onchain write storm.
- `MetricCriterionResults` is now incremented for link criteria too (parity with device criteria) so impairment/recovery rates are graphable.

Testing Verification

- […] `LinkBurnIn` helper.
- […] `provisioning = false` filter, double-quoted database name, and `?` bind parameters.
- […] `link_rollup_5m` data via the DoubleZero MCP: known impaired link `DzHDqj3cdi77eMLWKemdhfr6YZJeHHGxuysvAdekniC` returns `bad=5/total=5` over the last 30 min and the latest bucket reads `isis_down=true, a_loss=100, z_loss=100`, matching what the issue reports.
- `make go-test`, `golangci-lint`, and `go vet` all clean.
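To illustrate why the recovery check must count distinct buckets rather than raw rows, here is an in-memory sketch of the per-bucket dedupe that `argMax(..., ingested_at)` performs in SQL. The `Row` type, its field names, and both function names are assumptions for the example; they are not the table's actual schema or the PR's code.

```go
package main

import "fmt"

// Row is a hypothetical shape for one ingested rollup row. Timestamps are
// Unix seconds to keep the example free of time-zone concerns.
type Row struct {
	BucketUnix   int64 // start of the 5-minute bucket
	IngestedUnix int64 // when this version of the row was written
	Bad          bool  // ISIS down or loss above threshold
}

// latestPerBucket keeps only the most recently ingested row for each bucket,
// so a corrected re-write supersedes the earlier version.
func latestPerBucket(rows []Row) map[int64]Row {
	latest := make(map[int64]Row, len(rows))
	for _, r := range rows {
		cur, seen := latest[r.BucketUnix]
		if !seen || r.IngestedUnix > cur.IngestedUnix {
			latest[r.BucketUnix] = r
		}
	}
	return latest
}

// countBad returns bad and total counts over distinct buckets.
func countBad(rows []Row) (bad, total int) {
	for _, r := range latestPerBucket(rows) {
		total++
		if r.Bad {
			bad++
		}
	}
	return bad, total
}

func main() {
	rows := []Row{
		{BucketUnix: 300, IngestedUnix: 310, Bad: true},  // first write, bad
		{BucketUnix: 300, IngestedUnix: 900, Bad: false}, // late correction, clean
		{BucketUnix: 600, IngestedUnix: 610, Bad: false},
	}
	bad, total := countBad(rows)
	fmt.Printf("bad=%d/total=%d\n", bad, total)
}
```

Counting raw rows here would report one bad row out of three and block recovery, while the deduped view sees two distinct buckets, both clean.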