Skip to content

Validation improvements#66

Merged
almaslennikov merged 9 commits into
mainfrom
validation-improvements
May 24, 2026
Merged

Validation improvements#66
almaslennikov merged 9 commits into
mainfrom
validation-improvements

Conversation

@almaslennikov
Copy link
Copy Markdown
Collaborator

No description provided.

almaslennikov and others added 9 commits May 23, 2026 19:13
Two related additions on the validate side:

(1) pkg/presetmatch/ — new package that wraps presets.LoadPreset +
presets.ValidatePreset into a single MatchGroup / MatchAll call.
Result carries Status (match / deviation / not-found / skipped),
the matched preset name, and the deviation slice. Discovery now
delegates its preset lookup through this package (LoadPreset +
ApplyPreset stay in pkg/presets — only the lookup-vs-spec logic
moved). l8k validate calls MatchAll over the cluster-config.yaml's
groups at validate-time so cluster drift is re-evaluated even when
the user didn't re-run discover.

Preset deviations stay informational — they don't gate the exit
code. They surface as a sub-table in the HTML report's "Node
groups" section and as a "Preset match" block in text output.

(2) PASS/FAIL banner at the top of both text output and the HTML
report. computeOverallVerdict folds every gating check (Helm
release version, component versions, manifest verdict, connectivity
matrix) into an OverallVerdict{Pass, Reasons, Notes}. Pass=false
when any gate fails — same condition that drives the exit code.
Notes carry informational items that don't gate (preset deviations,
in-progress manifests when --wait wasn't used, matrix soft-skip);
those don't downgrade PASS but the operator sees them.

Banner format:

  Text:  a wide ═-bordered box at the top of the report with
         a bulleted list of failure reasons (Reasons) and an
         ⓘ-prefixed list of notes (Notes). Green/red text via
         the existing color helpers.
  HTML:  full-width div at the top of <main>, color-coded
         green (.verdict-pass) or red (.verdict-fail), with
         the same Reasons / Notes lists rendered as <ul>.

exitWithReport (the early-error wrapper) also emits the banner +
report, with a pessimistic FAIL verdict carrying the error message
as the only Reason. Operators on an early failure get the report
file AND the in-terminal banner explaining what went wrong.

Golden fixture rewritten to populate Verdict.Pass=false +
PresetMatches with one StatusDeviation entry so both the banner
and the new preset sub-table are exercised on every regression run.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Offline-safe pieces only — data model, parsers, planner, unit tests.
The RunMatrix orchestrator wiring + per-test SPDY exec runs land in
a follow-up once the test cluster is back. Anything we commit here
is exercised by the existing test suite without needing kubectl.

Data model:
- PingTestKind enum extended: RDMAPingSameRail / RDMAPingCrossRail
  (rping) and RDMABwSameRail / RDMABwCrossRail (ib_write_bw). Kind
  predicates (IsICMP, IsRDMAPing, IsRDMABw, IsCrossRail) so the
  renderer + orchestrator can branch by family without hard-coding.
- PingTest gained SrcRDMADev / DstRDMADev — populated for RDMA test
  kinds; empty for ICMP.
- PingResult gained BandwidthGbps + MsgRateMpps — populated by
  parseIbWriteBwOutput. PacketLoss stays at -1 ("n/a") for RDMA tests.
- TestPod gained RDMADevsByRail — map[rail]→<mlx5_X> populated by
  DiscoverRDMADevices.

New rdma.go:
- DiscoverRDMADevices(ctx, pod, ifaceByRail) — single shell exec per
  pod that reads /sys/class/net/<iface>/device/infiniband/ for each
  multus secondary network interface. Returns map[rail]→<rdmaDev>.
- RunRPing — spawns `rping -s -a <ip> -p 9999` as a background
  process on the server pod, sleeps the settle window, runs `rping
  -c -a <ip> -p 9999 -C N` on the client pod, then kills the
  server. rping returns 0 only on a clean handshake — binary
  pass/fail is all we surface (no useful numeric output).
- RunIbWriteBw — same lifecycle pattern. Uses `-d <dev> -R -s 65536
  --report_gbits` on both sides (matches the example in
  network-operator-docs/quick-start/sriov-network-rdma.rst).
  parseIbWriteBwOutput reads the single summary row, surfaces peak
  Gbps + MsgRate.
- killRDMAServer — best-effort PID + pkill -f cleanup with a
  fresh background context so cleanup runs even when the test
  context already deadlined.

New planner entry point:
- RDMAPlan(pods, *plan) extends an existing ICMP plan with rping
  + ib_write_bw test variants for the same pod set. Pairs without
  RDMA devices on both sides (DiscoverRDMADevices couldn't resolve
  one of them) are silently dropped — no `-d <dev>` to pass.
  Server/client direction per the user-chosen "both directions like
  ping" shape: every ordered pod pair gets its own pairing.

Tests:
- TestParseIbWriteBwOutput — drives the regex against a realistic
  multi-line client output (header banner, address block, summary
  row, trailing banner), plus malformed / no-summary / sub-Gbps
  cases.
- TestRDMAPlan — covers the same-rail count, cross-rail canary,
  missing-device-skip, and ICMP-skip-preserved cases. Includes the
  subtle asymmetry: when one pod lacks the rail-1 device, the
  cross-rail canary still emits for (other-pod.rail-0 →
  one-pod.rail-1) but not the reverse.
- TestPingTestKind_Predicates — guards the new IsX() helpers
  against future Kind enum changes.

What's still missing (cluster work):
- RunMatrix needs to call DiscoverRDMADevices, RDMAPlan, then
  iterate RDMASameRail / RDMABwSameRail / cross variants
  sequentially (per "Spawn fresh server per test" decision).
- text_report.go + report.html.tmpl need per-kind matrix grids so
  the bandwidth numbers actually surface in the report.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
The fabric probe was rejecting any port that wasn't ACTIVE (and for
IB additionally requiring sm_lid != 0). That broke discovery on
freshly provisioned clusters where the switch isn't plugged in yet
— `l8k generate` then errored with "no fabric resolved" even on
hardware whose configured fabric was obvious from sysfs.

Simplified rule: just read
`/sys/class/infiniband/<dev>/ports/<port>/link_layer` and normalise.
"Ethernet" → Ethernet, "InfiniBand" → InfiniBand, anything else (or
unreadable file) → no verdict. The /sys-then-/host/sys fallback path
from the earlier commit (cf1835b) is preserved.

Dropped:
- state/phys_state reads (we no longer care about runtime link state)
- sm_lid read + smLidIsNonZero helper (IB subnet manager presence
  was only used to corroborate an ACTIVE+IB state)
- The "first-path clean read but unconfirmed → don't retry"
  short-circuit (no clean-but-unconfirmed case exists any more —
  either link_layer is recognised or it isn't)
- strconv import (only smLidIsNonZero used it)

Tradeoff: an operator who reflashes a card to a different fabric
without re-running discover would carry the old link_layer in
cluster-config.yaml until the next discover. Acceptable in
exchange for the "discover before the cluster is wired up" win —
that scenario was the dominant source of "no fabric resolved"
errors per recent user reports.

Tests rewritten for the new semantics: previously-asserted
"DOWN port → no verdict" now asserts "DOWN port still resolves
by link_layer". The Ethernet / InfiniBand / empty / unrecognised /
whitespace-trimming cases continue to hold.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
The post-apply RV gate (commit c998fa2) was hanging the deploy
state machine on SriovNetworkNodePolicy:

    SriovNetworkNodePolicy/ethernet-sriov-rail-0 in network-operator [3/6]:
      waiting for controller to observe new spec (3m11s)

even though `kubectl get sriovnetworknodestate -n network-operator`
showed both nodes at `Succeeded`. Root cause: the SR-IOV operator
treats SriovNetworkNodePolicy as **config-only** — it never writes
status back onto the policy CR. The operator's reconcile output
lives on per-node SriovNetworkNodeState objects, which the
validator already reads. The policy's resourceVersion stays pinned
at the apply-time value forever, and the gate sits there waiting
for a write that will never land.

Same trap applies to NicInterfaceNameTemplate /
NicConfigurationTemplate — their validators read companion
NicDevice CRs, not the template itself.

Fix: gate only the Kinds whose validators read .status from the
CR itself. New crstate.NeedsObservationGate(gvk) returns true for:

  - mellanox.com/v1alpha1 NicClusterPolicy + NicNodePolicy
    (read .status.state on the policy)
  - mellanox.com/v1alpha1 HostDeviceNetwork / IPoIBNetwork /
    MacvlanNetwork (same status-string validator)
  - spectrumx.nvidia.com/v1alpha2 SpectrumXRailPoolConfig
    (reads .status.syncStatus on the CR)

Returns false for everything else (SR-IOV node policies,
nic-configuration-operator templates, existence-only Kinds like
IPPool / SriovNetwork). Those Kinds' validators carry their own
staleness handling — the SR-IOV soft-progress bucketing (commit
aa48c78) covers "Succeeded but numVfs not at target yet" without
needing an RV gate on the policy itself.

applyAndWait and the phase-3 batched-apply loop both consult the
helper before setting awaitObservationAfterRV, so the gate is a
no-op for Kinds that would deadlock on it.

Tests: a 12-case TestNeedsObservationGate covers each gated Kind,
each companion-CR Kind, the existence-only Kinds, and the wrong-
group / wrong-version negatives.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Three pieces of the rping + ib_write_bw work that don't need the
cluster to be reachable, finished offline so cluster iteration only
verifies real exec output.

1. RunMatrix orchestrator: after the ICMP stage builds its
   PingResults, build the set of (srcPod, dstPod, rail) triples that
   PASSED ICMP. Call RDMAPlan; iterate rping then ib_write_bw
   sequentially (per the "Spawn fresh server per test" lifecycle
   decision — concurrent runs would fight over the listener port).
   Skip same-rail RDMA tests on pairs that failed ICMP; cross-rail
   canaries always run (they probe a different path than ICMP).
   buildTestPod now also returns the rail→multus-interface mapping
   so DiscoverRDMADevices can populate TestPod.RDMADevsByRail —
   each pod gets a single in-pod shell exec reading
   /sys/class/net/<iface>/device/infiniband/ per multus iface.

2. Text renderer (RenderMatrixText): bucket results by (rail × kind
   family) rather than just by rail. One grid per (rail, family);
   families are ICMP / rping / ib_write_bw. Cells:
     - ICMP        — "✓ 0% 0.1ms" / "✗ 100%" / "✗ ERR"
     - rping       — "✓" / "✗"
     - ib_write_bw — "✓ 194.4 Gbps" / "✗ ERR"
   Cross-rail canary list is also per-family. Section headers
   include the family title so the layout is "Rail rail-0 — ICMP
   ping:", "Rail rail-0 — RDMA ping (rping):", etc.

3. HTML report template + helpers: same per-family layout. New
   crossRailSection type so the template iterates one cross-rail
   list per family. cellText template helper now takes a family
   arg; cellClass stays family-agnostic (pass/fail/missing).
   matrixByRail + matrixCrossRail rewritten to bucket by family
   and emit a stable ordering (ICMP → rping → ib_write_bw within
   each rail).

Tests:
- TestCellFor extended with rping ✓/✗ and ib_write_bw bandwidth
  cases (including the defensive "OK=true, BandwidthGbps=0 → ERR"
  branch).
- TestKindFamilyOf locks in the PingTestKind → kindFamily mapping.
- TestRenderMatrixText updated for the new section header format.
- Golden HTML regenerated.

What's still cluster-bound: a real run of `l8k validate
--connectivity` against the user's cluster to verify the
ib_write_bw output regex matches the actual perftest version
deployed (sandbox network policy is currently blocking the
cluster API server; user needs to allow the IP via `sbx policy
allow network -g <ip>` before I can do the verification run).

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
…t-host

The old mellanox/rping-test:latest image ships rping + ib_write_bw +
ibv_devices, but not iputils — so any ICMP exec inside the pod fails
with `exec: "ping": executable file not found in $PATH`. With the
ICMP stage now gone from the connectivity matrix this no longer
matters for the matrix proper, but the DOCA full-rt-host image is a
more representative test fixture: it ships the same RDMA tools plus
a full RoCE toolchain (rdma-core, perftest, iputils), so operators
who shell into the pods for manual debugging have a richer
environment.

Spectrum-X profiles (spectrum-x, spectrum-x-ra2.1) additionally need
an explicit `command: ["/bin/bash", "-c", "sleep infinity"]` because
the DOCA image's default entrypoint exits immediately, unlike
mellanox/rping-test's. The other profiles already had this command
and were unaffected.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
version + platform-topology checks

Three coordinated changes to the `l8k validate` UX. Each section
below is independent, but they share enough surface area in
validate.go + the connectivity package that they're bundled into
one commit.

1. RDMA-only matrix (drop ICMP)

   The ICMP stage in the matrix existed as a gating signal for the
   subsequent RDMA stages — but ICMP working tells you nothing
   about whether the RDMA-CM path is wired up correctly (different
   netdev → device mapping, GID table state, etc.). Running rping
   + ib_write_bw directly is both more diagnostic and the only
   thing operators actually care about for an east-west deployment.

   - pkg/networkoperatorplugin/connectivity/ping.go deleted; RunPing
     gone, pingSummaryRe / pingRTTRe gone.
   - PingResult moved to result.go and shrunk: PacketLoss + RTTAvgMs
     dropped, ib_write_bw fields (BandwidthGbps, MsgRateMpps)
     stay.
   - PingTestKind collapsed from 6 → 4 values (RDMAPingSameRail,
     RDMAPingCrossRail, RDMABwSameRail, RDMABwCrossRail). IsICMP /
     PingSameRail / PingCrossRail removed.
   - Plan() now emits RDMA tests directly (no more RDMAPlan as a
     second pass). MatrixPlan exposes only the four RDMA slices.
   - RunMatrix runs two stages sequentially per the "spawn fresh
     server per test" lifecycle — no more concurrent ICMP fan-out,
     no more passedSameRail ICMP gate. Same-rail RDMA tests now
     run unconditionally; cross-rail canaries continue to always
     run.
   - Renderers (text + HTML) collapse to two families (rping →
     ib_write_bw) and drop the ICMP grid + the Summary.ExecErrors
     count.
   - Options.PingCount + Options.MaxConcurrentPings removed;
     `--ping-count` flag deleted from validate. `--connectivity-
     timeout` help text updated to mention rping + ib_write_bw.

2. Report file + flag renames

   - <deployment-files>/verify-report.html → <deployment-files>/
     k8s-launch-kit-validation-report.html (override via
     --report-path still works).
   - `l8k deploy --verify` → `l8k deploy --validate` (deployVerify
     → deployValidate, error string "deploy succeeded but --verify
     failed" → "… --validate failed").
   - testdata/verify-report.golden.html → testdata/k8s-launch-kit-
     validation-report.golden.html.
   - template.New("verify-report") → template.New("validation-
     report") (internal; doesn't surface anywhere).
   - README updated for both renames.

3. Non-blocking version + topology checks; "Platform topology
   validation" UX rename

   Previously version-mismatch and component-mismatch short-
   circuited the validate flow before connectivity ran. A stale
   Helm release or hardware drift from the catalog topology
   doesn't actually mean the data plane can't be tested — the
   cluster is still up, the operator wants to know whether RDMA
   works. The flow now runs the matrix even on a mismatch and
   only fails the final exit code once everything has run.

   - validate.go switch on verdict.HasError || verdict.HasMissing
     keeps short-circuit semantics (manifest-broken). VersionOK
     and componentMismatch no longer block the matrix; final
     exit-code check unifies all gating signals (matrix Failed,
     version mismatch, component mismatch, platform-topology
     deviation) and exits ExitDeployment on any.
   - computeOverallVerdict treats platform-topology deviation as a
     fail reason (was: informational note), so PASS/FAIL banner
     reflects topology drift.
   - User-facing "preset" terminology renamed end-to-end:
       Section heading: "Preset match (cluster groups vs. topology
       preset catalog)" → "Platform topology validation"
       Status labels: MATCH / DEVIATION / NO PRESET / SKIPPED →
       MATCH / MISMATCH / NOT CERTIFIED / SKIPPED
       Per-row detail: "preset 'X' matches" → "matches certified
       topology for X server type"
       Banner reason: "The detected platform topology does not
       match the certified topology for <server-type> server
       type: <field deltas>"
   - Internal package names (presetmatch, struct fields like
     PresetMatches / PresetDeviation / --for <preset> / `l8k
     preset list,update`) intentionally unchanged — those are
     code/admin surfaces, not validation output.

Tests + golden file updated; go build ./... and go test ./...
green across all 12 packages.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
The flag was renamed from --verify to --validate in the prior commit,
but "--validate" overloads the `l8k validate` subcommand name and
sets the wrong expectation — `l8k deploy --validate` does NOT run
the full validate pipeline (Helm release check, manifest state
classification, platform-topology check), it only chains the
connectivity matrix. --test-connectivity names what the flag
actually does and leaves "validate" to mean the subcommand.

No behavior change; deployValidate → deployTestConnectivity in the
flag var, help text and error string updated accordingly.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
…cted PF tables

The previous one-table layout tried to cram the diff inline ("1021
certified topology: 1018"), with unexpected/missing rows getting
strikethrough + tag chrome. The result was visually noisy and hard
to scan — operators couldn't tell what the certified topology
actually said for a given PCI without parsing badges.

This commit splits each PF group's rendering into two parallel
sub-tables:

  East-west PFs — Actual (3 certified topology expects 4)
    PCI            Device ID  Rail  Netdev  ...
    0000:08:00.0   1021       0     eth_r0  ...    ← tinted (deviceID drift)
    0000:08:00.1   1023       1     eth_r1  ...
    0000:08:00.2   1023       2     eth_r2  ...    ← tinted (not in cert. topo)

  East-west PFs — Expected by certified topology (3)
    PCI            Device ID  Rail  Netdev  ...
    0000:08:00.0   1023       0     eth_r0  ...    ← tinted (drift)
    0000:08:00.1   1023       1     eth_r1  ...
    0000:08:00.3   1023       3     eth_r3  ...    ← tinted (cluster lacks it)

Rows that diverge from their counterpart in the other table are
tinted amber with a faint amber top/bottom border in both tables.
The reader scans down each side and the divergences line up.

Implementation:

- PFInfo: dropped ExpectedDeviceID + Unexpected + Missing fields,
  replaced by a single Mismatched bool. NodeGroupInfo gained
  ExpectedEastWestPFs + ExpectedNorthSouthPFs slices carrying the
  certified topology's PFs.
- presetmatch.Result gained Preset *presets.Topology (and a
  Manufacturer string), populated from the loaded preset on a
  match. validate.go uses Result.Preset to project preset PFs into
  the Expected sub-tables and to flag Mismatched rows in each.
- platformLabel prepends Manufacturer when set, so the banner
  reads "Test-Manufacturer-Standard-PC-(Q35-+-ICH9,-2009)-Test-GPU"
  rather than "Standard-PC-(Q35-+-ICH9,-2009)/Test-GPU".
- report.html.tmpl renders the four sub-tables (Actual+Expected
  for east-west and north-south) and uses a single pf-mismatched
  row class for the tint. The pf-tag / pf-missing / pf-unexpected
  / strikethrough / italic styles + the mismatch-inline cell text
  are gone — the row tint is the only diff signal.
- The fallback path (no live preset, e.g. validate run against
  stored cluster-config deviations) still flags Actual rows whose
  PCI appears in any stored deviation, but the Expected tables
  stay empty in that case.

Test fixture updated: one drift row + one extra row + one missing
row exercise both tables and the Mismatched flagging on each.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
@almaslennikov almaslennikov merged commit b5f17fc into main May 24, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant