
deploy: install Network Operator helm chart in-process with preflight drift checks #69

Merged
almaslennikov merged 1 commit into main from helm-chart-install on May 24, 2026


@almaslennikov
Collaborator

Folds the manual helm install nvidia/network-operator prerequisite into l8k deploy so users no longer have to run helm by hand. Adds a preflight phase that detects and (with --overwrite-existing) remediates cluster drift across four signals shared between deploy gating and validate reporting.

What

  • l8k generate renders a per-profile values.yaml next to the CR manifests (new 00-values.yaml templates in each profile). Chart version, repo URL, operator and component image repositories all come from the embedded release catalog.
  • l8k deploy Phase 0 installs/upgrades the network-operator chart via the Helm Go SDK (helm.sh/helm/v3). No shell-out, no helm repo add side-effects — fetches the chart tgz into a temp dir per run.
  • Phase 0.5 runs the new preflight package: chart-version diff, user-values diff, NicClusterPolicy/NicNodePolicy component-version diff, and stray-CR detection (Network Operator-managed Kinds in the operator namespace that l8k did not render — excludes service CRs like NicDevice/SriovNetworkNodeState/SriovOperatorConfig). Without --overwrite-existing: fail fast listing every mismatch. With it: helm upgrade the chart, delete strays, let Phase 1 SSA overwrite component versions.
  • l8k validate consumes the same preflight checks and renders them into the HTML report's "Network Operator release" section. These are soft failures: they contribute to the verdict but never gate execution.

How

  • New pkg/networkoperatorplugin/preflight/ is one source of truth for the four checks. Both deploy (gating) and validate (read-only) consume the same Result shape.
  • New pkg/networkoperatorplugin/helmclient/ extracts the action.Configuration wiring so deploy's Phase 0 and the preflight helm checks share one kube-client setup.
  • New pkg/cmd/userconfig.go centralises cluster-config.yaml discovery + catalog application; generate, deploy, and validate now use the same lookup chain (--user-config > ./cluster-config.yaml > <deployment-files>/../cluster-config.yaml > <deployment-files>/cluster-config.yaml > ./l8k-config.yaml > <share-dir>/l8k-config.yaml). Standalone l8k deploy previously didn't load the user config at all, which made every preflight check soft-skip.
  • Release catalog adds helmRepoURL and operatorRepository per release line. The operator binary image lives at nvcr.io/nvidia/cloud-native for stable releases vs. components at nvcr.io/nvidia/mellanox; staging keeps both under nvcr.io/nvstaging/mellanox.
  • --overwrite-existing flag on both l8k deploy and l8k generate --deploy.
  • Stuck-release detection: helm refuses every operation when the release is in pending-install/pending-upgrade/pending-rollback (typically a crashed previous run). We catch this upfront and point at helm rollback/helm uninstall instead of letting helm's mid-flight error string surface.
  • sriov-enabled profiles set sriov-network-operator.sriovOperatorConfig.deploy: true — the upstream subchart defaults to false, which leaves the "default" SriovOperatorConfig CR missing and stalls every SriovNetworkNodePolicy reconcile.
  • Klog + client-go HTTP-299 warning handler now routed through controller-runtime's logger at V(2): silent at info, visible at --log-level=debug. Stops I0524 ... warnings.go:110] noise from leaking to stderr.
  • Structured errors carry actionable messages plus a Suggestion: line printed by exitWithError. No more "refusing" phrasing; no double-wrapping of structured errors.

Docs

  • Per the repo's documentation discipline: CLAUDE.md gains "Helm chart install (Phase 0)" section; README.md, l8k-config.yaml sample, and the deploy/generate/config skills updated.

Tests

  • 23 new tests in pkg/networkoperatorplugin/preflight/ covering each check + the runner + Remediate.
  • 3 new tests in pkg/networkoperatorplugin/ for the chart-version gate and the stuck-release detection (in-memory helm storage driver, no real cluster needed).
  • Existing ApplyOptionsToConfig tests extended for the new HelmRepoURL / OperatorRepository fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
almaslennikov merged commit bc63a20 into main on May 24, 2026
3 checks passed