deploy: install Network Operator helm chart in-process with preflight drift checks #69
Merged
Folds the manual `helm install nvidia/network-operator` prerequisite into
`l8k deploy` so users no longer have to run helm by hand. Adds a preflight
phase that detects and (with --overwrite-existing) remediates cluster drift
across four signals shared between deploy gating and validate reporting.
What
* `l8k generate` renders a per-profile `values.yaml` next to the CR
manifests (new `00-values.yaml` templates in each profile). Chart
version, repo URL, operator and component image repositories all
come from the embedded release catalog.
* `l8k deploy` Phase 0 installs/upgrades the network-operator chart
via the Helm Go SDK (`helm.sh/helm/v3`). No shell-out, no
`helm repo add` side-effects — fetches the chart tgz into a temp
dir per run.
* Phase 0.5 runs the new `preflight` package: chart-version diff,
user-values diff, NicClusterPolicy/NicNodePolicy component-version
diff, and stray-CR detection (Network Operator-managed Kinds in
the operator namespace that l8k did not render — excludes service
CRs like NicDevice/SriovNetworkNodeState/SriovOperatorConfig).
Without --overwrite-existing: fail fast listing every mismatch.
With it: helm upgrade the chart, delete strays, let Phase 1 SSA
overwrite component versions.
* `l8k validate` consumes the same preflight checks and renders them
into the HTML report's "Network Operator release" section. Failures
are soft: they contribute to the verdict but never gate execution.
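As a rough illustration of how one set of preflight findings can serve both deploy gating and validate reporting, here is a minimal, dependency-free sketch. All type and function names (`Object`, `Mismatch`, `Result`, `findStrays`, `gate`, `verdict`) and the check label `stray-cr` are assumptions made for the example, not the actual `preflight` package API; only the three excluded service Kinds come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Object identifies a custom resource by Kind and name (hypothetical type).
type Object struct {
	Kind, Name string
}

// Mismatch is one drift finding; Result collects them (hypothetical shapes).
type Mismatch struct {
	Check  string // e.g. "chart-version", "user-values", "stray-cr"
	Detail string
}

type Result struct {
	Mismatches []Mismatch
}

// serviceKinds lists the service CRs that the description says are excluded.
var serviceKinds = map[string]bool{
	"NicDevice":             true,
	"SriovNetworkNodeState": true,
	"SriovOperatorConfig":   true,
}

// findStrays returns cluster objects that were not rendered, skipping
// service kinds. Input order is preserved.
func findStrays(inCluster, rendered []Object) []Object {
	want := make(map[Object]bool, len(rendered))
	for _, o := range rendered {
		want[o] = true
	}
	var strays []Object
	for _, o := range inCluster {
		if !serviceKinds[o.Kind] && !want[o] {
			strays = append(strays, o)
		}
	}
	return strays
}

// gate is the deploy-side view: any mismatch blocks unless overwrite is set.
// The error lists every mismatch rather than stopping at the first one.
func gate(r Result, overwrite bool) error {
	if len(r.Mismatches) == 0 || overwrite {
		return nil
	}
	lines := make([]string, 0, len(r.Mismatches))
	for _, m := range r.Mismatches {
		lines = append(lines, "  - "+m.Check+": "+m.Detail)
	}
	return errors.New("cluster drift detected:\n" + strings.Join(lines, "\n"))
}

// verdict is the validate-side view: it reports but never blocks.
func verdict(r Result) string {
	if len(r.Mismatches) == 0 {
		return "pass"
	}
	return fmt.Sprintf("soft-fail (%d mismatches)", len(r.Mismatches))
}

func main() {
	cluster := []Object{
		{Kind: "NicClusterPolicy", Name: "nic-cluster-policy"},
		{Kind: "SriovNetworkNodePolicy", Name: "old-policy"},
		{Kind: "NicDevice", Name: "node1-nic0"},
	}
	rendered := []Object{{Kind: "NicClusterPolicy", Name: "nic-cluster-policy"}}

	var res Result
	for _, s := range findStrays(cluster, rendered) {
		res.Mismatches = append(res.Mismatches,
			Mismatch{Check: "stray-cr", Detail: s.Kind + "/" + s.Name})
	}
	fmt.Println(verdict(res))     // validate: report only
	fmt.Println(gate(res, false)) // deploy without --overwrite-existing
	fmt.Println(gate(res, true))  // deploy with --overwrite-existing
}
```

In this sketch the overwrite path simply lets deploy proceed; the real flow described above additionally upgrades the chart and deletes the strays before Phase 1.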
How
* New `pkg/networkoperatorplugin/preflight/` is one source of truth
for the four checks. Both deploy (gating) and validate (read-only)
consume the same `Result` shape.
* New `pkg/networkoperatorplugin/helmclient/` extracts the
action.Configuration wiring so deploy's Phase 0 and the preflight
helm checks share one kube-client setup.
* New `pkg/cmd/userconfig.go` centralises cluster-config.yaml
discovery + catalog application; generate, deploy, and validate
now use the same lookup chain (`--user-config` >
./cluster-config.yaml > <deployment-files>/../cluster-config.yaml
> <deployment-files>/cluster-config.yaml > ./l8k-config.yaml >
<share-dir>/l8k-config.yaml). Standalone `l8k deploy` previously
didn't load the user config at all, which made every preflight
check soft-skip.
* Release catalog adds `helmRepoURL` and `operatorRepository` per
release line. The operator binary image lives at
nvcr.io/nvidia/cloud-native for stable releases vs. components at
nvcr.io/nvidia/mellanox; staging keeps both under
nvcr.io/nvstaging/mellanox.
* `--overwrite-existing` flag on both `l8k deploy` and
`l8k generate --deploy`.
* Stuck-release detection: helm refuses every operation when the
release is in pending-install/pending-upgrade/pending-rollback
(typically a crashed previous run). We catch this upfront and
point at `helm rollback`/`helm uninstall` instead of letting
helm's mid-flight error string surface.
* sriov-enabled profiles set `sriov-network-operator.sriovOperatorConfig.deploy: true`
— the upstream subchart defaults to false, which leaves the
"default" SriovOperatorConfig CR missing and stalls every
SriovNetworkNodePolicy reconcile.
* klog and client-go HTTP-299 warning-handler output is now routed
through controller-runtime's logger at V(2): silent at info, visible
at --log-level=debug. This stops `I0524 ... warnings.go:110]` noise
from leaking to stderr.
* Structured errors carry actionable messages plus a
`Suggestion:` line printed by exitWithError. No more `refusing`
phrasing; no double-wrapping of structured errors.
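The config lookup chain listed above can be pictured as an ordered candidate list where the first existing file wins. The following is a minimal sketch under stated assumptions: the function names `candidates` and `firstExisting` are hypothetical, and the description does not say whether a missing `--user-config` path errors or falls through (this sketch falls through).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidates returns the lookup chain in priority order. Empty inputs
// (no flag, no deployment-files dir, no share dir) are skipped.
func candidates(flagPath, deployDir, shareDir string) []string {
	var out []string
	if flagPath != "" {
		out = append(out, flagPath) // --user-config
	}
	out = append(out, "cluster-config.yaml") // ./cluster-config.yaml
	if deployDir != "" {
		out = append(out,
			filepath.Join(deployDir, "..", "cluster-config.yaml"),
			filepath.Join(deployDir, "cluster-config.yaml"))
	}
	out = append(out, "l8k-config.yaml") // ./l8k-config.yaml
	if shareDir != "" {
		out = append(out, filepath.Join(shareDir, "l8k-config.yaml"))
	}
	return out
}

// firstExisting returns the first path for which exists reports true.
// The predicate is injected so the chain can be tested without a filesystem.
func firstExisting(paths []string, exists func(string) bool) (string, bool) {
	for _, p := range paths {
		if exists(p) {
			return p, true
		}
	}
	return "", false
}

// fileExists reports whether p is an existing non-directory file.
func fileExists(p string) bool {
	info, err := os.Stat(p)
	return err == nil && !info.IsDir()
}

func main() {
	// The directory names below are placeholders for illustration only.
	chain := candidates("", "out/deploy", "/usr/share/l8k")
	if p, ok := firstExisting(chain, fileExists); ok {
		fmt.Println("using config:", p)
	} else {
		fmt.Println("no config found; tried:", chain)
	}
}
```

Sharing one such function between generate, deploy, and validate is what keeps the three commands from resolving different files for the same working directory.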
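The stuck-release check and the `Suggestion:` line fit together naturally: the check can return an error that already carries its own remediation hint. Below is a minimal sketch, assuming a hypothetical `StructuredError` type and helper names (`checkNotStuck`, `annotate`, `render`); the release name in `main` is a placeholder. Helm's Go SDK exposes these states as `release.Status` constants, but plain strings are used here to keep the example dependency-free.

```go
package main

import (
	"errors"
	"fmt"
)

// StructuredError pairs a message with a remediation hint (hypothetical type).
type StructuredError struct {
	Message    string
	Suggestion string
}

func (e *StructuredError) Error() string { return e.Message }

// pendingStatuses are the three Helm release states named above.
var pendingStatuses = map[string]bool{
	"pending-install":  true,
	"pending-upgrade":  true,
	"pending-rollback": true,
}

// checkNotStuck returns a StructuredError when the release is in a
// pending state, and nil otherwise.
func checkNotStuck(release, status string) error {
	if !pendingStatuses[status] {
		return nil
	}
	return &StructuredError{
		Message:    fmt.Sprintf("helm release %q is stuck in %s", release, status),
		Suggestion: "run `helm rollback` or `helm uninstall` for this release, then retry",
	}
}

// annotate adds a suggestion to a plain error but passes an existing
// StructuredError through untouched, so it is never wrapped twice.
func annotate(err error, suggestion string) error {
	var se *StructuredError
	if err == nil || errors.As(err, &se) {
		return err
	}
	return &StructuredError{Message: err.Error(), Suggestion: suggestion}
}

// render produces the text an exit handler could print.
func render(err error) string {
	var se *StructuredError
	if errors.As(err, &se) && se.Suggestion != "" {
		return "Error: " + err.Error() + "\nSuggestion: " + se.Suggestion
	}
	return "Error: " + err.Error()
}

func main() {
	err := checkNotStuck("network-operator", "pending-upgrade")
	// annotate leaves err unchanged because it is already structured.
	fmt.Println(render(annotate(err, "check cluster connectivity")))
}
```

Running the check before any install or upgrade call means the user sees the hint up front rather than Helm's own mid-operation error.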
Docs
* Per the repo's documentation discipline: CLAUDE.md gains
"Helm chart install (Phase 0)" section; README.md, l8k-config.yaml
sample, and the deploy/generate/config skills updated.
Tests
* 23 new tests in pkg/networkoperatorplugin/preflight/ covering each
check + the runner + Remediate.
* 3 new tests in pkg/networkoperatorplugin/ for the chart-version
gate and the stuck-release detection (in-memory helm storage
driver, no real cluster needed).
* Existing ApplyOptionsToConfig tests extended for the new
HelmRepoURL / OperatorRepository fields.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>