feat: day2 check for metax#35
Conversation
izturn
commented
May 27, 2026
- day2 check for metax
- refactor bridge
- preflight test passed
There was a problem hiding this comment.
Code Review
This pull request introduces several enhancements and robustness fixes across the codebase. Key changes include upgrading the mx-smi image to v0.2 with IB verbs support, migrating the day2 check schedule from an integer hour to a string-based time format ("HH:MM"), integrating a Kubernetes client to skip day2 checks on nodes without MetaX GPU capacity, and wrapping node taint updates with conflict retries. Additionally, the event bridge now utilizes a rate-limiting workqueue to handle event forwarding asynchronously, and the preflight slow node aggregator has been refactored to reject odd workload sizes and lift the maximum batch limit. The review feedback highlights a potential deadlock in the event bridge during shutdown, a missing timeout on a Kubernetes API call, and a Daylight Saving Time (DST) scheduling bug when advancing the day2 check time.
| return false, fmt.Errorf("node name is empty") | ||
| } | ||
|
|
||
| node, err := d.client.CoreV1().Nodes().Get(context.Background(), d.config.NodeName, metav1.GetOptions{}) |
There was a problem hiding this comment.
Using context.Background() without a timeout for Kubernetes API calls can block indefinitely if the API server is unresponsive or if there are network issues. It is highly recommended to use a context with a reasonable timeout (e.g., 10 seconds) to ensure the background check goroutine does not hang.
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
node, err := d.client.CoreV1().Nodes().Get(ctx, d.config.NodeName, metav1.GetOptions{})There was a problem hiding this comment.
Pull request overview
This PR extends the MetaX node detector with a “day2” capability gate (based on node GPU capacity) and a configurable daily check time, refactors the Kubernetes event bridge to avoid dropping events by introducing a workqueue + update handling, and updates preflight slow-node parsing/aggregation to better handle skipped batches (plus various build/release plumbing updates).
Changes:
- MetaX detector: add K8s-capacity-based enablement, configurable
day2CheckTimeparsing, and wire K8s client through the node detector factory. - Event bridge: add update handler and replace “drop when channel full” behavior with a typed rate-limiting workqueue plus draining on shutdown.
- Preflight aggregator: refactor slow-node detection, treat
skipbatches as abnormal signals, and expand batch planning logic; add/extend tests and examples.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| README.md | Bumps referenced mx-smi image tag to v0.2 and updates manual build examples. |
| pkg/preflight/slow_node_aggregator.go | Refactors slow-node detection, adds skip-batch handling, changes workload plan rules, adds more logging/helpers. |
| pkg/preflight/slow_node_aggregator_test.go | Adds examples/tests for skip reports and odd workload-size rejection; updates batch planning in helpers. |
| pkg/kube/kube.go | Makes node tainting resilient to update conflicts via RetryOnConflict. |
| pkg/kube/kube_test.go | Adds conflict-retry unit test for node tainting. |
| pkg/events/kubeevent_test.go | Adds/updates tests for day2 reason preservation, queueing behavior, update folding, and shutdown semantics. |
| pkg/events/kubeevent_bridge.go | Introduces typed workqueue + update handler; changes forwarding and shutdown behavior. |
| pkg/detector/node/node.go | Wires a K8s client into MetaX detector construction. |
| pkg/detector/node/node_test.go | Updates tests for new detector factory signature and MetaX client requirement. |
| pkg/detector/node/metax/metax.go | Adds MetaX day2 capability check via Node capacity, new schedule parsing (HH:MM), and more check logging. |
| pkg/detector/node/metax/metax_test.go | Adds tests for schedule parsing and capability gating behavior. |
| manifests/kcover/values.yaml | Pins agent/controller image tags to v0.0.8 instead of latest. |
| manifests/kcover/Chart.yaml | Updates chart/app versions to 0.0.8. |
| Makefile | Splits controller/agent push platforms, updates mx-smi image version, and adjusts build commands. |
| docker/mx-smi.Dockerfile | Installs ibverbs-related libs and includes ibv_devinfo in the extracted image. |
| docker/agent.Dockerfile | Installs ibverbs libs and copies ibv_devinfo into the agent image; bumps default mx-smi image tag. |
| cmd/kcover/main.go | Reorders shutdown to stop bridge before recovery controller. |
| cmd/agent/main.go | Passes the K8s client into the node detector factory. |
| cmd/agent/config/load_test.go | Updates config tests to validate Day2CheckTime defaults/repairs. |
| cmd/agent/config/config.go | Replaces Day2CheckHour with Day2CheckTime and adds validation. |
| .github/workflows/ci-release.yml | Pins agent build to linux/amd64 and updates mx-smi image build arg to v0.2. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.