Skip to content

feat: day2 check for metax#35

Open
izturn wants to merge 6 commits into
BaizeAI:mainfrom
izturn:feat/metax-day2
Open

feat: day2 check for metax#35
izturn wants to merge 6 commits into
BaizeAI:mainfrom
izturn:feat/metax-day2

Conversation

@izturn
Copy link
Copy Markdown
Contributor

@izturn izturn commented May 27, 2026

  1. day2 check for metax
  2. refactor bridge
  3. preflight test passed

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several enhancements and robustness fixes across the codebase. Key changes include upgrading the mx-smi image to v0.2 with IB verbs support, migrating the day2 check schedule from an integer hour to a string-based time format ("HH:MM"), integrating a Kubernetes client to skip day2 checks on nodes without MetaX GPU capacity, and wrapping node taint updates with conflict retries. Additionally, the event bridge now utilizes a rate-limiting workqueue to handle event forwarding asynchronously, and the preflight slow node aggregator has been refactored to reject odd workload sizes and lift the maximum batch limit. The review feedback highlights a potential deadlock in the event bridge during shutdown, a missing timeout on a Kubernetes API call, and a Daylight Saving Time (DST) scheduling bug when advancing the day2 check time.

Comment thread pkg/events/kubeevent_bridge.go
return false, fmt.Errorf("node name is empty")
}

node, err := d.client.CoreV1().Nodes().Get(context.Background(), d.config.NodeName, metav1.GetOptions{})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using context.Background() without a timeout for Kubernetes API calls can block indefinitely if the API server is unresponsive or if there are network issues. It is highly recommended to use a context with a reasonable timeout (e.g., 10 seconds) to ensure the background check goroutine does not hang.

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	node, err := d.client.CoreV1().Nodes().Get(ctx, d.config.NodeName, metav1.GetOptions{})

Comment thread pkg/detector/node/metax/metax.go
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the MetaX node detector with a “day2” capability gate (based on node GPU capacity) and a configurable daily check time, refactors the Kubernetes event bridge to avoid dropping events by introducing a workqueue + update handling, and updates preflight slow-node parsing/aggregation to better handle skipped batches (plus various build/release plumbing updates).

Changes:

  • MetaX detector: add K8s-capacity-based enablement, configurable day2CheckTime parsing, and wire K8s client through the node detector factory.
  • Event bridge: add update handler and replace “drop when channel full” behavior with a typed rate-limiting workqueue plus draining on shutdown.
  • Preflight aggregator: refactor slow-node detection, treat skip batches as abnormal signals, and expand batch planning logic; add/extend tests and examples.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
README.md Bumps referenced mx-smi image tag to v0.2 and updates manual build examples.
pkg/preflight/slow_node_aggregator.go Refactors slow-node detection, adds skip-batch handling, changes workload plan rules, adds more logging/helpers.
pkg/preflight/slow_node_aggregator_test.go Adds examples/tests for skip reports and odd workload-size rejection; updates batch planning in helpers.
pkg/kube/kube.go Makes node tainting resilient to update conflicts via RetryOnConflict.
pkg/kube/kube_test.go Adds conflict-retry unit test for node tainting.
pkg/events/kubeevent_test.go Adds/updates tests for day2 reason preservation, queueing behavior, update folding, and shutdown semantics.
pkg/events/kubeevent_bridge.go Introduces typed workqueue + update handler; changes forwarding and shutdown behavior.
pkg/detector/node/node.go Wires a K8s client into MetaX detector construction.
pkg/detector/node/node_test.go Updates tests for new detector factory signature and MetaX client requirement.
pkg/detector/node/metax/metax.go Adds MetaX day2 capability check via Node capacity, new schedule parsing (HH:MM), and more check logging.
pkg/detector/node/metax/metax_test.go Adds tests for schedule parsing and capability gating behavior.
manifests/kcover/values.yaml Pins agent/controller image tags to v0.0.8 instead of latest.
manifests/kcover/Chart.yaml Updates chart/app versions to 0.0.8.
Makefile Splits controller/agent push platforms, updates mx-smi image version, and adjusts build commands.
docker/mx-smi.Dockerfile Installs ibverbs-related libs and includes ibv_devinfo in the extracted image.
docker/agent.Dockerfile Installs ibverbs libs and copies ibv_devinfo into the agent image; bumps default mx-smi image tag.
cmd/kcover/main.go Reorders shutdown to stop bridge before recovery controller.
cmd/agent/main.go Passes the K8s client into the node detector factory.
cmd/agent/config/load_test.go Updates config tests to validate Day2CheckTime defaults/repairs.
cmd/agent/config/config.go Replaces Day2CheckHour with Day2CheckTime and adds validation.
.github/workflows/ci-release.yml Pins agent build to linux/amd64 and updates mx-smi image build arg to v0.2.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/detector/node/metax/metax.go
Comment thread pkg/events/kubeevent_bridge.go
Comment thread pkg/events/kubeevent_bridge.go
Comment thread pkg/preflight/slow_node_aggregator.go
Comment thread pkg/preflight/slow_node_aggregator.go
Comment thread manifests/kcover/Chart.yaml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants