OCPBUGS-85237: Manually uncordoned nodes are not automatically re-cordoned by djoshy · Pull Request #6028 · openshift/machine-config-operator

djoshy · 2026-05-12T17:20:10Z

- What I did

Updated handleNodeEvent filter to queue nodes that have a change in cordon states and not just the drain annotations
syncNode will now ensure that the node is still cordoned at the end of a drain, so that we do not apply the completion annotation for a node that was externally uncordoned.
Added a new argument in the drain helper init, this is to bring parity with behavior before the rebase. See bug for additional details .
Test changes for the new behavior and some additional logging.

- How to verify it
Bug description has reproduction steps, but very simply, apply an MC change to a pool, and uncordon the node while it is being drain via oc:

$ oc adm uncordon NODE_NAME

The MCC won't immediately cordon the node as it is currently being processed in the drain queue. Once the drain completes, it will check if the node was uncordoned for some reason and skip applying the completion annotation:

I0512 14:15:03.051712       1 drain_controller.go:382] node djoshy-dev-104-x5n4x-worker-c-dpbxz: externally uncordoned during drain, skipping completion annotation

OTOH, if the drain fails due to a timeout/PDVB, we'd revert to our old fix path where the node is automatically queued for another drain event.

I0512 14:17:32.937230       1 drain_controller.go:192] node djoshy-dev-104-x5n4x-master-2: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-767df47bfb-tlxhd" in namespace "openshift-apiserver" to terminate: context deadline exceeded, error when waiting for pod "metrics-server-f64bd79d4-lbwjj" in namespace "openshift-monitoring" to terminate: context deadline exceeded]

In both cases, the node will be cordoned again if it is not:

I0512 14:17:32.985942       1 drain_controller.go:417] Previous node drain found. Drain has been going on for 0.025946378425833333 hours
I0512 14:17:32.986014       1 drain_controller.go:436] External actor has unexpectedly uncordoned node djoshy-dev-104-x5n4x-master-2, cordoning again...

Summary by CodeRabbit

Bug Fixes
- Fixed drain completion detection to prevent false completion signals when nodes are externally uncordoned during the drain process.
- Enhanced drain trigger detection to monitor both drainer status and node schedulability changes.
- Improved retry handling for eviction errors during drain operations.
Tests
- Added test coverage for external uncordoning scenarios during drain operations.

openshift-merge-bot · 2026-05-12T17:20:13Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-05-12T17:20:17Z

@djoshy: This pull request references Jira Issue OCPBUGS-85237, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

- What I did

Updated handleNodeEvent filter to queue nodes that have a change in cordon states and not just the drain annotations

syncNode will now ensure that the node is still cordoned at the end of a drain, so that we do not apply the completion annotation for a node that was externally uncordoned.

Added a new argument in the drain helper init, this is to bring parity with behavior before the rebase. See bug for additional details .

Test changes for the new behavior and some additional logging.

- How to verify it
Bug description has reproduction steps, but very simply, apply an MC change to a pool, and uncordon the node while it is being drain via oc:
$ oc adm uncordon NODE_NAME
The MCC won't immediately cordon the node as it is currently being processed in the drain queue. Once the drain completes, it will check if the node was uncordoned for some reason and skip applying the completion annotation:
I0512 14:15:03.051712       1 drain_controller.go:382] node djoshy-dev-104-x5n4x-worker-c-dpbxz: externally uncordoned during drain, skipping completion annotation
OTOH, if the drain fails due to a timeout/PDVB, we'd revert to our old fix path where the node is automatically queued for another drain event.
I0512 14:17:32.937230       1 drain_controller.go:192] node djoshy-dev-104-x5n4x-master-2: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-767df47bfb-tlxhd" in namespace "openshift-apiserver" to terminate: context deadline exceeded, error when waiting for pod "metrics-server-f64bd79d4-lbwjj" in namespace "openshift-monitoring" to terminate: context deadline exceeded]
In both cases, the node will be cordoned again if it is not:
I0512 14:17:32.985942       1 drain_controller.go:417] Previous node drain found. Drain has been going on for 0.025946378425833333 hours
I0512 14:17:32.986014       1 drain_controller.go:436] External actor has unexpectedly uncordoned node djoshy-dev-104-x5n4x-master-2, cordoning again...

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-12T17:20:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1259ccdf-3b07-4cb6-a79d-6cac97122bd0

📥 Commits

Reviewing files that changed from the base of the PR and between 99cb8a4 and 25d9c6d.

📒 Files selected for processing (2)

pkg/controller/drain/drain_controller.go
pkg/controller/drain/drain_controller_test.go

Walkthrough

The drain controller now detects and handles the case where a node is externally uncordoned (Spec.Unschedulable transitions to false) during the drain process. After draining, it re-fetches the node and skips writing the completion annotation if it discovers the node was uncordoned externally, preventing false drain-completion signaling.

Changes

Drain External Uncordon Protection

Layer / File(s)	Summary
Event enqueue triggering and desiredVerb computation `pkg/controller/drain/drain_controller.go` (lines 209–215, 296–312, 330–332)	`handleNodeEvent` now enqueues when either the desired drainer annotation or `Spec.Unschedulable` changes, enabling detection of external uncordon events. `syncNode` computes `desiredVerb` earlier in the flow and adds `EvictErrorRetryDelay: 5 * time.Second` to the drain helper configuration.
Drain completion annotation conditional logic `pkg/controller/drain/drain_controller.go` (lines 372–384)	After a successful `drainNode` call, `syncNode` re-fetches the Node and conditionally skips writing `LastAppliedDrainerAnnotationKey` if the node is found to be schedulable (`Spec.Unschedulable == false`), indicating external uncordon; otherwise normal completion annotation update proceeds.
Test harness and external uncordon simulation `pkg/controller/drain/drain_controller_test.go` (lines 14–14, 75–75, 120–120, 135–135, 145–197)	`createTestController` now returns the node informer as an additional value. New test helpers (`updateNodeInIndexer`, `setupControllerAndSyncWithIndexerUpdate`, `setupControllerAndSyncWithExternalUncordon`) mirror cordon patches into the informer and simulate external node uncordon during drain execution.
Externally uncordoned drain test case `pkg/controller/drain/drain_controller_test.go` (lines 249–277)	The "drain requested" subtest uses the indexer-updating reactor, and a new "externally uncordoned during drain" subtest verifies that when a node becomes schedulable before re-fetch, only the cordon patch is written and no completion annotation is applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly references OCPBUGS-85237 and clearly summarizes the main change: ensuring manually uncordoned nodes are automatically re-cordoned during drain operations.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in the modified test file are stable and deterministic. No dynamic values (timestamps, UUIDs, node names, random suffixes) are embedded in test names.
Test Structure And Quality	✅ Passed	Tests use standard Go testing, not Ginkgo. Meet quality standards: single responsibility, meaningful assertion messages, consistent codebase patterns, proper fake client usage.
Microshift Test Compatibility	✅ Passed	No Ginkgo e2e tests added. PR contains only unit tests using standard Go testing in pkg/controller/drain, not e2e tests. Check not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new Ginkgo e2e tests added. PR modifies only a unit test file (pkg/controller/drain/drain_controller_test.go) using standard Go testing package, not Ginkgo. Check not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies drain controller logic only. No scheduling constraints, affinity rules, topology constraints, or replica-count logic affecting workload distribution were introduced.
Ote Binary Stdout Contract	✅ Passed	PR modifies controller library code and standard Go unit tests, not OTE test binaries. No process-level stdout writes detected. Check does not apply.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This PR does not add Ginkgo e2e tests. It modifies controller logic and adds standard Go unit tests (TestSyncNode with t.Run subtests). No Ginkgo patterns (It, Describe, Context, When) found.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-05-12T17:20:51Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [djoshy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-05-12T17:28:05Z

@djoshy: This pull request references Jira Issue OCPBUGS-85237, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

- What I did

Updated handleNodeEvent filter to queue nodes that have a change in cordon states and not just the drain annotations

syncNode will now ensure that the node is still cordoned at the end of a drain, so that we do not apply the completion annotation for a node that was externally uncordoned.

Added a new argument in the drain helper init, this is to bring parity with behavior before the rebase. See bug for additional details .

Test changes for the new behavior and some additional logging.

- How to verify it
Bug description has reproduction steps, but very simply, apply an MC change to a pool, and uncordon the node while it is being drain via oc:
$ oc adm uncordon NODE_NAME
The MCC won't immediately cordon the node as it is currently being processed in the drain queue. Once the drain completes, it will check if the node was uncordoned for some reason and skip applying the completion annotation:
I0512 14:15:03.051712       1 drain_controller.go:382] node djoshy-dev-104-x5n4x-worker-c-dpbxz: externally uncordoned during drain, skipping completion annotation
OTOH, if the drain fails due to a timeout/PDVB, we'd revert to our old fix path where the node is automatically queued for another drain event.
I0512 14:17:32.937230       1 drain_controller.go:192] node djoshy-dev-104-x5n4x-master-2: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-767df47bfb-tlxhd" in namespace "openshift-apiserver" to terminate: context deadline exceeded, error when waiting for pod "metrics-server-f64bd79d4-lbwjj" in namespace "openshift-monitoring" to terminate: context deadline exceeded]
In both cases, the node will be cordoned again if it is not:
I0512 14:17:32.985942       1 drain_controller.go:417] Previous node drain found. Drain has been going on for 0.025946378425833333 hours
I0512 14:17:32.986014       1 drain_controller.go:436] External actor has unexpectedly uncordoned node djoshy-dev-104-x5n4x-master-2, cordoning again...
Summary by CodeRabbit

Bug Fixes

Fixed drain completion detection to prevent false completion signals when nodes are externally uncordoned during the drain process.

Enhanced drain trigger detection to monitor both drainer status and node schedulability changes.

Improved retry handling for eviction errors during drain operations.

Tests

Added test coverage for external uncordoning scenarios during drain operations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

djoshy · 2026-05-12T17:28:37Z

/cherry-pick release-4.22

openshift-cherrypick-robot · 2026-05-12T17:28:40Z

@djoshy: once the present PR merges, I will cherry-pick it on top of release-4.22 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-05-12T17:40:05Z

@djoshy: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

drain: fix re-cordon detection

25d9c6d

openshift-ci Bot requested review from pablintino and yuqi-zhang May 12, 2026 17:20

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2026

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCPBUGS-85237: Manually uncordoned nodes are not automatically re-cordoned#6028

OCPBUGS-85237: Manually uncordoned nodes are not automatically re-cordoned#6028
djoshy wants to merge 1 commit into
openshift:mainfrom
djoshy:fix-drain

djoshy commented May 12, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-merge-bot Bot commented May 12, 2026

Uh oh!

openshift-ci-robot commented May 12, 2026

Uh oh!

coderabbitai Bot commented May 12, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented May 12, 2026

Uh oh!

openshift-ci-robot commented May 12, 2026

Summary by CodeRabbit

Uh oh!

djoshy commented May 12, 2026

Uh oh!

openshift-cherrypick-robot commented May 12, 2026

Uh oh!

openshift-ci Bot commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

djoshy commented May 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented May 12, 2026

Uh oh!

openshift-ci-robot commented May 12, 2026

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented May 12, 2026

Uh oh!

openshift-ci-robot commented May 12, 2026

Summary by CodeRabbit

Uh oh!

djoshy commented May 12, 2026

Uh oh!

openshift-cherrypick-robot commented May 12, 2026

Uh oh!

openshift-ci Bot commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

djoshy commented May 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading