Skip to content

OCPBUGS-85237: Manually uncordoned nodes are not automatically re-cordoned#6028

Open
djoshy wants to merge 1 commit into
openshift:mainfrom
djoshy:fix-drain
Open

OCPBUGS-85237: Manually uncordoned nodes are not automatically re-cordoned#6028
djoshy wants to merge 1 commit into
openshift:mainfrom
djoshy:fix-drain

Conversation

@djoshy
Copy link
Copy Markdown
Contributor

@djoshy djoshy commented May 12, 2026

- What I did

  • Updated handleNodeEvent filter to queue nodes that have a change in cordon states and not just the drain annotations
  • syncNode will now ensure that the node is still cordoned at the end of a drain, so that we do not apply the completion annotation for a node that was externally uncordoned.
  • Added a new argument in the drain helper init, this is to bring parity with behavior before the rebase. See bug for additional details .
  • Test changes for the new behavior and some additional logging.

- How to verify it
Bug description has reproduction steps, but very simply, apply an MC change to a pool, and uncordon the node while it is being drain via oc:

$ oc adm uncordon NODE_NAME

The MCC won't immediately cordon the node as it is currently being processed in the drain queue. Once the drain completes, it will check if the node was uncordoned for some reason and skip applying the completion annotation:

I0512 14:15:03.051712       1 drain_controller.go:382] node djoshy-dev-104-x5n4x-worker-c-dpbxz: externally uncordoned during drain, skipping completion annotation

OTOH, if the drain fails due to a timeout/PDVB, we'd revert to our old fix path where the node is automatically queued for another drain event.

I0512 14:17:32.937230       1 drain_controller.go:192] node djoshy-dev-104-x5n4x-master-2: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-767df47bfb-tlxhd" in namespace "openshift-apiserver" to terminate: context deadline exceeded, error when waiting for pod "metrics-server-f64bd79d4-lbwjj" in namespace "openshift-monitoring" to terminate: context deadline exceeded]

In both cases, the node will be cordoned again if it is not:

I0512 14:17:32.985942       1 drain_controller.go:417] Previous node drain found. Drain has been going on for 0.025946378425833333 hours
I0512 14:17:32.986014       1 drain_controller.go:436] External actor has unexpectedly uncordoned node djoshy-dev-104-x5n4x-master-2, cordoning again...

Summary by CodeRabbit

  • Bug Fixes

    • Fixed drain completion detection to prevent false completion signals when nodes are externally uncordoned during the drain process.
    • Enhanced drain trigger detection to monitor both drainer status and node schedulability changes.
    • Improved retry handling for eviction errors during drain operations.
  • Tests

    • Added test coverage for external uncordoning scenarios during drain operations.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 12, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-85237, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

- What I did

  • Updated handleNodeEvent filter to queue nodes that have a change in cordon states and not just the drain annotations
  • syncNode will now ensure that the node is still cordoned at the end of a drain, so that we do not apply the completion annotation for a node that was externally uncordoned.
  • Added a new argument in the drain helper init, this is to bring parity with behavior before the rebase. See bug for additional details .
  • Test changes for the new behavior and some additional logging.

- How to verify it
Bug description has reproduction steps, but very simply, apply an MC change to a pool, and uncordon the node while it is being drain via oc:

$ oc adm uncordon NODE_NAME

The MCC won't immediately cordon the node as it is currently being processed in the drain queue. Once the drain completes, it will check if the node was uncordoned for some reason and skip applying the completion annotation:

I0512 14:15:03.051712       1 drain_controller.go:382] node djoshy-dev-104-x5n4x-worker-c-dpbxz: externally uncordoned during drain, skipping completion annotation

OTOH, if the drain fails due to a timeout/PDVB, we'd revert to our old fix path where the node is automatically queued for another drain event.

I0512 14:17:32.937230       1 drain_controller.go:192] node djoshy-dev-104-x5n4x-master-2: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-767df47bfb-tlxhd" in namespace "openshift-apiserver" to terminate: context deadline exceeded, error when waiting for pod "metrics-server-f64bd79d4-lbwjj" in namespace "openshift-monitoring" to terminate: context deadline exceeded]

In both cases, the node will be cordoned again if it is not:

I0512 14:17:32.985942       1 drain_controller.go:417] Previous node drain found. Drain has been going on for 0.025946378425833333 hours
I0512 14:17:32.986014       1 drain_controller.go:436] External actor has unexpectedly uncordoned node djoshy-dev-104-x5n4x-master-2, cordoning again...

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1259ccdf-3b07-4cb6-a79d-6cac97122bd0

📥 Commits

Reviewing files that changed from the base of the PR and between 99cb8a4 and 25d9c6d.

📒 Files selected for processing (2)
  • pkg/controller/drain/drain_controller.go
  • pkg/controller/drain/drain_controller_test.go

Walkthrough

The drain controller now detects and handles the case where a node is externally uncordoned (Spec.Unschedulable transitions to false) during the drain process. After draining, it re-fetches the node and skips writing the completion annotation if it discovers the node was uncordoned externally, preventing false drain-completion signaling.

Changes

Drain External Uncordon Protection

Layer / File(s) Summary
Event enqueue triggering and desiredVerb computation
pkg/controller/drain/drain_controller.go (lines 209–215, 296–312, 330–332)
handleNodeEvent now enqueues when either the desired drainer annotation or Spec.Unschedulable changes, enabling detection of external uncordon events. syncNode computes desiredVerb earlier in the flow and adds EvictErrorRetryDelay: 5 * time.Second to the drain helper configuration.
Drain completion annotation conditional logic
pkg/controller/drain/drain_controller.go (lines 372–384)
After a successful drainNode call, syncNode re-fetches the Node and conditionally skips writing LastAppliedDrainerAnnotationKey if the node is found to be schedulable (Spec.Unschedulable == false), indicating external uncordon; otherwise normal completion annotation update proceeds.
Test harness and external uncordon simulation
pkg/controller/drain/drain_controller_test.go (lines 14–14, 75–75, 120–120, 135–135, 145–197)
createTestController now returns the node informer as an additional value. New test helpers (updateNodeInIndexer, setupControllerAndSyncWithIndexerUpdate, setupControllerAndSyncWithExternalUncordon) mirror cordon patches into the informer and simulate external node uncordon during drain execution.
Externally uncordoned drain test case
pkg/controller/drain/drain_controller_test.go (lines 249–277)
The "drain requested" subtest uses the indexer-updating reactor, and a new "externally uncordoned during drain" subtest verifies that when a node becomes schedulable before re-fetch, only the cordon patch is written and no completion annotation is applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly references OCPBUGS-85237 and clearly summarizes the main change: ensuring manually uncordoned nodes are automatically re-cordoned during drain operations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the modified test file are stable and deterministic. No dynamic values (timestamps, UUIDs, node names, random suffixes) are embedded in test names.
Test Structure And Quality ✅ Passed Tests use standard Go testing, not Ginkgo. Meet quality standards: single responsibility, meaningful assertion messages, consistent codebase patterns, proper fake client usage.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR contains only unit tests using standard Go testing in pkg/controller/drain, not e2e tests. Check not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests added. PR modifies only a unit test file (pkg/controller/drain/drain_controller_test.go) using standard Go testing package, not Ginkgo. Check not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies drain controller logic only. No scheduling constraints, affinity rules, topology constraints, or replica-count logic affecting workload distribution were introduced.
Ote Binary Stdout Contract ✅ Passed PR modifies controller library code and standard Go unit tests, not OTE test binaries. No process-level stdout writes detected. Check does not apply.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR does not add Ginkgo e2e tests. It modifies controller logic and adds standard Go unit tests (TestSyncNode with t.Run subtests). No Ginkgo patterns (It, Describe, Context, When) found.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from pablintino and yuqi-zhang May 12, 2026 17:20
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 12, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-85237, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

- What I did

  • Updated handleNodeEvent filter to queue nodes that have a change in cordon states and not just the drain annotations
  • syncNode will now ensure that the node is still cordoned at the end of a drain, so that we do not apply the completion annotation for a node that was externally uncordoned.
  • Added a new argument in the drain helper init, this is to bring parity with behavior before the rebase. See bug for additional details .
  • Test changes for the new behavior and some additional logging.

- How to verify it
Bug description has reproduction steps, but very simply, apply an MC change to a pool, and uncordon the node while it is being drain via oc:

$ oc adm uncordon NODE_NAME

The MCC won't immediately cordon the node as it is currently being processed in the drain queue. Once the drain completes, it will check if the node was uncordoned for some reason and skip applying the completion annotation:

I0512 14:15:03.051712       1 drain_controller.go:382] node djoshy-dev-104-x5n4x-worker-c-dpbxz: externally uncordoned during drain, skipping completion annotation

OTOH, if the drain fails due to a timeout/PDVB, we'd revert to our old fix path where the node is automatically queued for another drain event.

I0512 14:17:32.937230       1 drain_controller.go:192] node djoshy-dev-104-x5n4x-master-2: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-767df47bfb-tlxhd" in namespace "openshift-apiserver" to terminate: context deadline exceeded, error when waiting for pod "metrics-server-f64bd79d4-lbwjj" in namespace "openshift-monitoring" to terminate: context deadline exceeded]

In both cases, the node will be cordoned again if it is not:

I0512 14:17:32.985942       1 drain_controller.go:417] Previous node drain found. Drain has been going on for 0.025946378425833333 hours
I0512 14:17:32.986014       1 drain_controller.go:436] External actor has unexpectedly uncordoned node djoshy-dev-104-x5n4x-master-2, cordoning again...

Summary by CodeRabbit

  • Bug Fixes

  • Fixed drain completion detection to prevent false completion signals when nodes are externally uncordoned during the drain process.

  • Enhanced drain trigger detection to monitor both drainer status and node schedulability changes.

  • Improved retry handling for eviction errors during drain operations.

  • Tests

  • Added test coverage for external uncordoning scenarios during drain operations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@djoshy
Copy link
Copy Markdown
Contributor Author

djoshy commented May 12, 2026

/cherry-pick release-4.22

@openshift-cherrypick-robot
Copy link
Copy Markdown

@djoshy: once the present PR merges, I will cherry-pick it on top of release-4.22 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 12, 2026

@djoshy: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants