Skip to content

[bugfix]: Revert usage of preStop hook, clean the OCI files internally#106

Merged
enp0s3 merged 4 commits into
openshift-virtualization:mainfrom
enp0s3:fix-prestop-hook
May 11, 2026
Merged

[bugfix]: Revert usage of preStop hook, clean the OCI files internally#106
enp0s3 merged 4 commits into
openshift-virtualization:mainfrom
enp0s3:fix-prestop-hook

Conversation

@enp0s3
Copy link
Copy Markdown
Member

@enp0s3 enp0s3 commented May 10, 2026

Signed-off-by: Igor Bezukh ibezukh@redhat.com

The preStop hook could race with newly created wasp-agent pod, this can happen in the following cases:

  • Manual deletion of a wasp-agent pod
  • Frequent re-deployment of the wasp-agent daemonset
    In both cases the terminating pod and the newly created pod are running on the same node.

This PR suggests that every wasp-agent pod will manage its own OCI files and their cleanup. Before copying the OCI files the application will attach a pod name to the filenames.

In addition some improvements were introduces in order to make test to run the unit tests.

enp0s3 added 4 commits May 8, 2026 16:45
…iton

Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
The removal will take place in SIGTERM handler.

The pod configured with 5 seconds of graceful termination so there
should be enough time to remove the files.

In order to fix the race condition with another wasp pod creation
each hook file will be assigned with pod name suffix. So that each
wasp pod will be responsible to clean his own hook files.

the only scenario where the cleanup won't work is the force
pod delete, since in this case SIGKILL will be used. But this
scenario wasn't covered with preStop as well. This is a known
limitation that can be resolved by rolling out the node itself
since the OCI hook files are stored at ephemeral directories.

Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
the unit tests will run inside the build container

as part of make test the script also run ginkgo bootstrap on any
new packages. This will create a new suite_test.go file

Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign enp0s3 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enp0s3
Copy link
Copy Markdown
Member Author

enp0s3 commented May 10, 2026

/cc @Barakmor1

@openshift-ci openshift-ci Bot requested a review from Barakmor1 May 10, 2026 08:53
Copy link
Copy Markdown
Member

@Barakmor1 Barakmor1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed privately, we might benefit from cleaning the config and script files during startup as well, in case an old Wasp pod crashed and did not get a chance to clean them properly.

A solution based on NRI instead of OCI would probably be better in the long run for the following reasons:

  • As we saw, the hooks mechanism might introduce a race condition where the hook starts after the container has already finished running (iiuc, this should not happen with NRI).
  • There might be a gap between container startup and when the hook begins operating. I wonder whether this could affect applications that consume a large amount of memory during startup, and whether the hook is sufficient to ensure they have enough memory available.
  • Using NRI might allow us to remove the controller that manipulates the cgroup per container for the limited swap configuration.

However, this fix is necessary to resolve the current bug.

Thank you, and please tag me in the follow-up PR to address the race condition issue.

@openshift-ci openshift-ci Bot added the lgtm label May 11, 2026
@enp0s3 enp0s3 merged commit febd0f8 into openshift-virtualization:main May 11, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants