[bugfix]: Revert usage of preStop hook, clean the OCI files internally #106
…iton Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
The removal will take place in the SIGTERM handler. The pod is configured with 5 seconds of graceful termination, so there should be enough time to remove the files. To fix the race condition with the creation of another wasp pod, each hook file will be given a pod-name suffix, so that each wasp pod is responsible for cleaning up its own hook files. The only scenario where the cleanup won't work is a forced pod delete, since SIGKILL is used in that case. This scenario wasn't covered by preStop either. It is a known limitation that can be resolved by rolling out the node itself, since the OCI hook files are stored in ephemeral directories. Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
The unit tests will run inside the build container as part of make test. The script also runs ginkgo bootstrap on any new packages, which creates a new suite_test.go file. Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
/cc @Barakmor1
Barakmor1 left a comment
As discussed privately, we might benefit from cleaning the config and script files during startup as well, in case an old Wasp pod crashed and did not get a chance to clean them properly.
A solution based on NRI instead of OCI would probably be better in the long run for the following reasons:
- As we saw, the hooks mechanism might introduce a race condition where the hook starts after the container has already finished running (iiuc, this should not happen with NRI).
- There might be a gap between container startup and when the hook begins operating. I wonder whether this could affect applications that consume a large amount of memory during startup, and whether the hook is sufficient to ensure they have enough memory available.
- Using NRI might allow us to remove the controller that manipulates the cgroup per container for the limited swap configuration.
However, this fix is necessary to resolve the current bug.
Thank you, and please tag me in the follow-up PR to address the race condition issue.
Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
The preStop hook could race with a newly created wasp-agent pod. This can happen in the following cases:
In both cases the terminating pod and the newly created pod are running on the same node.
This PR proposes that every wasp-agent pod manage its own OCI files and their cleanup. Before copying the OCI files, the application will append the pod name to the filenames.
In addition, some improvements were introduced so that make test runs the unit tests.