Add opt-in MachineConfig for graceful VM shutdown #172
❌ Generated Files Verification Failed
One or more generated files in this PR are out of sync:
Please regenerate the files locally and commit the changes.
ff6b3f2 to 50b5f18
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the affected files.
This adds an optional MachineConfig that can be enabled by setting the `platform.kubevirt.io/vm-drain-shutdown-inhibitor` annotation. This MC installs the following things:

* a logind configuration file that increases the maximum inhibitor delay
* a Python script that registers a systemd inhibitor, waits for PrepareForShutdown signals, and responds by shutting down any running virt-launcher containers
* a systemd unit that starts the Python script

This is to deal with a niche situation where a node has become degraded and inaccessible from the control plane, but still has VM workloads running on it. Because the processes in a container scope all receive the SIGTERM sent during shutdown, qemu, libvirtd, and virt-launcher can be killed out of order, without virt-launcher having a chance to gracefully shut down the VM. Inhibiting shutdown allows the Python script to access each container and instruct virt-launcher to shut its VM down gracefully ahead of the rest of the node shutdown sequence.

Claude wrote most of the script under my guidance; I rewrote portions and reorganized it. I've run the inhibitor script in a lab cluster and it appears to work properly.

Co-Authored-By: Claude Code Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Lucidi <slucidi@redhat.com>
50b5f18 to 37c867d
cc @fabiand
This is a workaround for https://redhat.atlassian.net/browse/CNV-14216
This adds an optional MachineConfig that can be enabled by setting the `platform.kubevirt.io/vm-drain-shutdown-inhibitor` annotation. This MC installs the following things:

* a logind configuration file that increases the maximum inhibitor delay
* a Python script that registers a systemd inhibitor, waits for PrepareForShutdown signals, and responds by shutting down any running virt-launcher containers
* a systemd unit that starts the Python script

This is to deal with a niche situation where a node has become degraded and inaccessible from the control plane, but still has VM workloads running on it. Because the processes in a container scope all receive the SIGTERM sent during shutdown, qemu, libvirtd, and virt-launcher can be killed out of order, without virt-launcher having a chance to gracefully shut down the VM. Inhibiting shutdown allows the Python script to access each container and instruct virt-launcher to shut its VM down gracefully ahead of the rest of the node shutdown sequence.
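For readers unfamiliar with logind delay inhibitors, the general pattern described above can be sketched roughly as follows. This is an illustrative outline only, not the script shipped in this PR: the names `make_shutdown_handler`, `stop_vms`, and `vm-shutdown-sketch` are hypothetical, the VM shutdown step is a placeholder, and it assumes dbus-python and PyGObject are available on the node. A delay lock only holds shutdown for up to logind's `InhibitDelayMaxSec` (5 seconds by default), which is presumably why the MC also raises that limit.

```python
import os
from typing import Callable


def make_shutdown_handler(
    stop_vms: Callable[[], None],
    release_lock: Callable[[], None],
) -> Callable[[bool], None]:
    """Build a handler for logind's PrepareForShutdown(bool) signal."""

    def handler(starting: bool) -> None:
        # logind sends True right before shutdown begins; ignore False.
        if not starting:
            return
        try:
            stop_vms()
        finally:
            # Always release the delay lock so shutdown is not held
            # until the InhibitDelayMaxSec timeout expires.
            release_lock()

    return handler


def main() -> None:
    # Imported lazily so the handler logic can be exercised without D-Bus.
    import dbus
    from dbus.mainloop.glib import DBusGMainLoop
    from gi.repository import GLib

    DBusGMainLoop(set_as_default=True)
    bus = dbus.SystemBus()
    manager = dbus.Interface(
        bus.get_object("org.freedesktop.login1", "/org/freedesktop/login1"),
        "org.freedesktop.login1.Manager",
    )

    # Take a "delay" lock on shutdown; logind returns a file descriptor
    # and the lock is held until that descriptor is closed.
    fd = manager.Inhibit(
        "shutdown", "vm-shutdown-sketch", "Stop VMs gracefully", "delay"
    ).take()

    def stop_vms() -> None:
        # Placeholder (assumption): the real script asks each
        # virt-launcher container to shut its VM down gracefully.
        print("stopping VMs...")

    handler = make_shutdown_handler(stop_vms, lambda: os.close(fd))
    manager.connect_to_signal("PrepareForShutdown", handler)
    GLib.MainLoop().run()
```

Calling `main()` on a systemd host would register the lock and block waiting for the signal. Keeping the signal handling in a plain function makes the "stop VMs, then always release the lock" ordering easy to test with stand-in callables.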
Claude wrote most of the Python script under my guidance; I rewrote portions and reorganized it. I've run the inhibitor script in a lab cluster and it appears to work properly. Since the script is base64-encoded in the MachineConfig, I have included it below for reference. It's also available in https://github.com/mansam/scripts.