netvsp: handle eqe 135 and reconfigure vf on release/1.7.2511 #2610

Merged
mattkur merged 2 commits into microsoft:release/1.7.2511 from erfrimod:erfrimod/netvsp-eqe-135-on-release-1.7.2511
Jan 24, 2026

erfrimod commented Jan 5, 2026

Clean cherry-pick of #2576

For netvsp to recover from SoC crash and NMC servicing, it must detect EQE 135 from MANA and reconfigure the Virtual Function.

  • `GDMA_EQE_HWC_RECONFIG_VF` added to the events handled by the GDMA driver
  • `vf_reconfiguration_pending` bool added to the GDMA driver. It is set when the EQE is received and read by the MANA driver when processing all EQ events.
  • `vf_reconfig_sender` added to the MANA driver to signal the Netvsp VF Manager when it sees that 'pending' is true
  • `vf_reconfig_receiver` added to Netvsp `HclNetworkVFManagerWorker` to send a `VfReconfig` message when signaled
  • `VfReconfig` message added to the Netvsp VF Manager, which removes the old VF and then creates a new VF

Smaller changes:

  • `GDMA_GENERATE_TEST_EQE` and `GDMA_GENERATE_RECONFIG_VF_EVENT` added in order to create a unit test, `test_gdma_reconfig_vf()`
  • The logic of `NextWorkItem::ManaDeviceArrived` refactored into `startup_vtl2_device()` so the logic can be shared with `VfReconfig`

Testing:

  • Unit tests pass
  • Tested on a lab machine with SoC MANA privates which allowed EQE 135 to be generated by command. Netvsp is able to see the EQE and VfReconfig is called. Before and after sending the EQE, ping and ntttcp traffic succeed.
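
For readers who want the signal path from the description in one place, here is a minimal sketch of it. It is not the PR's code: `GdmaSketch`, `process_eq_events`, and `vf_manager_poll` are hypothetical names, and `std::sync::mpsc` stands in for the channel the real drivers use. Only the constant name, the value 135, and the flag/sender/receiver names come from the description.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, Sender};

/// EQE type for VF reconfiguration (135, per the PR title).
const GDMA_EQE_HWC_RECONFIG_VF: u8 = 135;

/// Hypothetical stand-in for the GDMA driver; it only tracks the pending flag.
#[derive(Default)]
struct GdmaSketch {
    vf_reconfiguration_pending: AtomicBool,
}

impl GdmaSketch {
    /// Called for each event queue entry; latches the flag on EQE 135.
    /// The sketch never clears the flag, since the description does not say
    /// when (or whether) the real driver does.
    fn handle_eqe(&self, eqe_type: u8) {
        if eqe_type == GDMA_EQE_HWC_RECONFIG_VF {
            self.vf_reconfiguration_pending.store(true, Ordering::SeqCst);
        }
    }
}

/// Hypothetical stand-in for the MANA driver's EQ processing step.
/// Returns true if a reconfig signal was sent to the VF manager.
fn process_eq_events(gdma: &GdmaSketch, vf_reconfig_sender: &Sender<()>) -> bool {
    if gdma.vf_reconfiguration_pending.load(Ordering::SeqCst) {
        vf_reconfig_sender.send(()).is_ok()
    } else {
        false
    }
}

/// Hypothetical stand-in for the VF manager handling `VfReconfig`:
/// on a signal it removes the old VF and then creates a new one.
fn vf_manager_poll(vf_reconfig_receiver: &Receiver<()>, log: &mut Vec<&'static str>) {
    if vf_reconfig_receiver.try_recv().is_ok() {
        log.push("remove old VF");
        log.push("create new VF");
    }
}
```

Feeding any EQE type other than 135 through `handle_eqe` leaves the flag clear, so `process_eq_events` sends nothing and the VF manager does no work.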

@erfrimod erfrimod requested a review from a team as a code owner January 5, 2026 21:58
Copilot AI review requested due to automatic review settings January 5, 2026 21:58
@github-actions github-actions Bot added the release_1.7.2511 Targets the release/1.7.2511 branch. label Jan 5, 2026

Copilot AI left a comment

Pull request overview

This PR adds support for handling EQE 135 (VF reconfiguration event) to enable netvsp recovery from SoC crashes and NMC servicing. The implementation spans three layers: GDMA driver (event detection), MANA driver (event propagation), and Netvsp (VF reconfiguration orchestration).

Key changes:

  • GDMA driver now handles GDMA_EQE_HWC_RECONFIG_VF events and maintains a vf_reconfiguration_pending flag
  • MANA driver propagates VF reconfig events through a mesh channel subscription mechanism
  • Netvsp implements a state machine with exponential backoff retry logic to gracefully restart the VTL2 device after reconfiguration
  • Full save/restore support for the pending flag to handle servicing scenarios
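
The overview mentions exponential backoff for the restart retries without giving parameters. As a rough illustration only, a capped exponential delay can be computed like this; `backoff_delay`, `base`, and `max` are hypothetical, and the actual delays used in `netvsp.rs` are not shown in this conversation.

```rust
use std::time::Duration;

/// Delay before retry `attempt` (0-based): `base * 2^attempt`, capped at `max`.
/// `base` and `max` are illustrative parameters, not values from the PR.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow and checked_mul keep large attempt counts from overflowing;
    // any overflow simply falls back to the cap.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}
```

With a 100 ms base and a 5 s cap, attempts 0 through 3 wait 100, 200, 400, and 800 ms, and later attempts settle at the cap.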

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `vm/devices/net/gdma_defs/src/lib.rs` | Adds EQE 135 constant and test request type for VF reconfiguration |
| `vm/devices/net/gdma/src/hwc.rs` | Implements hardware control handler to generate VF reconfig events for testing |
| `vm/devices/net/mana_driver/src/gdma_driver.rs` | Adds `vf_reconfiguration_pending` flag, event handling, getter method, and test helper |
| `vm/devices/net/mana_driver/src/mana.rs` | Implements subscription mechanism for VF reconfig events with sender/receiver channel |
| `vm/devices/net/mana_driver/src/save_restore.rs` | Extends saved state to include `vf_reconfiguration_pending` flag |
| `vm/devices/net/mana_driver/src/tests.rs` | Adds unit test verifying VF reconfiguration event detection and flag behavior |
| `openhcl/underhill_core/src/emuplat/netvsp.rs` | Implements VF reconfiguration state machine with shutdown, restart, and exponential backoff retry logic; refactors device startup into shared method |

Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs
NextWorkItem::Continue
let exists = Path::new(&device_path).exists();
match (vtl2_device_state, exists) {
(Vtl2DeviceState::Missing, true) => NextWorkItem::ManaDeviceArrived,

Copilot AI Jan 5, 2026

The device_arrival event is only processed when vtl2_device_state is Missing (line 644), but this excludes the Reconfiguring state. If a uevent for device arrival occurs while in the Reconfiguring state (e.g., the device actually becomes available), it will be ignored until the next uevent. This could cause unnecessary retry delays. Consider also handling device arrival during Reconfiguring state to allow faster recovery.

Suggested change
- (Vtl2DeviceState::Missing, true) => NextWorkItem::ManaDeviceArrived,
+ (Vtl2DeviceState::Missing | Vtl2DeviceState::Reconfiguring, true) => {
+     NextWorkItem::ManaDeviceArrived
+ }
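
To see the effect of the suggested or-pattern in isolation, here is a self-contained sketch. The `Present` variant and the catch-all arm are assumptions made to keep the example small; the real match in `netvsp.rs` has arms that are not visible in this thread.

```rust
/// `Missing` and `Reconfiguring` appear in the review; `Present` is assumed.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Vtl2DeviceState {
    Present,
    Missing,
    Reconfiguring,
}

#[derive(Debug, PartialEq)]
enum NextWorkItem {
    Continue,
    ManaDeviceArrived,
}

/// Treats device arrival the same way in both the Missing and
/// Reconfiguring states, as the suggestion proposes.
fn next_work_item(state: Vtl2DeviceState, exists: bool) -> NextWorkItem {
    match (state, exists) {
        (Vtl2DeviceState::Missing | Vtl2DeviceState::Reconfiguring, true) => {
            NextWorkItem::ManaDeviceArrived
        }
        // Simplification: every other combination is collapsed to Continue.
        _ => NextWorkItem::Continue,
    }
}
```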

Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs
Comment on lines +890 to +892
// Don't 'keep alive'. VTL2 is reconfigured when in a bad state.
let keep_vf_alive = false;
self.shutdown_vtl2_device(keep_vf_alive).await;

Copilot AI Jan 5, 2026

After shutdown_vtl2_device is called (line 892), the vf_reconfig_receiver retains the old receiver whose sender was dropped during shutdown. If startup_vtl2_device fails to create a new device during retry attempts, subsequent VF reconfiguration events will be lost until the device successfully restarts. Consider either clearing vf_reconfig_receiver in shutdown_vtl2_device and checking for None in the event loop, or documenting this behavior to make the subtle coupling explicit.
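
One way to make the coupling explicit, sketched under the assumption that the receiver behaves like a `std::sync::mpsc::Receiver` (the real channel type may differ): poll through an `Option` and clear the slot once the sender side is gone. `poll_reconfig` is a hypothetical helper, not part of the PR.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Polls an optional reconfig receiver.
/// Returns Some(true) when a signal arrived, Some(false) when nothing is
/// pending, and None when there is no live subscription. A receiver whose
/// sender was dropped is cleared so a stale handle is never polled again.
fn poll_reconfig(slot: &mut Option<Receiver<()>>) -> Option<bool> {
    let result = match slot.as_ref()?.try_recv() {
        Ok(()) => Some(true),
        Err(TryRecvError::Empty) => Some(false),
        Err(TryRecvError::Disconnected) => None,
    };
    if result.is_none() {
        // The sender is gone (for example, the device was shut down).
        *slot = None;
    }
    result
}
```

A caller that sees `None` knows it must resubscribe after the next successful device start rather than wait on a channel that can never fire.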

Comment on lines +554 to +555
device.start_notification_task(&self.driver_source).await;
self.vf_reconfig_receiver = Some(device.subscribe_vf_reconfig().await);

Copilot AI Jan 5, 2026

There is a potential race condition between subscribing to VF reconfig events and processing pending events. The subscription happens in startup_vtl2_device (line 555) after start_notification_task (line 554). However, if a VF reconfiguration event was already pending in the GDMA driver before the subscription, it could be processed and sent to a non-existent receiver, causing the event to be lost. Consider subscribing to VF reconfig events before starting the notification task, or ensure the pending flag is checked after subscription.

Suggested change
- device.start_notification_task(&self.driver_source).await;
- self.vf_reconfig_receiver = Some(device.subscribe_vf_reconfig().await);
+ self.vf_reconfig_receiver = Some(device.subscribe_vf_reconfig().await);
+ device.start_notification_task(&self.driver_source).await;
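
The ordering concern can be modeled with a toy device, assuming (as the comment does) that a signal raised with no subscriber is dropped. `DeviceSketch` and `notify_pending_reconfig` are hypothetical; only `subscribe_vf_reconfig` is named in the PR, and its real signature is async.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical device model: holds at most one reconfig subscriber.
#[derive(Default)]
struct DeviceSketch {
    subscriber: Option<Sender<()>>,
}

impl DeviceSketch {
    /// Registers a subscriber and hands back the receiving end.
    fn subscribe_vf_reconfig(&mut self) -> Receiver<()> {
        let (tx, rx) = channel();
        self.subscriber = Some(tx);
        rx
    }

    /// Models one pass of the notification task while a reconfig is pending.
    /// Returns whether the signal reached a subscriber.
    fn notify_pending_reconfig(&self) -> bool {
        match &self.subscriber {
            Some(tx) => tx.send(()).is_ok(),
            // Assumption: with no subscriber the signal is simply lost.
            None => false,
        }
    }
}
```

Calling `notify_pending_reconfig` before `subscribe_vf_reconfig` returns false and the later receiver stays empty; subscribing first delivers the signal.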

I concur with this, and also in the other location below. If we are racing with a device reconfig, we could miss this and the device would be broken from the start. This can be a follow up PR in main.

}

pub async fn save(&mut self) -> anyhow::Result<GdmaDriverSavedState> {
if self.hwc_failure {

I would make a follow up PR to remove vf_reconfiguration_pending from save state and just fail here instead. Otherwise we will save/restore the device only to have our first action be to tear down and recreate.

@justus-camp-microsoft I notice when save fails we still leak the device instead of cleaning it up. Is this intentional? This would make my proposal not work as well since the device would be left running in the bad state.
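
A minimal sketch of the proposed follow-up, assuming the existing `hwc_failure` branch already rejects the save (that branch is cut off in the snippet above). The types here are hypothetical and synchronous, with a `String` error in place of `anyhow::Result`, purely to stay self-contained.

```rust
/// Hypothetical saved state; the real `GdmaDriverSavedState` has other fields.
#[derive(Debug, PartialEq)]
struct SavedStateSketch {
    queue_count: u32,
}

/// Hypothetical driver model carrying only what the sketch needs.
struct GdmaDriverSketch {
    hwc_failure: bool,
    vf_reconfiguration_pending: bool,
    queue_count: u32,
}

impl GdmaDriverSketch {
    /// Refuses to save when a VF reconfiguration is pending, instead of
    /// carrying the flag through save/restore.
    fn save(&self) -> Result<SavedStateSketch, String> {
        if self.hwc_failure {
            return Err("cannot save after HWC failure".to_string());
        }
        if self.vf_reconfiguration_pending {
            return Err("cannot save while VF reconfiguration is pending".to_string());
        }
        Ok(SavedStateSketch {
            queue_count: self.queue_count,
        })
    }
}
```

Whether this is workable depends on the question raised above: if a failed save leaves the device running, the caller still has to tear it down.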

mattkur added a commit that referenced this pull request Jan 24, 2026
#2679)

This cherry-picks the following changes, and includes a few minor merge
conflict fix-ups.

* vmm_tests: add hyper-v openhcl pcat tests (#2602)
* petri: backend agnostic additional disk configuration (#2551)
* vmm_tests/underhill_core: allow command line to specify settings /
config timeout, and make it 30s for many devices test (#2619)
* mesh/petri/vmm_tests/vpci: allow env vars when launching mesh process
+ verbose vpci logs (#2567)
* petri: backend agnostic vtl2 settings configuration (#2550)
* petri: allow more time for the vm to be off during reboot (#2533)
* petri: check if the VM is off when waiting for hyper-v events (#2525)

---------

Co-authored-by: Trevor Jones <trevor@thjmedia.net>
@mattkur mattkur requested a review from a team as a code owner January 24, 2026 14:39

@mattkur mattkur left a comment

Approving for 1.7 based on Brian's approval.

@mattkur mattkur enabled auto-merge (squash) January 24, 2026 14:40
@mattkur mattkur merged commit 8f73df3 into microsoft:release/1.7.2511 Jan 24, 2026
76 of 78 checks passed
@erfrimod erfrimod deleted the erfrimod/netvsp-eqe-135-on-release-1.7.2511 branch January 28, 2026 20:17
benhillis pushed a commit to benhillis/openvmm that referenced this pull request Jan 29, 2026
* release/1.7.2511:
  netvsp: handle eqe 135 and reconfigure vf on release/1.7.2511 (microsoft#2610)