Skip to content

net: don't constantly post notifications to clients until they send#2534

Open
hawkw wants to merge 1 commit into
masterfrom
eliza/client-isnt-waiting-to-send
Open

net: don't constantly post notifications to clients until they send#2534
hawkw wants to merge 1 commit into
masterfrom
eliza/client-isnt-waiting-to-send

Conversation

@hawkw
Copy link
Copy Markdown
Member

@hawkw hawkw commented May 24, 2026

Currently, as described in #2532, if the system is connected to a management network, the net task posts notifications to its client tasks somewhat incessantly until the first time they actually send a packet. These notifications are spurious. Ultimately, this behavior was part of the root cause of #2528: due to a couple bugs in hiffy, the spurious notifications could prevent task-hiffy from checking whether a debug probe has attempted to kick it into executing a HIF script when hiffy is compiled with network RPC support. PR #2530 fixed the bugs in hiffy, so it now behaves correctly no matter how many times net notifies it. However, I still wanted to get to the bottom of the spurious notifications.

The details of my adventures debugging this can be found in issue #2532. Essentially, what's going on here is that, when net is woken by IP activity, it calls a function called wake_sockets(), which is what delivers notifications to client tasks. wake_sockets() will notify a client task if either of the following conditions is true:

  1. The socket's RX buffer is non-empty (the function can_recv() in smoltcp returns true)
  2. The socket's TX buffer is not full (the function can_send() in smoltcp returns true) and a flag in an array called client_waiting_to_send is set for that socket's index. This is intended to only notify clients if they have previously attempted a send and the buffer was full, so that they can try again. We don't want to notify clients any time their TX buffer has space, as they may not actually care at that point in time.

Here's where we arrive at the bug. As I described in this comment, the values of client_waiting_to_send is initialized to true for every socket when the net task starts. As the comment from @cbiffle states,

The 'true' here is load-bearing: it ensures that sockets receive
a notification on stack restart.

Notifying every socket when the net task restarts is necessary because a client may have attempted to send a packet, found that the buffer was full, and then gone into sys_recv_notification to wait for send capacity, and then the net task panicked. When the net task starts back up, it doesn't know that client task was waiting for send capacity, and therefore would never wake it. Thus, we must assume that every task is waiting to send when we restart.

However, the current behavior results in spurious notifications because the flag in client_waiting_to_send never gets cleared until that client actually does a send_packet successfully, as I described here. In almost all the management network communication protocols we use today, the SP is never the initiator, so in general, most tasks will not try to send packets on their own in the absence of a request from the control plane or humility. This means that when the net task starts up for the first time, and no tasks were waiting on send capacity from a previous net generation, it will notify them any time it sees IP activity. If the client is not waiting to send, it will generally handle the notification by trying to call recv_packet, and then it gets back Err(QueueEmpty), goes "huh, I guess there's nothing for me", and go back to waiting for a notification. Then, the next time there's any IP activity on the management network, the whole thing will happen again, because the client_waiting_to_send flags never get cleared until the task actually does a send.

This doesn't manifest as much in the product, because the control plane will periodically talk to the SP on both its control-plane-agent socket (for polling sensor data) and its snitch socket (for collecting ereports). The first time the control plane sends the SP a request and the SP sends a packet in response, the client_waiting_to_send flag gets cleared and that task gets out of the spurious notification state pretty quickly. But on a bench Cosmo connected to a Minibar where nothing is actually trying to talk to it over the management network, the net task basically just keeps waking everyone forever. Most tasks are more or less resilient to this, but (as seen with hiffy in #2528), it can trigger weird bugs that are sensitive to spurious notifications.

This commit fixes it in the simplest possible way. Instead of initializing client_waiting_to_send to true, we initialize it to false and only set the flag if the client actually calls send_packet(). To ensure that everyone is woken up when net restarts in case they were waiting in a send_packet() when net panics, we just blast out a notification to everyone immediately upon startup, exactly one time. This doesn't need to check if there's actually space in the TX buffers, because, well, the net task just started, so we know there's nothing in the buffer. I'm pretty sure this should be totally fine: if a client task is waiting for send capacity after all this happens, and it restarts, it will have to go all the way back through whatever code path led to it sending send_packet() and encountering the full buffer again, so client_waiting_in_send will be set by then anyway. But, I'd love confirmation from @cbiffle that this is correct.

I did some testing to validate the fix, which can be read here.

Fixes #2532

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

net task posts spurious notifications to clients until they actually call send_packet()

1 participant