Skip to content

Fix scheduler bugs: heap invariant corruption and lost stop event at dispatch boundary#246

Merged
Marenz merged 4 commits intofrequenz-floss:v0.x.xfrom
Marenz:fix/heap-invariant-and-lost-stop-event
Mar 12, 2026
Merged

Fix scheduler bugs: heap invariant corruption and lost stop event at dispatch boundary#246
Marenz merged 4 commits intofrequenz-floss:v0.x.xfrom
Marenz:fix/heap-invariant-and-lost-stop-event

Conversation

@Marenz
Copy link
Contributor

@Marenz Marenz commented Mar 11, 2026

This PR fixes two scheduler bugs when updating an existing dispatch.

  1. Removing a queued event could corrupt the scheduler heap because the queue was not re-heapified afterward, which could lead to incorrect ordering of future events.
  2. An update exactly at the stop boundary could remove the pending stop event without notifying the actor, causing the stop transition to be missed.

This PR restores the heap after removals, tracks which queued event was removed, and sends the missing stop notification when a boundary update unschedules the pending stop event.

Regression tests and release notes were added.

@github-actions github-actions bot added part:tests Affects the unit, integration and performance (benchmarks) tests part:dispatcher Affects the high-level dispatcher interface part:docs Affects the documentation labels Mar 11, 2026
@Marenz Marenz marked this pull request as ready for review March 11, 2026 10:41
@Marenz Marenz requested a review from a team as a code owner March 11, 2026 10:41
@Marenz Marenz requested review from stefan-brus-frequenz and removed request for a team March 11, 2026 10:41
@Marenz Marenz force-pushed the fix/heap-invariant-and-lost-stop-event branch 2 times, most recently from 9743e85 to f9a160e Compare March 11, 2026 10:59
@Marenz Marenz enabled auto-merge March 11, 2026 11:06
Copy link
Contributor

@llucax llucax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments about structure mainly. Also a few extra minor comments:

  1. No empty line between the last line and signed-off-by in the commit message:

    Fix: call heapify() after pop(idx) to restore the heap invariant.
    Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
    
  2. The release notes update is mixed in the last commit, I guess it makes more sense to put it in its own commit.

  3. The PR description also IMHO focus on the wrong stuff, it didn't help a lot understanding what was not working. Asking AI to create a better description that explains what was exactly the bug, and how it was fixed by the PR, it came up with:

    This PR fixes two scheduler bugs when updating an existing dispatch.

    1. Removing a queued event could corrupt the scheduler heap because the queue was not re-heapified afterward, which could lead to incorrect ordering of future events.
    2. An update exactly at the stop boundary could remove the pending stop event without notifying the actor, causing the stop transition to be missed.

    This PR restores the heap after removals, tracks which queued event was removed, and sends the missing stop notification when a boundary update unschedules the pending stop event.

    Regression tests and release notes were added.

PS: TIL: re-sifted == re-heapified

@llucax
Copy link
Contributor

llucax commented Mar 11, 2026

One more question. Could the boundary issue happen for updates happen at the start boundary instead of the stop boundary?

@Marenz Marenz changed the title Fix dispatch rescheduling around stop boundaries Fix scheduler bugs: heap invariant corruption and lost stop event at dispatch boundary Mar 11, 2026
@Marenz Marenz force-pushed the fix/heap-invariant-and-lost-stop-event branch from f9a160e to 30182fa Compare March 11, 2026 14:00
Marenz added 2 commits March 11, 2026 15:08
list.pop(idx) removes an arbitrary element from the heap without
restoring the heap property. Add heapify() after the removal to fix
the invariant.

Change the return type from bool to QueueItem | None so callers can
inspect the removed item (needed for the follow-up fix).

Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
When a dispatch update arrives at exactly the stop boundary,
_remove_scheduled consumes the pending stop event from the queue.
_update_changed_running_state does not fire because old_dispatch.started
and dispatch.started are both False at this point. The actor would then
remain running indefinitely.

Fix: if _remove_scheduled returned a stop event (priority == 1) and the
dispatch is no longer started, send the running-state change explicitly.

Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
@Marenz Marenz force-pushed the fix/heap-invariant-and-lost-stop-event branch 3 times, most recently from becf4de to aafc196 Compare March 11, 2026 14:38
Marenz added 2 commits March 11, 2026 15:40
Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
Signed-off-by: Mathias L. Baumann <mathias.baumann@frequenz.com>
@Marenz Marenz force-pushed the fix/heap-invariant-and-lost-stop-event branch from aafc196 to f1bb780 Compare March 11, 2026 14:40
@Marenz
Copy link
Contributor Author

Marenz commented Mar 11, 2026

Okay, I think I addressed all feedback now

@Marenz
Copy link
Contributor Author

Marenz commented Mar 11, 2026

One more question. Could the boundary issue happen for updates happen at the start boundary instead of the stop boundary?

The lost-event bug only occurs at the stop boundary, not the start boundary.

At the start boundary, the update path already has a fallback notification path through _update_changed_running_state(). The key point is that started is one of the runtime_state_attributes, and started is derived from the dispatch window and the current time, not stored independently.

That means:

  • just before the start boundary, old_dispatch.started == False
  • once the boundary is crossed, dispatch.started == True

So even if _remove_scheduled() consumed the queued start event first, _update_changed_running_state() would still see the started transition and emit the running-state update.

At the stop boundary, the problematic case is different:

  • _remove_scheduled() can consume the queued stop event
  • but the concurrent update may not independently trigger _update_changed_running_state()
    In that case, nothing replaces the consumed stop event, so the stop notification is lost.

Copy link
Contributor

@llucax llucax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@Marenz Marenz merged commit 24b2f81 into frequenz-floss:v0.x.x Mar 12, 2026
5 checks passed
@llucax
Copy link
Contributor

llucax commented Mar 12, 2026

Wow, no merge queue for this repo?

@llucax
Copy link
Contributor

llucax commented Mar 12, 2026

On a second thought, I'm doubtful the merge queue really adds any value to us, I guess the try-merge and remove, go to the next in the queue is good for monorepos, that might have many many concurrent merges, but for us, that we hardly have any (maybe only when dependabot kicks in), it is starting to feel it just makes things slower and more expensive (testing twice), without much extra gain. But I guess this is something to discuss elsewhere...

@Marenz Marenz deleted the fix/heap-invariant-and-lost-stop-event branch March 12, 2026 08:39
@Marenz
Copy link
Contributor Author

Marenz commented Mar 12, 2026

Shit. I targeted the wrong upstream branch. We are on v1.x.x

@Marenz
Copy link
Contributor Author

Marenz commented Mar 12, 2026

And that's also why there is no merge queue.

@llucax
Copy link
Contributor

llucax commented Mar 12, 2026

Maybe remove the v0.x.x branch if we are not making more releases from it?

Marenz added a commit that referenced this pull request Mar 12, 2026
Rebased follow-up of #246 for `v1.x.x`.

- restores the scheduler heap after removing queued events
- emits the missing stop notification when an update lands on the stop
boundary
- adds regression coverage for both scheduler issues
- adapts the new scheduler tests to the `time-machine` 3 API used on
`v1.x.x`
@Marenz
Copy link
Contributor Author

Marenz commented Mar 12, 2026

Maybe remove the v0.x.x branch if we are not making more releases from it?

already done :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

part:dispatcher Affects the high-level dispatcher interface part:docs Affects the documentation part:tests Affects the unit, integration and performance (benchmarks) tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants