fix(host-rpc): prevent block gap when buffer exhaustion resets backfill #131

Closed
rswanson wants to merge 1 commit into main from fix/backfill-gap-on-buffer-exhaustion
Conversation

@rswanson
Member

@rswanson rswanson commented Apr 3, 2026

Summary

  • Root cause: When walk_chain returned WalkResult::Exhausted, the notifier set backfill_from = finalized, which could be ahead of the last delivered block. This left a gap of undelivered blocks (e.g. blocks 24800926–24800938, between last delivered host block 24800925 and finalized 24800939), causing "parent ru block not present in DB" crashes during initial sync.
  • Primary fix (host-rpc/notifier.rs): Compute resume_from = min(chain_view.back + 1, finalized) before clearing the buffer, ensuring backfill restarts from where we left off rather than jumping ahead.
  • Defensive check (node/node.rs): Add gap detection in process_committed_chain — if the first block to process isn't contiguous with the last stored block, bail with a clear error message instead of the cryptic parent-not-found error.
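
A minimal sketch of the two changes described above, not the actual diff. The function names and the `String` error type are hypothetical; only the `min(chain_view.back + 1, finalized)` formula and the idea of a contiguity check come from this PR.

```rust
/// Hypothetical helper mirroring the formula in the summary:
/// resume_from = min(chain_view.back + 1, finalized).
fn resume_from(chain_view_back: u64, finalized: u64) -> u64 {
    (chain_view_back + 1).min(finalized)
}

/// Hypothetical stand-in for the defensive check in process_committed_chain:
/// the first block to process must directly follow the last stored block.
/// The error wording here is illustrative, not the real log message.
fn check_contiguous(last_stored: u64, first_incoming: u64) -> Result<(), String> {
    let expected = last_stored + 1;
    if first_incoming != expected {
        return Err(format!(
            "notification gap: expected block {expected}, got {first_incoming}"
        ));
    }
    Ok(())
}
```

With the numbers from this report, `resume_from(24800925, 24800939)` yields 24800926, and `check_contiguous(24800925, 24800939)` returns an error rather than letting the node fail later on a missing parent.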

Reproduction scenario

During initial sync of signet-sidecar, the backfill ceiling landed close to the current tip. The first incoming newHead was >64 blocks ahead of the chain_view's latest entry, exhausting the buffer immediately. The notifier then reset to finalized (24800939) while the last delivered block was 24800925, skipping blocks 24800926–24800938.
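
To make the arithmetic of the scenario concrete, here is a small sketch that computes which blocks are never delivered when backfill restarts ahead of the last delivered block. `skipped_blocks` is a hypothetical helper for illustration, and it assumes backfill_from is the first block the restarted backfill will deliver.

```rust
use std::ops::RangeInclusive;

/// Hypothetical helper: the inclusive range of blocks that are skipped when
/// backfill restarts at `backfill_from` after `last_delivered` was emitted.
/// Returns None when the restart point leaves no gap.
fn skipped_blocks(last_delivered: u64, backfill_from: u64) -> Option<RangeInclusive<u64>> {
    let next_expected = last_delivered + 1;
    if backfill_from > next_expected {
        Some(next_expected..=backfill_from - 1)
    } else {
        None
    }
}
```

For the values above, `skipped_blocks(24800925, 24800939)` gives `24800926..=24800938`.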

Test plan

  • Deploy updated signet-sidecar to a fresh node and verify it syncs through the backfill→frontfill transition without crashing
  • Verify logs show resume_from in the exhaustion warning when buffer is exhausted
  • Verify that if a gap is somehow introduced, the node logs the new "notification gap" error instead of the opaque "parent ru block not present in DB"

🤖 Generated with Claude Code

When walk_chain exhausted the buffer, backfill_from was set to
cached_finalized, which could be ahead of the last delivered block.
This created a gap of undelivered blocks, causing "parent ru block
not present in DB" crashes during initial sync.

Now computes resume_from as min(chain_view.back + 1, finalized) to
ensure continuity. Also adds a defensive gap check in the node's
process_committed_chain to bail with a clear error message if a
notification gap is ever detected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rswanson rswanson requested a review from a team as a code owner April 3, 2026 19:18
Member

@prestwich prestwich left a comment

Can this fix break our guarantee that notifications are always contiguous by setting the next notification earlier than the previous one?

@prestwich
Member

[Claude Code]

This fix has a correctness issue: comparing against finalized (or chain_view.back()) is the wrong anchor for resuming after buffer exhaustion. Both are proxies for "what have we delivered downstream," but neither actually tracks that.

The bug

resume_from = (chain_view.back() + 1).min(finalized) goes backward whenever finalized lags behind the view tip, which is the normal case. If chain_view has blocks up to 100 and finalized is 95, we set backfill_from = 95 and re-emit blocks 95+ — violating the contiguity guarantee that notifications only move forward.
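
The example can be checked with a short sketch. The helper names are hypothetical, and the sketch assumes every block up to the view tip (100) has already been delivered downstream, which is what the example implies.

```rust
use std::ops::RangeInclusive;

/// The formula proposed in this PR, written as a free function for illustration.
fn proposed_resume_from(chain_view_back: u64, finalized: u64) -> u64 {
    (chain_view_back + 1).min(finalized)
}

/// Hypothetical helper: blocks that would be emitted a second time if backfill
/// restarts at `resume_from` after everything through `last_delivered` was sent.
fn re_emitted(last_delivered: u64, resume_from: u64) -> Option<RangeInclusive<u64>> {
    if resume_from <= last_delivered {
        Some(resume_from..=last_delivered)
    } else {
        None
    }
}
```

With a view tip of 100 and finalized at 95, `proposed_resume_from(100, 95)` is 95, so `re_emitted(100, 95)` covers blocks 95 through 100.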

The original code (backfill_from = finalized) has the same fundamental problem. Finalized is an L1 consensus concept with no relationship to what the notifier has actually delivered. It can be behind, ahead, or sideways relative to our last emission.

The fix

The notifier needs a high-water mark: a field tracking the highest block number it has actually emitted in a HostNotification. On buffer exhaustion, resume from high_water_mark + 1, unconditionally. No comparison to finalized, no comparison to chain_view entries.

This gives us:

  • Forward-only progress — we never backtrack past what we've delivered
  • Contiguity preserved — next emission starts exactly where the last left off
  • Decoupled from L1 tags — finalized/safe are irrelevant to delivery bookkeeping

The chain_view still gets cleared (it's a hash-walk buffer, not a delivery record), and backfill picks up from the delivery watermark.
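
A rough sketch of what such a watermark could look like. `DeliveryTracker` and its methods are hypothetical names, not the notifier's real API, and returning `None` before anything has been emitted is an assumption about how the initial state might be handled.

```rust
/// Hypothetical delivery bookkeeping: tracks the highest block number that has
/// actually been emitted in a notification, independent of finalized/safe tags.
#[derive(Debug, Default)]
struct DeliveryTracker {
    high_water_mark: Option<u64>,
}

impl DeliveryTracker {
    /// Record an emitted block. The mark only ever moves forward.
    fn record_emitted(&mut self, block_number: u64) {
        let new_mark = match self.high_water_mark {
            Some(current) => current.max(block_number),
            None => block_number,
        };
        self.high_water_mark = Some(new_mark);
    }

    /// Where backfill should resume after buffer exhaustion:
    /// high_water_mark + 1, with no comparison to finalized or chain_view.
    /// None means nothing has been emitted yet (assumed initial state).
    fn resume_from(&self) -> Option<u64> {
        self.high_water_mark.map(|mark| mark + 1)
    }
}
```

In the failing scenario, recording block 24800925 and then clearing the buffer would resume at 24800926 regardless of where finalized sits.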

The guard added in node.rs is good as a defensive check, but it shouldn't be the mechanism that papers over a notifier that can emit non-contiguous or backward-moving notifications.

@prestwich
Member

superseded by #133

@prestwich prestwich closed this Apr 4, 2026