fix(core/txpool): coordinate reset lifecycle and shutdown signaling #28837 #2132

Open
gzliudan wants to merge 1 commit into XinFinOrg:dev-upgrade from gzliudan:sim-tx-lockstep

Conversation

Collaborator

@gzliudan gzliudan commented Mar 4, 2026

Proposed changes

Improve txpool loop synchronization around background resets.

This change:

  • adds an explicit termination channel to signal pool shutdown
  • tracks forced-reset intent and a waiter channel inside the reset loop
  • ensures reset waiters are notified on completion or on pool termination
  • allows an explicit sync request path to trigger an additional reset round when needed

Scope is limited to internal txpool concurrency control in core/txpool/txpool.go, with no protocol or RPC behavior change.
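The coordination described above can be pictured with a small standalone model. This is only a sketch under assumptions: `miniPool`, `newMiniPool`, `Close`, `resetBusy`, and `resetDone` are illustrative names, the reset work is a placeholder, and chain head handling is left out. The names `term`, `sync`, `quit`, `resetForced`, and `resetWaiter` mirror the ones discussed in this PR, but the real loop in core/txpool/txpool.go may differ in detail.

```go
package main

import (
	"errors"
	"fmt"
)

// miniPool is a hypothetical, stripped-down stand-in for TxPool.
type miniPool struct {
	quit chan chan error // shutdown requests
	term chan struct{}   // closed once the loop has exited
	sync chan chan error // forced-reset requests
}

func newMiniPool() *miniPool {
	p := &miniPool{
		quit: make(chan chan error),
		term: make(chan struct{}),
		sync: make(chan chan error),
	}
	go p.loop()
	return p
}

func (p *miniPool) loop() {
	defer close(p.term) // lets callers selecting on term observe shutdown

	var (
		resetBusy   = make(chan struct{}, 1) // holds a marker while a reset runs
		resetDone   = make(chan struct{}, 1) // reset goroutine reports completion
		resetForced bool                     // a forced reset has been requested
		resetWaiter chan error               // caller waiting for that reset
	)
	// On exit, release a waiter that is still pending.
	defer func() {
		if resetWaiter != nil {
			resetWaiter <- errors.New("pool already terminated")
		}
	}()

	var errc chan error
	for errc == nil {
		if resetForced {
			select {
			case resetBusy <- struct{}{}:
				go func() { resetDone <- struct{}{} }() // placeholder for real reset work
				resetForced = false
			default:
				// A reset is already running; the forced request stays pending.
			}
		}
		select {
		case <-resetDone:
			<-resetBusy
			// Notify the waiter unless another forced round is still owed.
			if resetWaiter != nil && !resetForced {
				resetWaiter <- nil
				resetWaiter = nil
			}
		case syncc := <-p.sync:
			resetForced = true
			resetWaiter = syncc
		case errc = <-p.quit:
			// Termination requested; the for condition ends the loop.
		}
	}
	errc <- nil
}

// Sync blocks until a forced reset completes or the pool has terminated.
func (p *miniPool) Sync() error {
	ch := make(chan error)
	select {
	case p.sync <- ch:
		return <-ch
	case <-p.term:
		return errors.New("pool already terminated")
	}
}

// Close asks the loop to stop and waits for its acknowledgement.
func (p *miniPool) Close() error {
	errc := make(chan error)
	p.quit <- errc
	return <-errc
}

func main() {
	p := newMiniPool()
	fmt.Println("sync:", p.Sync())
	fmt.Println("close:", p.Close())
	fmt.Println("sync after close:", p.Sync())
}
```

In this model a `Sync` call made after `Close` falls through to the `term` case, which is the role the termination channel plays in the description above.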

Ref: ethereum#28837

Types of changes

What types of changes does your code introduce to the XDC network?
Put a check mark in the boxes that apply

  • build: Changes that affect the build system or external dependencies
  • ci: Changes to CI configuration files and scripts
  • chore: Changes that don't change source code or tests
  • docs: Documentation only changes
  • feat: A new feature
  • fix: A bug fix
  • perf: A code change that improves performance
  • refactor: A code change that neither fixes a bug nor adds a feature
  • revert: Revert something
  • style: Changes that do not affect the meaning of the code
  • test: Adding missing tests or correcting existing tests

Impacted Components

Which parts of the codebase does this PR touch?
Put a check mark in the boxes that apply

  • Consensus
  • Account
  • Network
  • Geth
  • Smart Contract
  • External components
  • Not sure (Please specify below)

Checklist

Put a check mark in the boxes once you have confirmed the actions below (or provide reasons for not doing so):

  • This PR has sufficient test coverage (unit/integration test) OR I have provided reason in the PR description for not having test coverage
  • Tested on a private network from the genesis block and monitored the chain operating correctly for multiple epochs.
  • Provide an end-to-end test plan in the PR description on how to manually test it on the devnet/testnet.
  • Tested the backwards compatibility.
  • Tested with XDC nodes running this version co-exist with those running the previous version.
  • Relevant documentation has been updated as part of this PR
  • N/A

Copilot AI review requested due to automatic review settings March 4, 2026 03:48
coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Copilot AI left a comment

Pull request overview

This PR improves internal synchronization within the core/txpool/txpool.go event loop, specifically around background reset lifecycle management and shutdown signaling. It introduces a termination channel (term) to signal when the pool has stopped, a sync channel for simulator/testing use to force synchronous reset completion, and resetForced/resetWaiter state variables to track pending forced resets and notify waiters upon completion or pool termination.

Changes:

  • Added term chan struct{} field to signal pool termination (closed via defer in loop())
  • Added sync chan chan error field plus resetForced / resetWaiter state to support forced-reset synchronization for simulator use
  • Ensured active reset waiters are notified with an error upon pool shutdown

Comment on lines +77 to +78
    term: make(chan struct{}),
    sync: make(chan chan error),

Copilot AI Mar 4, 2026

The sync field (chan chan error) is added to the TxPool struct and initialized in New(), and the loop consumes from it via case syncc := <-p.sync. However, there is no exported method on TxPool (such as a Sync() error method) that sends to this channel. The channel is therefore inaccessible to any caller outside the package, making the entire sync/forced-reset mechanism completely unreachable from the intended simulator or testing use cases described in the PR. The companion public method is missing from this change.

Comment on lines +142 to +145
    // Notify the live reset waiter to not block if the txpool is closed.
    defer func() {
        if resetWaiter != nil {
            resetWaiter <- errors.New("pool already terminated")

Copilot AI Mar 4, 2026

The shutdown defer (lines 143–148) sends on resetWaiter using a blocking send (resetWaiter <- errors.New("pool already terminated")). If the caller that originally sent its channel via p.sync is no longer actively receiving (e.g., it has already been cancelled or timed out), this blocking send will deadlock the loop() goroutine's shutdown path indefinitely, preventing the pool from terminating cleanly. The channel should be sent to in a non-blocking manner or using a select with a default branch.

Suggested change
-   // Notify the live reset waiter to not block if the txpool is closed.
-   defer func() {
-       if resetWaiter != nil {
-           resetWaiter <- errors.New("pool already terminated")
+   // Notify the live reset waiter without blocking if the txpool is closed.
+   defer func() {
+       if resetWaiter != nil {
+           select {
+           case resetWaiter <- errors.New("pool already terminated"):
+           default:
+           }
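
One caveat about the select-with-default pattern is worth illustrating: on an unbuffered channel the send succeeds only if the receiver is already parked on the receive, so a waiter that has not reached its `<-ch` yet would miss the notification. A possible alternative, shown here purely as an assumption and not as part of this PR, is for the waiter to allocate a channel with capacity 1 so the loop's send neither blocks nor gets dropped. The helper name `notify` is hypothetical.

```go
package main

import "fmt"

// notify attempts a non-blocking send and reports whether the value was accepted.
func notify(ch chan error, err error) bool {
	select {
	case ch <- err:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan error)
	buffered := make(chan error, 1)

	// No goroutine is receiving yet, so the unbuffered send is dropped.
	fmt.Println("unbuffered delivered:", notify(unbuffered, nil))

	// The buffered channel stores the value until the waiter reads it.
	fmt.Println("buffered delivered:", notify(buffered, nil))
	fmt.Println("waiter received:", <-buffered)
}
```

With a buffered waiter channel a plain send would also be safe for a single notification, so the choice between the two is mostly about how defensive the loop needs to be.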

    // the forced op is still pending. In that case, wait another round
    // of resets.
    if resetWaiter != nil && !resetForced {
        resetWaiter <- nil

Copilot AI Mar 4, 2026

The notification at resetWaiter <- nil (line 194) is a blocking send on an unbuffered chan error. If a Sync() caller has abandoned its wait (e.g., due to context cancellation or timeout), this send will permanently block the main loop goroutine, preventing it from processing any further chain head events or shutdown signals. The send should be non-blocking (using a select with default) to avoid stalling the event loop.

Suggested change
-   resetWaiter <- nil
+   select {
+   case resetWaiter <- nil:
+       // notification delivered
+   default:
+       // no active listener; avoid blocking the event loop
+   }

    subs event.SubscriptionScope // Subscription scope to unsubscribe all on shutdown
    quit chan chan error         // Quit channel to tear down the head updater
    term chan struct{}           // Termination channel to detect a closed pool

Copilot AI Mar 4, 2026

The term channel is declared in the TxPool struct, initialized in New(), and closed via defer close(p.term) in loop(), but it is never read anywhere in the codebase. There are no callers selecting on p.term to detect pool termination. As written, this field has no observable effect and is dead code. Either the channel should be exposed (e.g., via a Terminated() method that returns the channel for callers to select on), or it should be removed to avoid confusion.
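
To make the accessor idea concrete, here is a minimal sketch of what a `Terminated()` method could look like. Everything in it is an assumption for illustration: the `pool` type, `newPool`, `stop`, and `isTerminated` are hypothetical, and the PR itself does not define such a method.

```go
package main

import "fmt"

// pool is a hypothetical stand-in that only carries the termination channel.
type pool struct {
	term chan struct{}
}

func newPool() *pool { return &pool{term: make(chan struct{})} }

// Terminated exposes a receive-only view of the channel, so callers can
// select on it but cannot close it or send on it.
func (p *pool) Terminated() <-chan struct{} { return p.term }

// stop stands in for the event loop exiting and closing the channel.
func (p *pool) stop() { close(p.term) }

// isTerminated polls the channel without blocking.
func isTerminated(p *pool) bool {
	select {
	case <-p.Terminated():
		return true
	default:
		return false
	}
}

func main() {
	p := newPool()
	fmt.Println("before stop:", isTerminated(p))
	p.stop()
	fmt.Println("after stop:", isTerminated(p))
}
```

Returning a receive-only channel keeps ownership of the close with the loop, which is the usual reason to prefer an accessor over exporting the field.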

    // that any running reset operation finishes to make block imports
    // deterministic. On top of that, run a new reset operation to make
    // transaction insertions deterministic instead of being stuck in a
    // queue waiting for a reset.

Copilot AI Mar 4, 2026

When a new sync request arrives via case syncc := <-p.sync, the code unconditionally assigns resetWaiter = syncc. If a previous resetWaiter was already set (from an earlier sync request that has not yet been fulfilled), it is silently overwritten and the previous caller goroutine that is blocked waiting on its channel will never receive a notification, causing an indefinite goroutine leak/hang.

Before overwriting resetWaiter, the prior waiter should be notified (e.g., with an appropriate error) to prevent the leak.

Suggested change
-   // queue waiting for a reset.
+   // queue waiting for a reset.
+   if resetWaiter != nil {
+       // A previous sync waiter is still pending; notify it to avoid
+       // leaking a goroutine waiting on the old channel.
+       resetWaiter <- errors.New("sync request superseded by a new request")
+       resetWaiter = nil
+   }
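
Another way to avoid losing an earlier waiter, sketched here only as an assumption and not as something this PR implements, is to keep every pending channel in a list and notify all of them when the reset round completes. The `waiters` type and its methods are hypothetical, and the sketch assumes each waiter channel has capacity 1 so that sends cannot block the loop.

```go
package main

import "fmt"

// waiters collects every pending sync channel (hypothetical helper type).
type waiters []chan error

// add registers another caller waiting for the next reset to complete.
func (w *waiters) add(ch chan error) { *w = append(*w, ch) }

// notifyAll delivers err to every registered waiter, clears the list, and
// returns how many waiters accepted the value.
func (w *waiters) notifyAll(err error) int {
	delivered := 0
	for _, ch := range *w {
		select {
		case ch <- err:
			delivered++
		default:
			// Buffer already full; skip rather than block the loop.
		}
	}
	*w = nil
	return delivered
}

func main() {
	var pending waiters
	first := make(chan error, 1)
	second := make(chan error, 1)
	pending.add(first)
	pending.add(second)

	fmt.Println("notified:", pending.notifyAll(nil))
	fmt.Println("first:", <-first, "second:", <-second)
}
```

Compared with reporting a "superseded" error, this lets both callers observe the same completed reset, at the cost of a little more state in the loop.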

…thereum#28837

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Comment on lines +426 to +434
func (p *TxPool) Sync() error {
    sync := make(chan error)
    select {
    case p.sync <- sync:
        return <-sync
    case <-p.term:
        return errors.New("pool already terminated")
    }
}

Copilot AI Mar 4, 2026

The new Sync() method, resetWaiter handling, and term channel shutdown signaling introduce complex concurrency behavior that lacks any unit test coverage. There are no test files at the core/txpool package level. Given the complexity of the added synchronization logic (e.g., forced reset lifecycle, waiter notification on pool termination), adding test cases to verify correct behavior and prevent regressions would be valuable. For example, tests for: (1) Sync() unblocking after a reset completes, (2) Sync() returning an error when the pool is closed, (3) correct waiter notification on pool shutdown.
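
Tests for blocking calls such as `Sync()` usually need a timeout guard so that a regression shows up as a failure instead of a hung test run. The following is a minimal sketch of such a guard, not code from this PR: `callWithTimeout`, `errTimeout`, `shortWait`, and `longWait` are hypothetical names, and the closures in `main` merely stand in for a call that returns and a call that hangs.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative durations; a real test would tune these.
const (
	shortWait = 50 * time.Millisecond
	longWait  = time.Second
)

var errTimeout = errors.New("call did not return in time")

// callWithTimeout runs fn in its own goroutine and returns its result,
// or errTimeout if fn has not returned after d.
func callWithTimeout(fn func() error, d time.Duration) error {
	done := make(chan error, 1) // buffered so the goroutine can exit after a timeout
	go func() { done <- fn() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return errTimeout
	}
}

func main() {
	// Stand-in for a Sync() call that completes.
	fmt.Println("returns:", callWithTimeout(func() error { return nil }, longWait))

	// Stand-in for a Sync() call that is never notified.
	block := make(chan struct{})
	stuck := func() error { <-block; return nil }
	fmt.Println("hangs:", callWithTimeout(stuck, shortWait))
	close(block) // release the stuck goroutine
}
```

In a real test the closure would wrap the pool's `Sync()` call, and the three cases listed above would each assert on the returned error.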

Comment on lines +426 to +432
func (p *TxPool) Sync() error {
    sync := make(chan error)
    select {
    case p.sync <- sync:
        return <-sync
    case <-p.term:
        return errors.New("pool already terminated")

Copilot AI Mar 4, 2026

The string "pool already terminated" is used in two separate errors.New() calls (line 145 in the defer and line 432 in Sync()), resulting in two distinct error instances. The existing pattern in errors.go defines all package-level errors as exported sentinel variables (e.g., ErrAlreadyKnown, ErrTxPoolOverflow), which allows callers to compare with errors.Is(). A sentinel error such as ErrPoolTerminated would be consistent with this codebase convention and easier to compare programmatically.
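
The difference between a shared sentinel and two separate `errors.New()` calls can be shown with a short sketch. `ErrPoolTerminated` is the name suggested in this review and is not defined by the PR; `syncClosedPool`, `adHocError`, and `isPoolTerminated` are hypothetical helpers used only for the demonstration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPoolTerminated is the sentinel suggested in the review (an assumption here).
var ErrPoolTerminated = errors.New("pool already terminated")

// syncClosedPool stands in for a call that fails because the pool is closed.
// It wraps the sentinel so that extra context does not break comparisons.
func syncClosedPool() error {
	return fmt.Errorf("sync failed: %w", ErrPoolTerminated)
}

// adHocError creates a fresh error with the same text but a different identity.
func adHocError() error {
	return errors.New("pool already terminated")
}

// isPoolTerminated reports whether err is, or wraps, the sentinel.
func isPoolTerminated(err error) bool {
	return errors.Is(err, ErrPoolTerminated)
}

func main() {
	fmt.Println("wrapped sentinel matches:", isPoolTerminated(syncClosedPool()))
	fmt.Println("same text, new instance matches:", isPoolTerminated(adHocError()))
}
```

The second check fails because `errors.Is` compares error identity along the wrap chain, not message text, which is why callers cannot reliably detect the two ad hoc instances.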

Comment on lines +143 to +145
    defer func() {
        if resetWaiter != nil {
            resetWaiter <- errors.New("pool already terminated")

Copilot AI Mar 4, 2026

This errors.New("pool already terminated") and the identical one in Sync() at line 432 are two separate error instances. Replacing both with a shared sentinel variable (e.g., ErrPoolTerminated defined in errors.go) would follow the existing pattern of package-level error variables in this file and make the error checkable via errors.Is().

@gzliudan gzliudan changed the title fix(core/txpool): coordinate reset lifecycle and shutdown signaling #28837 [WIP] fix(core/txpool): coordinate reset lifecycle and shutdown signaling #28837 Mar 4, 2026
@gzliudan gzliudan added the WIP work in process label Mar 10, 2026
@gzliudan gzliudan changed the title [WIP] fix(core/txpool): coordinate reset lifecycle and shutdown signaling #28837 fix(core/txpool): coordinate reset lifecycle and shutdown signaling #28837 Mar 10, 2026
Labels

WIP work in process

2 participants