Skip to content

maintainer,dispatcher: match barrier done with action#4389

Open
3AceShowHand wants to merge 10 commits intopingcap:masterfrom
3AceShowHand:codex/fix-barrier-done-action
Open

maintainer,dispatcher: match barrier done with action#4389
3AceShowHand wants to merge 10 commits intopingcap:masterfrom
3AceShowHand:codex/fix-barrier-done-action

Conversation

@3AceShowHand
Copy link
Collaborator

@3AceShowHand 3AceShowHand commented Mar 7, 2026

What problem does this PR solve?

Issue Number: close #4398

#4263 introduced Flush as a maintainer-visible barrier action for storage sink. That solved the ordering problem, but it also pushed storage-sink-specific flush semantics into the maintainer state machine and made the barrier protocol three-phase.

That design turned out to be unnecessarily complex:

  • maintainer had to reason about Flush -> Write -> Pass
  • dispatcher and maintainer protocol had to carry an explicit Flush action
  • the flush behavior itself is local dispatcher/sink work, not something maintainer should orchestrate

This PR simplifies the design by moving the flush step back into dispatcher-local handling while keeping the storage-sink ordering guarantee.

What is changed and how it works?

The protocol is simplified back to Write -> Pass.

Dispatcher side:

  • when a blocking DDL or syncpoint is received, dispatcher now calls FlushDMLBeforeBlock locally before reporting WAITING to maintainer
  • for held DB/All block events on the table trigger dispatcher, the event is still delayed until pending ACKs are cleared, and only then goes through the same local flush -> WAITING path
  • AddBlockEventToSink / PassBlockEventToSink are reused as the only block-event sink helpers; whether they need to flush is inferred from dispatcher-local block status instead of a caller-passed flag
  • after maintainer sends Write or Pass, dispatcher does not flush again for that same pending block event, so we avoid duplicate flush / duplicate DDL execution

Maintainer side:

  • remove the explicit Flush action from the heartbeat protocol
  • revert barrier handling back to the pre-#4263 maintainer logic: wait for all influenced dispatchers, send Write to one writer, then send Pass
  • remove the extra flush phase/state that was introduced only for maintainer-side orchestration

In other words, flush still exists semantically, but it becomes a dispatcher-local prerequisite for entering WAITING instead of a maintainer-visible phase.

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?

No expected performance regression.

This changes the internal maintainer/dispatcher barrier protocol by removing the explicit Flush action. It is intended as a same-version change, not a mixed-version rolling-upgrade-compatible protocol extension.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Simplify storage sink barrier handling by moving pre-barrier flush into dispatcher-local handling and reverting maintainer barrier coordination to write/pass.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 7, 2026
@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lidezhu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 7, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Introduces an asynchronous "prepared" pathway for block events in the dispatcher (writePrepared/passPrepared), moves block-event IO onto a dedicated executor, removes the Flush action from the heartbeat proto, and removes per-barrier flush flags and related flush-phase logic across barrier/maintainer code and tests. Tests and mocks updated to support deferred flush hooks.

Changes

Cohort / File(s) Summary
Dispatcher: prepared/async block path
downstreamadapter/dispatcher/basic_dispatcher.go
Added writePreparedBlockEvent, passPreparedBlockEvent, prepareAndReportBlockedEvent; route Execute/Pass block event flows through prepared path and push IO onto a block-event executor; updated reportBlockedEventDone signature and call sites.
Dispatcher tests & mock sink
downstreamadapter/dispatcher/event_dispatcher_test.go, downstreamadapter/dispatcher/mock_sink_helper_test.go
Added synchronization in TestHoldBlockEventUntilNoResendTasks to defer DB flush; added FlushBeforeBlock hook and setter in dispatcherTestSink to allow test-controlled FlushDMLBeforeBlock behavior.
Protocol: heartbeat
heartbeatpb/heartbeat.proto
Removed Flush = 2 from Action enum (now only Write=0, Pass=1).
Barrier / Maintainer: remove flush flags
maintainer/barrier.go, maintainer/barrier_event.go, maintainer/maintainer_controller_bootstrap.go
Removed flushEnabled and flushDispatcherAdvanced fields/constructors; simplified Barrier/NewBlockEvent signatures and initialization; removed sendFlushAction and flush-phase auto-advance logic; updated bootstrap to call NewBarrier without flush flag.
Barrier tests updated / new test
maintainer/barrier_test.go, maintainer/barrier_event_test.go, maintainer/barrier_phase_test.go, maintainer/maintainer_controller_test.go
Removed flush-related assertions and flush-phase paths; updated NewBarrier call sites to new signature; added TestNewBlockEventInitialPhase and adjusted Resend/phase tests to reflect write/pass-only progression.
Other small edits
maintainer/... (tests/helpers)
Removed flush-specific helpers and adjusted event finish logic to depend only on writerDispatcherAdvanced gating.

Sequence Diagram(s)

sequenceDiagram
    participant Dispatcher
    participant Executor
    participant Sink
    participant Maintainer
    participant Barrier

    Dispatcher->>Executor: enqueue writePreparedBlockEvent(event)
    Executor->>Sink: FlushDMLBeforeBlock(event) (may block via test hook)
    Executor->>Sink: WritePreparedBlockEvent to sink (DDL/SyncPoint path)
    Executor->>Maintainer: reportBlockedEventDone(actionCommitTs, actionIsSyncPoint, ...)
    Maintainer->>Barrier: update BlockEvent status (phase evaluation)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

lgtm, approved

Suggested reviewers

  • wk989898
  • hongyunyan
  • flowbehappy

Poem

🐰 I hop and fiddle with event queues,
Prepared paths hum and executor queues snooze.
Flush gone from enums, flags tucked away,
Writers now write, passers can play.
Hooray — a small hop toward saner days! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title concisely captures the main change: introducing action-matching logic for barrier DONE events to ensure correct phase advancement.
Linked Issues check ✅ Passed The PR implements all four objectives from issue #4398: adds DoneAction to heartbeatpb.State, dispatcher reports typed DONE messages, maintainer validates action-phase matching, and maintains upgrade compatibility via legacy fallback.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the issue objectives: refactoring dispatcher's prepared block-event pathway, removing Flush enum, simplified barrier flush control, and test updates to validate DONE-action matching and legacy fallback behavior.
Description check ✅ Passed The pull request description provides a clear problem statement, detailed explanation of changes on both dispatcher and maintainer sides, addresses all checklist items, and includes a concise release note.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue in the barrier advancement mechanism, particularly in split-table move/merge scenarios. Previously, the system could incorrectly advance a barrier if a duplicated Flush DONE message was misinterpreted as a Write completion. The changes introduce explicit action type encoding within the barrier's DONE status, allowing the maintainer to precisely match completion messages with the current phase of barrier processing. This ensures robust and accurate barrier synchronization, preventing premature advancement and maintaining data consistency.

Highlights

  • Explicit DoneAction: Introduced a new DoneAction enum in heartbeatpb.State to explicitly indicate the type of action (Flush, Write, Pass) that completed a barrier phase.
  • Dispatcher Reporting: Modified the dispatcher to report the specific DoneAction when a block event reaches the DONE stage, ensuring the action type is propagated.
  • Maintainer Logic Refinement: Updated the maintainer's barrier logic to only advance a phase (Flush, Write, Pass) if the reported DoneAction matches the expected action for that phase, preventing premature advancements.
  • Backward Compatibility: Implemented a legacy fallback mechanism in the maintainer to infer the DoneAction for older versions that do not provide this information, ensuring smooth upgrades and mixed-version compatibility.
  • Regression Tests: Added new regression tests to verify correct DoneAction propagation and to ensure stale FlushDone messages are accurately ignored during the write phase.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • downstreamadapter/dispatcher/basic_dispatcher.go
    • Added DoneAction field to heartbeatpb.State when reporting dispatcher status.
    • Modified reportBlockedEventDone to accept and propagate the specific DoneAction (WriteDone, PassDone, FlushDone).
    • Introduced doneActionFromAction to map heartbeatpb.Action to heartbeatpb.DoneAction.
  • downstreamadapter/dispatcher/event_dispatcher_test.go
    • Added TestFlushDoneStatusCarriesDoneAction to verify DoneAction propagation and handling of duplicate flush messages.
  • heartbeatpb/heartbeat.pb.go
    • Added comments to Action enum values.
    • Defined new DoneAction enum and integrated it into State message.
    • Updated protobuf serialization/deserialization logic for State to include DoneAction.
  • heartbeatpb/heartbeat.proto
    • Defined new DoneAction enum.
    • Added done_action field to State message.
  • maintainer/barrier.go
    • Modified handleEventDone to use DoneAction for precise phase advancement (Flush, Write, Pass).
    • Implemented inferLegacyDoneAction for backward compatibility.
  • maintainer/barrier_test.go
    • Added TestBarrierIgnoresFlushDoneWhileWaitingWrite to ensure correct barrier phase advancement by ignoring stale FlushDone messages.
Activity
  • No human activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@3AceShowHand
Copy link
Collaborator Author

/test all

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical bug where a duplicated Flush DONE message could be misinterpreted as a Write completion, leading to premature barrier advancement. The introduction of the DoneAction enum to explicitly tag the source of a DONE status is a robust solution. The changes are well-implemented across the dispatcher and maintainer, and the inclusion of backward compatibility for mixed-version clusters is thoughtful. The new unit tests, especially TestBarrierIgnoresFlushDoneWhileWaitingWrite, are excellent and directly verify the fix for the described race condition. I have one suggestion to improve the clarity of the legacy action inference logic.

Note: Security Review did not run due to the size of the PR.

@3AceShowHand
Copy link
Collaborator Author

/test all

3 similar comments
@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/retest

@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/check-issue-triage-complete

@3AceShowHand
Copy link
Collaborator Author

/test all

1 similar comment
@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/test all

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maintainer/barrier_phase_test.go`:
- Around line 1-13: The file barrier_phase_test.go is missing the standard
Apache 2.0 copyright header causing CI failure; add the same multi-line
copyright header used across the repo at the very top of barrier_phase_test.go
(before the package statement) so that package maintainer and its test files
match the project's header format.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1f971d28-bf78-4add-b07e-5b0db0af6ad0

📥 Commits

Reviewing files that changed from the base of the PR and between 9e9cf90 and 3f8026a.

📒 Files selected for processing (6)
  • maintainer/barrier.go
  • maintainer/barrier_event.go
  • maintainer/barrier_event_test.go
  • maintainer/barrier_phase_test.go
  • maintainer/barrier_test.go
  • maintainer/maintainer_controller_bootstrap.go

@3AceShowHand
Copy link
Collaborator Author

/test all

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 10, 2026
@3AceShowHand
Copy link
Collaborator Author

/test all

// It relies on PassBlockEventToSink to preserve ordering and mark the event passed.
func (d *BasicDispatcher) PassBlockEvent(pendingEvent commonEvent.BlockEvent, actionCommitTs uint64, actionIsSyncPoint bool) {
failpoint.Inject("BlockOrWaitBeforePass", nil)
err := d.PassBlockEventToSink(pendingEvent)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between PassBlockEventToSink and the passPreparedBlockEvent function?

d.PassBlockEvent(pendingEvent, actionCommitTs, actionIsSyncPoint)
})
return true
case heartbeatpb.Action_Flush:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

// It is required for storage sink when split-table is enabled: one table may have multiple
// dispatchers on different nodes, and every dispatcher must flush pre-barrier
// DML first to prevent it from being reordered after the writer executes DDL.
Flush = 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this cause compatibility issues during the upgrade process?

@3AceShowHand
Copy link
Collaborator Author

/test all

1 similar comment
@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/test all

@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 10, 2026
@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/test all

2 similar comments
@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/retest

@3AceShowHand
Copy link
Collaborator Author

/test all

@3AceShowHand
Copy link
Collaborator Author

/test all

@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 10, 2026

@3AceShowHand: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cdc-storage-integration-heavy 9adef26 link true /test pull-cdc-storage-integration-heavy

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

storage sink: barrier can skip DDL write during split-table move/merge

2 participants