
fix(*): merge operator inconsistent after maintainer move#3769

Open
wlwilliamx wants to merge 17 commits into pingcap:master from
wlwilliamx:fix/merge-operator-inconsistent-after-maintainer-move

Conversation

@wlwilliamx
Collaborator

@wlwilliamx wlwilliamx commented Dec 23, 2025

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test

Questions

Will it cause performance regression or break compatibility?

None

Do you need to update user documentation, design documentation or monitoring documentation?

None

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of merge operations during maintainer failover by ensuring in-flight merges are properly tracked and restored after a failover event.
    • Enhanced bootstrap recovery to correctly rebuild and resume merge operations that were in progress at the time of failover.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue and release-note (Denotes a PR that will be considered when it comes time to generate release notes.) labels on Dec 23, 2025
@ti-chi-bot

ti-chi-bot bot commented Dec 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hongyunyan for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) label on Dec 23, 2025
@gemini-code-assist

Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and consistency of dispatcher operations, particularly in distributed environments where maintainer nodes might restart or fail over. By introducing explicit tracking of different operator types and implementing comprehensive recovery logic for in-flight operations during bootstrap, the system can now maintain a more accurate state and seamlessly resume complex tasks like merging dispatchers, even after disruptions. This change addresses potential inconsistencies that could arise from untracked operations, leading to a more reliable and fault-tolerant system.

Highlights

  • Operator Type Tracking: Introduced a new OperatorType enum (Add, Remove, Move, Split, Merge) and integrated it into ScheduleDispatcherRequest messages, allowing for more granular tracking of dispatcher operations.
  • In-flight Operator Recovery: Implemented mechanisms to track and restore in-flight operators, including merge operations, during system bootstrap. This ensures that ongoing operations are not lost and can resume correctly after a node restart or maintainer failover.
  • Dispatcher Manager Enhancements: Added currentOperatorMap, redoCurrentOperatorMap, and mergeOperatorMap to the DispatcherManager to store active operators and merge requests. Logic was added to manage these maps during dispatcher lifecycle events and message processing.
  • Bootstrap Response Updates: The MaintainerBootstrapResponse now includes lists of Operators and MergeOperators, enabling the maintainer to reconstruct the state of ongoing operations upon recovery.
  • Improved Operator Handling Logic: Refined the handling of add, remove, move, and split operators to correctly utilize the new OperatorType and ensure proper state management, especially during concurrent scheduling and cleanup.
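The per-dispatcher operator tracking described in the highlights above can be sketched as a small, self-contained Go program. This is an illustrative simplification, not the PR's actual code: the real `OperatorType` enum lives in `heartbeatpb` (with names like `OperatorType_O_Add`), and `currentOperatorMap` belongs to `DispatcherManager`; the `operatorTracker` type and its `Track`/`Done` methods here are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// OperatorType mirrors the new enum described in the PR
// (Add, Remove, Move, Split, Merge).
type OperatorType int

const (
	OperatorTypeAdd OperatorType = iota
	OperatorTypeRemove
	OperatorTypeMove
	OperatorTypeSplit
	OperatorTypeMerge
)

// operatorTracker is a simplified stand-in for the currentOperatorMap added
// to DispatcherManager: it records the in-flight operator per dispatcher so
// that a second request on the same dispatcher can be rejected.
type operatorTracker struct {
	current sync.Map // dispatcher ID -> OperatorType
}

// Track reports whether the operator was accepted; LoadOrStore returns
// loaded=true when an operator is already in flight for that dispatcher.
func (t *operatorTracker) Track(dispatcherID string, op OperatorType) bool {
	_, loaded := t.current.LoadOrStore(dispatcherID, op)
	return !loaded
}

// Done clears the entry once the operator finishes, allowing new requests.
func (t *operatorTracker) Done(dispatcherID string) {
	t.current.Delete(dispatcherID)
}

func main() {
	var t operatorTracker
	fmt.Println(t.Track("disp-1", OperatorTypeMove))  // true: accepted
	fmt.Println(t.Track("disp-1", OperatorTypeSplit)) // false: already in flight
	t.Done("disp-1")
	fmt.Println(t.Track("disp-1", OperatorTypeSplit)) // true: slot freed
}
```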


@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-heavy

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-light


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a mechanism to track and restore in-flight dispatcher operations (add, remove, move, split, merge) during maintainer failover and bootstrap. Key changes include adding currentOperatorMap, redoCurrentOperatorMap, and mergeOperatorMap to DispatcherManager to store ongoing operations, and updating the protobuf definitions to include OperatorType and lists of in-flight operators in the MaintainerBootstrapResponse. The HeartBeatCollector now tracks merge operators, and the SchedulerDispatcherRequestHandler prevents concurrent operations on the same span by checking these new operator maps. During bootstrap, the maintainer now restores these in-flight operators. Review comments highlight that the OperatorType should be correctly propagated and not hardcoded, especially for move and split operations, and suggest simplifying the concurrent operator check logic by potentially unifying the currentOperatorMap and redoCurrentOperatorMap.

case heartbeatpb.ScheduleAction_Create:
	switch req.OperatorType {
	case heartbeatpb.OperatorType_O_Add, heartbeatpb.OperatorType_O_Move, heartbeatpb.OperatorType_O_Split:
		op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, heartbeatpb.OperatorType_O_Add)


high

When restoring an add operator, the original operator type from the request (req.OperatorType) should be preserved. Hardcoding OperatorType_O_Add here will cause move and split operators to be incorrectly restored as simple add operators, breaking the operator restoration logic.

Suggested change
-	op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, heartbeatpb.OperatorType_O_Add)
+	op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, req.OperatorType)

Comment on lines +155 to +157
	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Add)
case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
-	return m.replicaSet.NewRemoveDispatcherMessage(m.origin)
+	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Remove)


high

The add and remove parts of a move operation should both be typed as O_Move. Using O_Add and O_Remove is incorrect and will break operator restoration logic on maintainer failover, as the new maintainer will not recognize these as parts of a single move operation.

Suggested change
-	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Add)
-case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
-	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Remove)
+	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Move)
+case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
+	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Move)

Comment on lines +227 to +246
_, exists := dispatcherManager.currentOperatorMap.Load(operatorKey)
if exists {
	log.Warn("operator key exists, skip this request",
		zap.String("changefeedID", req.ChangefeedID.String()),
		zap.String("dispatcherID", common.NewDispatcherIDFromPB(req.Config.DispatcherID).String()),
		zap.String("operatorKey", operatorKey),
		zap.Any("operator", req),
	)
	continue
}
_, redoExists := dispatcherManager.redoCurrentOperatorMap.Load(operatorKey)
if redoExists {
	log.Warn("redo operator key exists, skip this request",
		zap.String("changefeedID", req.ChangefeedID.String()),
		zap.String("dispatcherID", common.NewDispatcherIDFromPB(req.Config.DispatcherID).String()),
		zap.String("operatorKey", operatorKey),
		zap.Any("operator", req),
	)
	continue
}


medium

The logic to prevent concurrent operators on the same span or dispatcher is split between checking currentOperatorMap and redoCurrentOperatorMap. This could be simplified by using a single map for both, with a composite key or value to distinguish between redo and normal modes. This would reduce code duplication and make the logic easier to follow.
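The reviewer's suggestion above can be sketched as follows. This is a hypothetical illustration, not the PR's code: the `unifiedTracker` type, `opKey` composite key, and the `TryAcquire`/`Release` names are invented here to show how folding `currentOperatorMap` and `redoCurrentOperatorMap` into one map with a mode-aware key removes the duplicated check.

```go
package main

import (
	"fmt"
	"sync"
)

// opKey makes the redo mode part of the map key, so one map replaces the
// pair of currentOperatorMap / redoCurrentOperatorMap lookups.
type opKey struct {
	dispatcherID string
	redo         bool
}

// unifiedTracker is an illustrative single-map guard against concurrent
// operators on the same dispatcher and mode.
type unifiedTracker struct {
	ops sync.Map // opKey -> struct{}
}

// TryAcquire reports whether the request may proceed; it fails when an
// operator for the same dispatcher and mode is already in flight.
func (t *unifiedTracker) TryAcquire(dispatcherID string, redo bool) bool {
	_, loaded := t.ops.LoadOrStore(opKey{dispatcherID, redo}, struct{}{})
	return !loaded
}

// Release frees the slot once the operator completes or is aborted.
func (t *unifiedTracker) Release(dispatcherID string, redo bool) {
	t.ops.Delete(opKey{dispatcherID, redo})
}

func main() {
	var t unifiedTracker
	fmt.Println(t.TryAcquire("disp-1", false)) // true: accepted
	fmt.Println(t.TryAcquire("disp-1", false)) // false: duplicate skipped
	fmt.Println(t.TryAcquire("disp-1", true))  // true: redo mode tracked separately
}
```

A single `LoadOrStore` also closes the small race window between two separate `Load` calls on two maps.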

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-heavy

@wlwilliamx
Collaborator Author

/test pull-cdc-mysql-integration-light

@ti-chi-bot

ti-chi-bot bot commented Dec 23, 2025

@wlwilliamx: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • Test name: pull-cdc-mysql-integration-heavy
    Commit: b8473b1
    Required: true
    Rerun command: /test pull-cdc-mysql-integration-heavy

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

…r-inconsistent-after-maintainer-move

# Conflicts:
#	downstreamadapter/dispatchermanager/dispatcher_manager.go
#	downstreamadapter/dispatchermanager/dispatcher_manager_info.go
#	downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
#	downstreamadapter/dispatchermanager/helper.go
#	downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
#	heartbeatpb/heartbeat.pb.go
#	heartbeatpb/heartbeat.proto
#	maintainer/maintainer_controller.go
#	maintainer/maintainer_controller_bootstrap.go
#	maintainer/maintainer_controller_helper.go
#	maintainer/maintainer_manager_test.go
#	maintainer/maintainer_test.go
#	maintainer/operator/operator_add.go
#	maintainer/operator/operator_move.go
#	maintainer/operator/operator_remove.go
#	maintainer/replica/replication_span.go
#	maintainer/replica/replication_span_test.go
@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

This PR introduces in-flight merge operator tracking to enable bootstrap recovery after maintainer failover. It adds a mergeOperatorMap to DispatcherManager, new public methods for tracking and querying merge operations, integration into heartbeat collection and bootstrap responses, and bootstrap restoration logic to recreate merge operations from persisted state.

Changes

Cohort / File(s) Summary
DispatcherManager Core Tracking
downstreamadapter/dispatchermanager/dispatcher_manager.go, downstreamadapter/dispatchermanager/dispatcher_manager_merge.go
Added mergeOperatorMap field to store in-flight merge requests. Introduces four public methods (TrackMergeOperator, RemoveMergeOperator, MaybeCleanupMergeOperator, GetMergeOperators) and a helper to clone merge requests for safe storage.
Heartbeat Integration
downstreamadapter/dispatchermanager/heartbeat_collector.go, downstreamadapter/dispatchermanager/helper.go
Added dispatcherManagers map in HeartBeatCollector to link DispatcherManager instances. Wires tracking calls into merge dispatch flow and adds nil-check with cleanup invocation on failed merge operations.
Merge Operator Lifecycle
downstreamadapter/dispatchermanager/task.go
Added cleanup calls to RemoveMergeOperator in abortMerge and doMerge paths when merge operations complete.
Bootstrap Response
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go, heartbeatpb/heartbeat.proto
Extended MaintainerBootstrapResponse with new merge_operators field to carry in-flight merge requests for bootstrap recovery.
Bootstrap Restoration
maintainer/maintainer_controller_bootstrap.go
Added comprehensive merge operator restoration workflow with new helpers (buildTableSplitMap, buildMergedSpanFromReplicas). Reconstructs span replications, validates preconditions, and restores merge operations with proper state synchronization.
Operator Management
maintainer/operator/operator_controller.go, maintainer/operator/operator_merge.go
Added AddRestoredMergeOperator method to Controller and NewRestoredMergeDispatcherOperator constructor to synthesize and restore merge operators from existing replica sets during bootstrap recovery.
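The merge-operator lifecycle summarized in the table above (track on dispatch, remove when `doMerge`/`abortMerge` completes, read back when building the bootstrap response) can be sketched in isolation. This is a simplified stand-in, not the PR's implementation: `mergeRequest` replaces the real `heartbeatpb.MergeDispatcherRequest`, and the string-keyed map replaces the actual dispatcher-ID type; the method names match those described in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// mergeRequest stands in for heartbeatpb.MergeDispatcherRequest; the field
// names here are illustrative, not the real protobuf schema.
type mergeRequest struct {
	MergedDispatcherID string
	SourceIDs          []string
}

// mergeTracker sketches the mergeOperatorMap lifecycle: TrackMergeOperator
// on dispatch, RemoveMergeOperator when the merge finishes or aborts, and
// GetMergeOperators when assembling the bootstrap response after failover.
type mergeTracker struct {
	m sync.Map // merged dispatcher ID -> *mergeRequest
}

func (t *mergeTracker) TrackMergeOperator(req *mergeRequest) {
	if req == nil || req.MergedDispatcherID == "" {
		return // reject empty IDs, mirroring the zero-ID guard raised in review
	}
	clone := *req // store a copy so callers cannot mutate tracked state
	t.m.Store(req.MergedDispatcherID, &clone)
}

func (t *mergeTracker) RemoveMergeOperator(mergedID string) {
	t.m.Delete(mergedID)
}

func (t *mergeTracker) GetMergeOperators() []*mergeRequest {
	var out []*mergeRequest
	t.m.Range(func(_, v any) bool {
		out = append(out, v.(*mergeRequest))
		return true
	})
	return out
}

func main() {
	var t mergeTracker
	t.TrackMergeOperator(&mergeRequest{MergedDispatcherID: "merged-1", SourceIDs: []string{"a", "b"}})
	fmt.Println(len(t.GetMergeOperators())) // 1: in-flight merge survives for bootstrap
	t.RemoveMergeOperator("merged-1")       // doMerge / abortMerge completed
	fmt.Println(len(t.GetMergeOperators())) // 0: nothing left to restore
}
```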

Sequence Diagram(s)

sequenceDiagram
    participant Helper as Merge Helper
    participant DM as DispatcherManager
    participant HC as HeartBeatCollector
    participant DO as DispatcherOrchestrator
    participant Maintainer as Maintainer Controller

    Helper->>DM: TrackMergeOperator(request)
    activate DM
    DM->>DM: Store request in mergeOperatorMap
    deactivate DM

    HC->>DM: GetMergeOperators()
    activate DM
    DM-->>HC: Return tracked requests
    deactivate DM

    DO->>DM: GetMergeOperators()
    activate DM
    DM-->>DO: Return tracked requests
    deactivate DM

    DO->>Maintainer: BootstrapResponse with merge_operators
    activate Maintainer
    Maintainer->>Maintainer: Restore merge operators from state
    Maintainer->>Maintainer: Create SpanReplications
    Maintainer->>Maintainer: Invoke AddRestoredMergeOperator
    Maintainer->>Maintainer: Start restored merge operation
    deactivate Maintainer
sequenceDiagram
    participant Task as Merge Task
    participant DM as DispatcherManager
    participant Cleanup as Cleanup Handler

    Task->>Task: doMerge() completes
    activate Task
    Task->>DM: RemoveMergeOperator(mergedDispatcherID)
    activate DM
    DM->>DM: Remove from mergeOperatorMap
    deactivate DM
    Task->>Cleanup: Mark merge complete
    deactivate Task

    Task->>Task: abortMerge() executes
    activate Task
    Task->>DM: RemoveMergeOperator(mergedDispatcherID)
    activate DM
    DM->>DM: Remove from mergeOperatorMap
    deactivate DM
    Task->>Cleanup: Log abort event
    deactivate Task

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

lgtm, approved

Suggested reviewers

  • hongyunyan
  • flowbehappy
  • lidezhu
  • wk989898

Poem

🐰 The merge operators now remember their dance,
Tracking in-flight through each restart's chance,
Bootstrap restores what was lost in the fray,
Failover no longer sweeps work away!
Through heartbeats and helpers, the state marches on,
A distributed tango, forever redrawn.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The PR description uses the template but leaves critical sections as unfilled placeholders: the Issue Number field contains only 'close #xxx' without an actual issue reference, and 'What is changed and how it works?' is completely empty despite extensive code changes. Resolution: fill in the actual issue number being closed and explain in detail what changed and how the fix restores merge operator consistency after maintainer failover.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 30.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title 'fix(*): merge operator inconsistent after maintainer move' clearly summarizes the main change, directly addressing the core issue being fixed.


@ti-chi-bot

ti-chi-bot bot commented Mar 10, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
downstreamadapter/dispatchermanager/heartbeat_collector.go (1)

294-296: Duplicate tracking—consider consolidating.

This TrackMergeOperator call is redundant with the one in MergeDispatcherRequestHandler.Handle (helper.go, line 788). Both track the same request. While idempotent and harmless, consolidating to a single call would reduce confusion. Consider removing this early tracking and relying solely on the handler, or vice versa.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatchermanager/heartbeat_collector.go` around lines 294
- 296, The call to DispatcherManager.TrackMergeOperator in the heartbeat
collector is a duplicate of the tracking performed in
MergeDispatcherRequestHandler.Handle (MergeDispatcherRequestHandler.Handle
already tracks the same merge request); remove the early TrackMergeOperator
invocation from the heartbeat_collector code path (the
manager.(*DispatcherManager).TrackMergeOperator(mergeDispatcherRequest) call) so
the single canonical tracking remains in MergeDispatcherRequestHandler.Handle,
keeping idempotency while avoiding confusing redundant calls.
downstreamadapter/dispatchermanager/helper.go (1)

788-800: Redundant TrackMergeOperator call.

TrackMergeOperator is already invoked in RecvMessages (heartbeat_collector.go, line 295) before pushing to the dynamic stream. Calling it again here is idempotent but unnecessary. Consider removing one of the calls to avoid confusion—keeping it in the handler (here) is preferable since it's closer to the actual merge logic and ensures tracking even if the request arrives through a different path.

The nil-check and MaybeCleanupMergeOperator call for failed merges is correct and ensures no stale entries remain in mergeOperatorMap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatchermanager/helper.go` around lines 788 - 800,
Redundant call to TrackMergeOperator: remove the earlier invocation in
RecvMessages (heartbeat_collector.go) so that tracking happens only in this
handler before calling MergeDispatcher; keep the
TrackMergeOperator(dispatcherManager.TrackMergeOperator(mergeDispatcherRequest.MergeDispatcherRequest))
call here, ensure MergeDispatcher(...) and the nil-check that calls
MaybeCleanupMergeOperator(...) remain unchanged, and run tests to confirm no
behavior change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@downstreamadapter/dispatchermanager/dispatcher_manager_merge.go`:
- Around line 24-29: The TrackMergeOperator method currently stores requests
whose protobuf ID decodes to a zero-valued DispatcherID, causing all such
requests to collide; before storing into e.mergeOperatorMap call
mergedID.IsZero() and return if true (mirror the guard used in
MaybeCleanupMergeOperator), and do the same check wherever mergeOperatorMap is
written (e.g., in the second occurrence around MaybeCleanupMergeOperator
handling) so that cloneMergeDispatcherRequest(req) and the
mergeOperatorMap.Store only run for non-zero mergedID values.

In `@maintainer/maintainer_controller_bootstrap.go`:
- Around line 133-136: The restoreCurrentMergeOperators call always uses
buildTableSplitMap(tables) but must use a mode-specific split map because
restoreCurrentMergeOperators rebuilds merges based on mergeReq.Mode; when
redoStartTs != startTs redo-only tables need the redo-mode splitEnabled values.
Modify the bootstrap path that calls restoreCurrentMergeOperators so it computes
and passes a split map appropriate for the merge mode (e.g., choose
buildTableSplitMap(tables) for default mode and buildTableSplitMap(redoTables)
or a map derived for redo mode when mergeReq.Mode indicates redo), ensuring
restoreCurrentMergeOperators receives the mode-specific split map rather than
the always-normal tables view.
- Around line 956-963: The code assumes spanInfo and mergedSpanInfo are non-nil
and dereferences spanInfo.Span.TableID (and mergedSpanInfo.Span.TableID) causing
panics for malformed bootstrap entries; update the bootstrap handling (where
indexBootstrapSpans / spanInfoByID are read) to nil-check the outer entry and
its inner Span before accessing fields: e.g., before using spanInfo.Span.TableID
or passing spanInfo to
spanController.ShouldEnableSplit/createSpanReplication/AddReplicatingSpan,
return or skip the entry (set sourceComplete=false or continue) when
spanInfo==nil || spanInfo.Span==nil (and similarly for mergedSpanInfo), and
ensure any subsequent logic that relies on a non-nil Span only runs after these
guards.

---

Nitpick comments:
In `@downstreamadapter/dispatchermanager/heartbeat_collector.go`:
- Around line 294-296: The call to DispatcherManager.TrackMergeOperator in the
heartbeat collector is a duplicate of the tracking performed in
MergeDispatcherRequestHandler.Handle (MergeDispatcherRequestHandler.Handle
already tracks the same merge request); remove the early TrackMergeOperator
invocation from the heartbeat_collector code path (the
manager.(*DispatcherManager).TrackMergeOperator(mergeDispatcherRequest) call) so
the single canonical tracking remains in MergeDispatcherRequestHandler.Handle,
keeping idempotency while avoiding confusing redundant calls.

In `@downstreamadapter/dispatchermanager/helper.go`:
- Around line 788-800: Redundant call to TrackMergeOperator: remove the earlier
invocation in RecvMessages (heartbeat_collector.go) so that tracking happens
only in this handler before calling MergeDispatcher; keep the
TrackMergeOperator(dispatcherManager.TrackMergeOperator(mergeDispatcherRequest.MergeDispatcherRequest))
call here, ensure MergeDispatcher(...) and the nil-check that calls
MaybeCleanupMergeOperator(...) remain unchanged, and run tests to confirm no
behavior change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a0daa28-8fc8-4a2c-b916-05a14a5723d0

📥 Commits

Reviewing files that changed from the base of the PR and between ed4bfaa and ba7dd6d.

⛔ Files ignored due to path filters (1)
  • heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (10)
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_merge.go
  • downstreamadapter/dispatchermanager/heartbeat_collector.go
  • downstreamadapter/dispatchermanager/helper.go
  • downstreamadapter/dispatchermanager/task.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • heartbeatpb/heartbeat.proto
  • maintainer/maintainer_controller_bootstrap.go
  • maintainer/operator/operator_controller.go
  • maintainer/operator/operator_merge.go

Comment on lines +24 to +29
func (e *DispatcherManager) TrackMergeOperator(req *heartbeatpb.MergeDispatcherRequest) {
	if req == nil || req.MergedDispatcherID == nil {
		return
	}
	mergedID := common.NewDispatcherIDFromPB(req.MergedDispatcherID)
	e.mergeOperatorMap.Store(mergedID.String(), cloneMergeDispatcherRequest(req))

⚠️ Potential issue | 🟠 Major

Reject zero-valued merged dispatcher IDs before touching mergeOperatorMap.

A non-nil protobuf ID can still decode to DispatcherID{}. Right now those requests all collapse to the same key, and the bootstrap side later treats that zero ID as a real merged dispatcher, so one malformed request can poison recovery state.

🔒 Suggested guard
 func (e *DispatcherManager) TrackMergeOperator(req *heartbeatpb.MergeDispatcherRequest) {
 	if req == nil || req.MergedDispatcherID == nil {
 		return
 	}
 	mergedID := common.NewDispatcherIDFromPB(req.MergedDispatcherID)
+	if mergedID.IsZero() {
+		log.Warn("merge operator has invalid merged dispatcher ID",
+			zap.Stringer("changefeedID", e.changefeedID))
+		return
+	}
 	e.mergeOperatorMap.Store(mergedID.String(), cloneMergeDispatcherRequest(req))
 }

Apply the same mergedID.IsZero() guard in MaybeCleanupMergeOperator.

Also applies to: 38-43

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@downstreamadapter/dispatchermanager/dispatcher_manager_merge.go` around lines
24 - 29, The TrackMergeOperator method currently stores requests whose protobuf
ID decodes to a zero-valued DispatcherID, causing all such requests to collide;
before storing into e.mergeOperatorMap call mergedID.IsZero() and return if true
(mirror the guard used in MaybeCleanupMergeOperator), and do the same check
wherever mergeOperatorMap is written (e.g., in the second occurrence around
MaybeCleanupMergeOperator handling) so that cloneMergeDispatcherRequest(req) and
the mergeOperatorMap.Store only run for non-zero mergedID values.

Comment on lines +133 to +136
// Restore merge operators after task state is rebuilt from bootstrap spans/operators.
// Merge restoration needs the per-dispatcher task map from buildTaskInfo, but must run
// before we discard any leftover working tasks as dropped-table artifacts.
if err := c.restoreCurrentMergeOperators(allNodesResp, buildTableSplitMap(tables)); err != nil {

⚠️ Potential issue | 🟡 Minor

Pass a mode-specific split map into merge restoration.

Line 136 always calls restoreCurrentMergeOperators with the normal tables view, but the restore path rebuilds both default and redo merges via mergeReq.Mode. If redoStartTs and startTs diverge, redo-only tables get recreated with the wrong splitEnabled flag after failover.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/maintainer_controller_bootstrap.go` around lines 133 - 136, The
restoreCurrentMergeOperators call always uses buildTableSplitMap(tables) but
must use a mode-specific split map because restoreCurrentMergeOperators rebuilds
merges based on mergeReq.Mode; when redoStartTs != startTs redo-only tables need
the redo-mode splitEnabled values. Modify the bootstrap path that calls
restoreCurrentMergeOperators so it computes and passes a split map appropriate
for the merge mode (e.g., choose buildTableSplitMap(tables) for default mode and
buildTableSplitMap(redoTables) or a map derived for redo mode when mergeReq.Mode
indicates redo), ensuring restoreCurrentMergeOperators receives the
mode-specific split map rather than the always-normal tables view.

Comment on lines +956 to +963
spanInfo := spanInfoByID[dispatcherID]
if spanInfo == nil {
	sourceComplete = false
	break
}
splitEnabled := spanController.ShouldEnableSplit(tableSplitMap[spanInfo.Span.TableID])
replicaSet = c.createSpanReplication(spanInfo, nodeID, splitEnabled)
spanController.AddReplicatingSpan(replicaSet)

⚠️ Potential issue | 🔴 Critical

Guard nil Span before dereferencing bootstrap entries.

indexBootstrapSpans only filters nil IDs, not nil Spans. The new recovery path then reads spanInfo.Span.TableID / mergedSpanInfo.Span.TableID in several places, so one malformed bootstrap snapshot can panic the maintainer during bootstrap.

🛡️ Suggested guard
 				replicaSet := spanController.GetTaskByID(dispatcherID)
 				if replicaSet == nil {
 					spanInfo := spanInfoByID[dispatcherID]
-					if spanInfo == nil {
+					if spanInfo == nil || spanInfo.Span == nil {
 						sourceComplete = false
 						break
 					}
 					splitEnabled := spanController.ShouldEnableSplit(tableSplitMap[spanInfo.Span.TableID])
 					replicaSet = c.createSpanReplication(spanInfo, nodeID, splitEnabled)
@@
-			mergedSpanInfo := spanInfoByID[mergedDispatcherID]
+			mergedSpanInfo := spanInfoByID[mergedDispatcherID]
+			if mergedSpanInfo != nil && mergedSpanInfo.Span == nil {
+				log.Warn("merge operator missing merged span, skip restoring it",
+					zap.String("nodeID", nodeID.String()),
+					zap.String("changefeed", resp.ChangefeedID.String()),
+					zap.String("dispatcher", mergedDispatcherID.String()))
+				continue
+			}

Also applies to: 972-974, 1001-1004, 1025-1027

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/maintainer_controller_bootstrap.go` around lines 956 - 963, The
code assumes spanInfo and mergedSpanInfo are non-nil and dereferences
spanInfo.Span.TableID (and mergedSpanInfo.Span.TableID) causing panics for
malformed bootstrap entries; update the bootstrap handling (where
indexBootstrapSpans / spanInfoByID are read) to nil-check the outer entry and
its inner Span before accessing fields: e.g., before using spanInfo.Span.TableID
or passing spanInfo to
spanController.ShouldEnableSplit/createSpanReplication/AddReplicatingSpan,
return or skip the entry (set sourceComplete=false or continue) when
spanInfo==nil || spanInfo.Span==nil (and similarly for mergedSpanInfo), and
ensure any subsequent logic that relies on a non-nil Span only runs after these
guards.
