
dispatcher,dispatchermanager: deduplicate pending block statuses#5028

Open
hongyunyan wants to merge 5 commits into pingcap:master from hongyunyan:codex/pr-4814-minus-reverted

Conversation


@hongyunyan hongyunyan commented May 12, 2026

What problem does this PR solve?

Issue Number: ref #0

This PR recreates the useful part of #4814 on top of the latest pingcap/ticdc:master, while excluding the effects of the following commits from that PR branch:

  • 632330358254d147d548d568711c62a9ee509e4d
  • 6cc8b8280cc7da0996c6c2433987fe6b73bf0d32

Background:

  • In large table-count DDL scenarios, dispatcher and maintainer status paths can accumulate many repeated pending block statuses.
  • Repeated DONE / WAITING statuses amplify local queue size and memory use before they are drained or sent.
  • The remaining changes keep the deduplication/buffering direction of dispatcher,dispatchermanager: deduplicate pending done statuses #4814, but drop that PR's skipped-syncpoint cleanup/shortcut changes and preserve the quiet-period behavior for fanout pass-action resends.

What is changed and how it works?

Summary:

  • Add BlockStatusBuffer between dispatchers and dispatcher manager to keep ordering while coalescing identical pending WAITING and DONE statuses (a simplified sketch follows this list).
  • Delay DONE protobuf materialization until the dispatcher manager drains the local buffer.
  • Route dispatcher resend tasks through the same block status buffer.
  • Batch and split block status requests in dispatcher manager so one request does not grow without bound.
  • Add queued/in-flight dedupe in BlockStatusRequestQueue, with explicit send-complete cleanup.
  • Buffer maintainer status requests outside the normal event channel and coalesce heartbeat, block status, and redo resolved-ts updates.
  • Keep maintainer heartbeat watermark merge semantics aligned with onHeartbeatRequest, including strict same-sequence checkpoint handling and max LastSyncedTs preservation.
  • Preserve DB/All fanout pass quiet resend behavior and tests.
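
For illustration only, here is a minimal Go sketch of the coalescing idea described in the first bullet above. It is not the PR's BlockStatusBuffer: blockStatus and key are stand-ins for heartbeatpb.TableSpanBlockStatus and the real reservation keys, and details such as DONE materialization, the full reservation lifecycle, and metrics are omitted.

package main

import (
	"context"
	"fmt"
	"sync"
)

// blockStatus is a simplified stand-in for heartbeatpb.TableSpanBlockStatus.
type blockStatus struct {
	dispatcherID string
	stage        string // "WAITING" or "DONE"
	blockTs      uint64
}

// key identifies a pending status for deduplication.
type key struct {
	dispatcherID string
	stage        string
	blockTs      uint64
}

// statusBuffer keeps FIFO order in a bounded channel while coalescing
// identical pending statuses: offering a status whose key is already
// pending is a no-op instead of another queue entry.
type statusBuffer struct {
	mu      sync.Mutex
	pending map[key]struct{}
	queue   chan blockStatus
}

func newStatusBuffer(capacity int) *statusBuffer {
	return &statusBuffer{
		pending: make(map[key]struct{}),
		queue:   make(chan blockStatus, capacity),
	}
}

// Offer enqueues the status unless an identical one is already pending.
func (b *statusBuffer) Offer(s blockStatus) {
	k := key{s.dispatcherID, s.stage, s.blockTs}
	b.mu.Lock()
	if _, dup := b.pending[k]; dup {
		b.mu.Unlock()
		return // coalesce: an identical status is still waiting to be drained
	}
	b.pending[k] = struct{}{}
	b.mu.Unlock()
	b.queue <- s
}

// Take blocks until a status is available (or ctx is cancelled) and releases
// its reservation so a later identical status can be offered again.
func (b *statusBuffer) Take(ctx context.Context) (blockStatus, error) {
	select {
	case s := <-b.queue:
		b.mu.Lock()
		delete(b.pending, key{s.dispatcherID, s.stage, s.blockTs})
		b.mu.Unlock()
		return s, nil
	case <-ctx.Done():
		return blockStatus{}, ctx.Err()
	}
}

func main() {
	buf := newStatusBuffer(16)
	s := blockStatus{dispatcherID: "d1", stage: "WAITING", blockTs: 100}
	buf.Offer(s)
	buf.Offer(s) // dropped: an identical status is still pending
	got, _ := buf.Take(context.Background())
	fmt.Println(got, "remaining:", len(buf.queue)) // the duplicate was coalesced
}

The point of the sketch is only that a duplicate Offer becomes a no-op while an identical status is pending, which is what bounds local queue growth in the large table-count DDL scenario described above.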

Check List

Tests

  • Unit test

    • make fmt
    • git diff --check upstream/master...HEAD
    • go test -count=1 ./maintainer
    • go test -count=1 ./downstreamadapter/dispatchermanager
    • go test -count=1 ./downstreamadapter/dispatcher -run 'TestBlockStatusBuffer|TestIgnoredBlockStatus|TestBlockingDDLFlushBeforeWaitingAndWriteDoesNotFlushAgain|TestHandleDispatcherStatus|TestDealWithBlockEvent'

Note: go test -count=1 --tags=intest ./downstreamadapter/dispatcher -run '^TestRedoBatchDMLEventsPartialFlush$' fails in this local environment on both this branch and latest upstream/master because /tmp/tidb is owned by root:root with mode 700, causing stat /tmp/tidb/tmp_ddl-4000: permission denied during TiDB DDL initialization.

Questions

Will it cause performance regression or break compatibility?

No protocol or persistent format change is introduced. The change coalesces duplicate pending local statuses and bounds block status request size. It is intended to reduce memory pressure and local queue amplification.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None

Summary by CodeRabbit

  • Bug Fixes

    • Deduplicated identical pending WAITING and DONE block-status messages.
    • Fixed oversized block-status batches by splitting them into multiple requests.
  • Improvements

    • Added buffering and coalescing of block-status messages for more efficient delivery and reduced noise.
    • Throttled fanout pass-action resends with a quiet-period to reduce redundant updates.
    • Improved metrics for monitoring block-status queue length and flow.
  • Tests

    • Added unit tests covering buffer deduplication, ordering, and request-queue dedupe.

Review Change Stack

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/needs-linked-issue and release-note-none (denotes a PR that doesn't merit a release note) labels May 12, 2026

ti-chi-bot Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign flowbehappy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) label May 12, 2026

coderabbitai Bot commented May 12, 2026

Warning

Rate limit exceeded

@hongyunyan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 11 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b5a6b0d7-2c32-4bd9-977d-114d8a14d9e5

📥 Commits

Reviewing files that changed from the base of the PR and between 2d690a7 and b363f40.

📒 Files selected for processing (7)
  • downstreamadapter/dispatcher/basic_dispatcher.go
  • downstreamadapter/dispatcher/basic_dispatcher_info.go
  • downstreamadapter/dispatcher/block_status_buffer.go
  • downstreamadapter/dispatcher/block_status_buffer_test.go
  • downstreamadapter/dispatcher/event_dispatcher_test.go
  • downstreamadapter/dispatcher/helper.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
📝 Walkthrough

Walkthrough

Adds a bounded BlockStatusBuffer that deduplicates pending WAITING/DONE statuses and replaces direct channel usage with Offer/Take APIs on SharedInfo/Dispatcher; adds BlockStatusRequestQueue deduplication with in-flight tracking and batching/splitting in dispatcher manager; adds fanout pass-action quiet-interval throttling in BarrierEvent and related tests.

Changes

Block Status Buffer Infrastructure and Dispatcher Refactoring

  • Block status buffer implementation with deduplication
    Files: downstreamadapter/dispatcher/block_status_buffer.go, downstreamadapter/dispatcher/block_status_buffer_test.go
    BlockStatusBuffer stores messages in a bounded channel and coalesces duplicate pending WAITING/DONE statuses via reservation keys; it provides Offer, blocking Take(ctx), and Len() plus a reservation lifecycle. Unit tests validate dedupe, re-offer after take, ordering, and distinct-key handling.
  • SharedInfo and Dispatcher Offer/Take APIs
    Files: downstreamadapter/dispatcher/basic_dispatcher_info.go, downstreamadapter/dispatcher/basic_dispatcher.go
    SharedInfo replaces blockStatusesChan with blockStatusBuffer and exposes OfferBlockStatus, TakeBlockStatus(ctx), and BlockStatusLen; BasicDispatcher adds wrapper Offer/Take methods and the Dispatcher interface gains OfferBlockStatus; GetBlockStatusesChan is removed.
  • Block status delivery refactor across dispatcher
    Files: downstreamadapter/dispatcher/basic_dispatcher.go, downstreamadapter/dispatcher/helper.go, downstreamadapter/dispatcher/event_dispatcher.go
    Dispatcher code paths (HandleDispatcherStatus, reportBlockedEventDone, DealWithBlockEvent, reportBlockedEventToMaintainer, ResendTask.Execute) now construct BlockStatusEntry values and call OfferBlockStatus instead of writing directly to the shared channel; small comment trim in event_dispatcher.
  • Event dispatcher tests updated
    Files: downstreamadapter/dispatcher/event_dispatcher_test.go
    Tests build the new SharedInfo/buffer, use the takeBlockStatusWithTimeout helper, and replace direct channel selects with timeout-based assertions across blocking/flush-related tests (a simplified version of such a helper is sketched after this table).
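
As a rough, self-contained illustration of the timeout-based assertion style mentioned for the updated tests, the following sketch mirrors what a helper like takeBlockStatusWithTimeout could look like; the buffer and blockStatus types here are simplified assumptions, not the PR's actual code.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// blockStatus is a simplified stand-in for the real block-status message.
type blockStatus struct{ dispatcherID, stage string }

// buffer models only the blocking Take(ctx) call that the helper wraps.
type buffer struct{ ch chan blockStatus }

func (b *buffer) Take(ctx context.Context) (blockStatus, error) {
	select {
	case s := <-b.ch:
		return s, nil
	case <-ctx.Done():
		return blockStatus{}, ctx.Err()
	}
}

// takeBlockStatusWithTimeout bounds the blocking Take with a deadline so a
// test can assert "a status arrived within d" or "nothing arrived" without
// selecting on a raw channel.
func takeBlockStatusWithTimeout(b *buffer, d time.Duration) (blockStatus, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	s, err := b.Take(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return blockStatus{}, false
	}
	return s, err == nil
}

func main() {
	b := &buffer{ch: make(chan blockStatus, 1)}
	b.ch <- blockStatus{dispatcherID: "d1", stage: "WAITING"}
	if s, ok := takeBlockStatusWithTimeout(b, 50*time.Millisecond); ok {
		fmt.Println("took", s)
	}
	if _, ok := takeBlockStatusWithTimeout(b, 50*time.Millisecond); !ok {
		fmt.Println("no further status within the timeout")
	}
}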

Request Queue Deduplication and Dispatcher Manager Integration

  • Block status request queue deduplication with in-flight tracking
    Files: downstreamadapter/dispatchermanager/heartbeat_queue.go, downstreamadapter/dispatchermanager/heartbeat_queue_test.go
    BlockStatusRequestQueue adds mutex-protected pending and in-flight maps and dedupe keying to suppress duplicate WAITING/DONE statuses already queued or in flight; Enqueue filters statuses in place, Dequeue marks keys in-flight, and OnSendComplete clears in-flight state. Tests verify queued/in-flight dedupe and key distinctions (a simplified sketch of this queue follows the table).
  • Dispatcher manager block-status batching and collection
    Files: downstreamadapter/dispatchermanager/dispatcher_manager.go, downstreamadapter/dispatchermanager/dispatcher_manager_test.go, downstreamadapter/dispatchermanager/heartbeat_collector.go
    Introduces maxBlockStatusesPerRequest (2048); collectBlockStatusRequest drains statuses via TakeBlockStatus with 10ms windows, separates default/redo modes, updates the block-status queue length metric via BlockStatusLen(), and splits oversized batches via an enqueue helper; sendBlockStatusMessages calls OnSendComplete after each request; tests validate the splitting behavior.
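
A simplified, self-contained sketch of the queued/in-flight dedupe described in the table above. The type and field names (requestQueue, statusKey, and so on) are assumptions for illustration; per-target batching, modes, and metrics are omitted.

package main

import (
	"fmt"
	"sync"
)

// statusKey identifies a WAITING/DONE status for deduplication.
type statusKey struct {
	dispatcherID string
	stage        string
	blockTs      uint64
}

type status struct{ key statusKey }

// requestQueue suppresses statuses whose key is already queued or in flight:
// Enqueue filters duplicates in place, Dequeue moves keys to the in-flight
// set, and OnSendComplete releases them once the request has been sent.
type requestQueue struct {
	mu       sync.Mutex
	queued   map[statusKey]struct{}
	inFlight map[statusKey]struct{}
	batches  [][]status
}

func newRequestQueue() *requestQueue {
	return &requestQueue{
		queued:   make(map[statusKey]struct{}),
		inFlight: make(map[statusKey]struct{}),
	}
}

func (q *requestQueue) Enqueue(batch []status) {
	q.mu.Lock()
	defer q.mu.Unlock()
	kept := batch[:0] // filter in place, mirroring the described behavior
	for _, s := range batch {
		if _, dup := q.queued[s.key]; dup {
			continue
		}
		if _, dup := q.inFlight[s.key]; dup {
			continue
		}
		q.queued[s.key] = struct{}{}
		kept = append(kept, s)
	}
	if len(kept) > 0 {
		q.batches = append(q.batches, kept)
	}
}

func (q *requestQueue) Dequeue() ([]status, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.batches) == 0 {
		return nil, false
	}
	batch := q.batches[0]
	q.batches = q.batches[1:]
	for _, s := range batch {
		delete(q.queued, s.key)
		q.inFlight[s.key] = struct{}{}
	}
	return batch, true
}

func (q *requestQueue) OnSendComplete(batch []status) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for _, s := range batch {
		delete(q.inFlight, s.key)
	}
}

func main() {
	q := newRequestQueue()
	k := statusKey{"d1", "DONE", 100}
	q.Enqueue([]status{{key: k}})
	q.Enqueue([]status{{key: k}}) // dropped: already queued
	batch, _ := q.Dequeue()
	q.Enqueue([]status{{key: k}}) // dropped: still in flight
	q.OnSendComplete(batch)
	q.Enqueue([]status{{key: k}}) // accepted again after the send completes
	fmt.Println("pending batches:", len(q.batches))
}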

Barrier Event Fanout Pass-Action Throttling and Maintainer Logging

  • Fanout pass-action throttling with quiet-interval gating
    Files: maintainer/barrier_event.go
    Adds fanoutPassResendQuietInterval and tracks lastStatusReceivedTime and passActionSent on BarrierEvent; markStatusReceived() updates the receipt time; resend logic throttles fanout (All/DB) pass resends within the quiet interval and resets passActionSent when a writer is selected; minor log wording updates (a simplified sketch of the gating follows this table).
  • Barrier event fanout throttling validation tests
    Files: maintainer/barrier_event_test.go
    Two tests assert that fanout pass resends wait for the quiet interval after status receipt and that normal pass resends are not gated when passActionSent is set.
  • Maintainer event logging simplification
    Files: maintainer/maintainer.go
    Simplifies slow-event logging for EventMessage to structured fields from, to, type, topic instead of dumping the full message object.
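
A minimal sketch of the quiet-interval gating referenced in the first entry of this table. The constant and field names echo those mentioned above (fanoutPassResendQuietInterval, passActionSent, lastStatusReceivedTime), but the logic is simplified and is not the actual maintainer code.

package main

import (
	"fmt"
	"time"
)

// fanoutPassResendQuietInterval: after a status is received, fanout (All/DB)
// pass actions are not resent until this much quiet time has passed.
const fanoutPassResendQuietInterval = 500 * time.Millisecond

type barrierEvent struct {
	isFanout           bool // DB/All fanout blocked event
	passActionSent     bool
	lastStatusReceived time.Time
}

// markStatusReceived records that a dispatcher just reported status for this event.
func (e *barrierEvent) markStatusReceived(now time.Time) {
	e.lastStatusReceived = now
}

// shouldResendPassAction gates fanout pass-action resends: within the quiet
// interval after the latest status receipt, an already-sent pass action is
// not resent again; non-fanout events and first sends are never gated.
func (e *barrierEvent) shouldResendPassAction(now time.Time) bool {
	if !e.isFanout || !e.passActionSent {
		return true
	}
	return now.Sub(e.lastStatusReceived) >= fanoutPassResendQuietInterval
}

func main() {
	now := time.Now()
	e := &barrierEvent{isFanout: true, passActionSent: true}
	e.markStatusReceived(now)
	fmt.Println(e.shouldResendPassAction(now.Add(100 * time.Millisecond))) // false: still within the quiet interval
	fmt.Println(e.shouldResendPassAction(now.Add(time.Second)))            // true: quiet interval has elapsed
}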

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • pingcap/ticdc#4663: Both PRs modify the same dispatcher API and implementation (SharedInfo/NewSharedInfo and BasicDispatcher in downstreamadapter/dispatcher/*), changing constructor parameters and dispatcher accessors, so they are related at the code level.
  • pingcap/ticdc#4389: Modifies BasicDispatcher block-event and reportBlockedEventDone flows; related to centralizing block-status offers.

Suggested labels

lgtm, approved

Suggested reviewers

  • wk989898
  • flowbehappy

🐰 I buffered the beats of a dispatcher's heart,
Quieted fanouts till responses start.
Duplicates vanish, requests line and blend,
Tests confirm order, and resends mend.
Hop, review, and merge — from rabbit, cheers!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 17.31%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check ✅ Passed: the title 'dispatcher,dispatchermanager: deduplicate pending block statuses' is specific, directly related to the main change (deduplication of block statuses), and clearly communicates the primary focus of the changeset.
  • Description check ✅ Passed: the PR description addresses the template requirements: it identifies the problem (repeated pending block statuses in large DDL scenarios), explains what changed (addition of BlockStatusBuffer, batching, deduplication), includes testing notes, and answers compatibility questions.
  • Linked Issues check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.






@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes the handling of dispatcher block statuses and heartbeats by introducing a BlockStatusBuffer that coalesces identical WAITING and DONE statuses to reduce memory amplification. It also implements deduplication for queued and in-flight status requests in the BlockStatusRequestQueue and adds a buffering/merging mechanism in the Maintainer to process incoming status updates more efficiently. Batch sizes for status requests are now capped, and resending of fan-out pass actions is throttled. Review feedback suggests that the Maintainer should buffer status requests even before initialization is finished to avoid unnecessary delays caused by dropping early messages.

Comment thread maintainer/status_request_buffer.go Outdated
Comment on lines +87 to +90
if !m.initialized.Load() {
	m.recordDroppedStatusRequest(event.message.Type, "maintainer not initialized")
	return true, false
}


medium

Dropping heartbeats and block status requests when the maintainer is not yet initialized can lead to unnecessary delays in DDL processing and status tracking. While dispatchers will eventually resend these statuses (typically every 1 second), it would be more efficient to buffer them even before initialization is complete, and then process the accumulated statuses once the maintainer is ready. This avoids waiting for the next resend cycle.
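
One possible shape of this suggestion, sketched only for illustration; the earlyBuffer type and its methods below are hypothetical and are not part of this PR's status_request_buffer implementation.

package main

import (
	"fmt"
	"sync"
)

// earlyBuffer accumulates status requests that arrive before the maintainer
// finishes initialization, instead of dropping them, and replays them once
// the maintainer is ready.
type earlyBuffer struct {
	mu          sync.Mutex
	initialized bool
	pending     []string // stand-in for buffered heartbeat/block-status requests
}

// offer buffers the request while uninitialized and reports whether the
// caller should instead handle it immediately.
func (b *earlyBuffer) offer(req string) (handleNow bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.initialized {
		return true
	}
	b.pending = append(b.pending, req)
	return false
}

// markInitialized flips the flag and returns everything buffered so far, so
// the caller can process it without waiting for the next resend cycle.
func (b *earlyBuffer) markInitialized() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.initialized = true
	buffered := b.pending
	b.pending = nil
	return buffered
}

func main() {
	b := &earlyBuffer{}
	fmt.Println(b.offer("heartbeat from d1"))  // false: buffered, not dropped
	fmt.Println(b.offer("block DONE from d2")) // false: buffered
	for _, req := range b.markInitialized() {
		fmt.Println("replay:", req) // processed as soon as the maintainer is ready
	}
	fmt.Println(b.offer("heartbeat from d3")) // true: handled directly now
}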


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
downstreamadapter/dispatcher/block_status_buffer.go (1)

61-71: ⚡ Quick win

Avoid recomputing WAITING dedup keys during dequeue.

materialize currently derives the WAITING key from entry.status again. If that protobuf gets mutated after Offer, the original pending key may never be removed, causing persistent over-deduplication for that key.

♻️ Suggested localized fix
 type blockStatusQueueEntry struct {
 	status  *heartbeatpb.TableSpanBlockStatus
+	waitingKey *blockStatusKey
 	doneKey *blockStatusKey
 }

 func (b *BlockStatusBuffer) Offer(status *heartbeatpb.TableSpanBlockStatus) {
@@
 	if isWaitingBlockStatus(status) {
 		key := newBlockStatusKey(status)
 		if !b.reserveWaiting(key) {
 			return
 		}
-		b.queue <- blockStatusQueueEntry{status: status}
+		b.queue <- blockStatusQueueEntry{status: status, waitingKey: &key}
 		return
 	}
@@
 func (b *BlockStatusBuffer) materialize(entry blockStatusQueueEntry) *heartbeatpb.TableSpanBlockStatus {
 	if entry.status != nil {
-		if isWaitingBlockStatus(entry.status) {
-			key := newBlockStatusKey(entry.status)
-			b.mu.Lock()
-			delete(b.pendingWaiting, key)
-			b.mu.Unlock()
-		}
+		if entry.waitingKey != nil {
+			b.mu.Lock()
+			delete(b.pendingWaiting, *entry.waitingKey)
+			b.mu.Unlock()
+		}
 		return entry.status
 	}

Also applies to: 145-151

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatcher/block_status_buffer.go` around lines 61 - 71,
The WAITING dedup key is recomputed from entry.status during dequeue (in
materialize), which breaks if the protobuf is mutated after Offer; modify
blockStatusQueueEntry to include and carry the precomputed waiting key (created
by newBlockStatusKey in Offer when isWaitingBlockStatus is true), push that key
into the queue entry in Offer (instead of recomputing later), and update
materialize and any dequeue logic (the code referencing entry.status to
re-derive the key around reserveWaiting/newBlockStatusKey usage) to use the
carried key for removing the pending reservation; ensure functions/structs
touched include BlockStatusBuffer.Offer, blockStatusQueueEntry, reserveWaiting,
and materialize so the original pending key is reliably removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@downstreamadapter/dispatchermanager/dispatcher_manager.go`:
- Around line 612-621: The loop is assigning message.BlockStatuses =
blockStatusMessage[start:end] which retains the large backing array; instead
allocate a new slice and copy the chunk contents before enqueuing to avoid
holding the full backing array in memory. For each iteration, create a new slice
(length = end-start), copy blockStatusMessage[start:end] into it, set
message.BlockStatuses to that new slice, then wrap into
BlockStatusRequestWithTargetID and call e.blockStatusRequestQueue.Enqueue; keep
the other fields (message.ChangefeedID = e.changefeedID.ToPB(), message.Mode =
mode, TargetID via e.GetMaintainerID()) unchanged.

In `@maintainer/status_request_buffer.go`:
- Around line 245-261: The drain logic is dropping buffered heartbeats that only
carry the completion flag; update the condition that decides to emit an event so
that a HeartBeatRequest with entry.completeStatus set is also forwarded. Locate
the loop over b.heartbeats in drain that constructs heartbeatpb.HeartBeatRequest
(fields: ChangefeedID, Statuses, Watermark, RedoWatermark, Err, CompeleteStatus)
and modify the final if-check (currently testing len(req.Statuses),
req.Watermark, req.RedoWatermark, req.Err) to also include req.CompeleteStatus
(or entry.completeStatus) so that mergeHeartbeat-preserved completion signals
are not dropped. Ensure newBufferedStatusEvent still receives the request
unchanged.
- Around line 118-121: Buffered status events are being replayed even after
shutdown begins; update handleBufferedStatusRequests to drop any buffered events
once m.removing or m.removed is set instead of calling HandleEvent on them.
Concretely, after calling takeBufferedStatusRequestEvents() (or before
processing each event) check the maintainer flags m.removing || m.removed and if
true discard the events (return/skip loop) so no stale heartbeat/block-status
events call HandleEvent; you can also filter the slice to keep events only while
removal is false, but ensure no calls to HandleEvent occur for dropped events.

---

Nitpick comments:
In `@downstreamadapter/dispatcher/block_status_buffer.go`:
- Around line 61-71: The WAITING dedup key is recomputed from entry.status
during dequeue (in materialize), which breaks if the protobuf is mutated after
Offer; modify blockStatusQueueEntry to include and carry the precomputed waiting
key (created by newBlockStatusKey in Offer when isWaitingBlockStatus is true),
push that key into the queue entry in Offer (instead of recomputing later), and
update materialize and any dequeue logic (the code referencing entry.status to
re-derive the key around reserveWaiting/newBlockStatusKey usage) to use the
carried key for removing the pending reservation; ensure functions/structs
touched include BlockStatusBuffer.Offer, blockStatusQueueEntry, reserveWaiting,
and materialize so the original pending key is reliably removed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c475db00-27be-4f38-bd91-2e5a3d3e7fee

📥 Commits

Reviewing files that changed from the base of the PR and between ba9fcf5 and 3242aff.

📒 Files selected for processing (17)
  • downstreamadapter/dispatcher/basic_dispatcher.go
  • downstreamadapter/dispatcher/basic_dispatcher_info.go
  • downstreamadapter/dispatcher/block_status_buffer.go
  • downstreamadapter/dispatcher/block_status_buffer_test.go
  • downstreamadapter/dispatcher/event_dispatcher.go
  • downstreamadapter/dispatcher/event_dispatcher_test.go
  • downstreamadapter/dispatcher/helper.go
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
  • downstreamadapter/dispatchermanager/heartbeat_collector.go
  • downstreamadapter/dispatchermanager/heartbeat_queue.go
  • downstreamadapter/dispatchermanager/heartbeat_queue_test.go
  • maintainer/barrier_event.go
  • maintainer/barrier_event_test.go
  • maintainer/maintainer.go
  • maintainer/maintainer_test.go
  • maintainer/status_request_buffer.go
💤 Files with no reviewable changes (1)
  • downstreamadapter/dispatcher/event_dispatcher.go

Comment on lines +612 to +621
for start := 0; start < len(blockStatusMessage); start += maxBlockStatusesPerRequest {
	end := start + maxBlockStatusesPerRequest
	if end > len(blockStatusMessage) {
		end = len(blockStatusMessage)
	}
	var message heartbeatpb.BlockStatusRequest
	message.ChangefeedID = e.changefeedID.ToPB()
	message.BlockStatuses = blockStatusMessage[start:end]
	message.Mode = mode
	e.blockStatusRequestQueue.Enqueue(&BlockStatusRequestWithTargetID{TargetID: e.GetMaintainerID(), Request: &message})


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Split chunks should not share one large backing slice.

At Line 619, message.BlockStatuses = blockStatusMessage[start:end] keeps references to the full batch backing array. Under queueing pressure, this can retain much more memory than intended even after splitting.

🛠️ Suggested fix
 		for start := 0; start < len(blockStatusMessage); start += maxBlockStatusesPerRequest {
 			end := start + maxBlockStatusesPerRequest
 			if end > len(blockStatusMessage) {
 				end = len(blockStatusMessage)
 			}
+			chunk := make([]*heartbeatpb.TableSpanBlockStatus, end-start)
+			copy(chunk, blockStatusMessage[start:end])
 			var message heartbeatpb.BlockStatusRequest
 			message.ChangefeedID = e.changefeedID.ToPB()
-			message.BlockStatuses = blockStatusMessage[start:end]
+			message.BlockStatuses = chunk
 			message.Mode = mode
 			e.blockStatusRequestQueue.Enqueue(&BlockStatusRequestWithTargetID{TargetID: e.GetMaintainerID(), Request: &message})
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatchermanager/dispatcher_manager.go` around lines 612 -
621, The loop is assigning message.BlockStatuses = blockStatusMessage[start:end]
which retains the large backing array; instead allocate a new slice and copy the
chunk contents before enqueuing to avoid holding the full backing array in
memory. For each iteration, create a new slice (length = end-start), copy
blockStatusMessage[start:end] into it, set message.BlockStatuses to that new
slice, then wrap into BlockStatusRequestWithTargetID and call
e.blockStatusRequestQueue.Enqueue; keep the other fields (message.ChangefeedID =
e.changefeedID.ToPB(), message.Mode = mode, TargetID via e.GetMaintainerID())
unchanged.

Comment thread maintainer/status_request_buffer.go Outdated
Comment thread maintainer/status_request_buffer.go Outdated
Comment on lines +245 to +261
for from, entry := range b.heartbeats {
	req := &heartbeatpb.HeartBeatRequest{
		ChangefeedID:    entry.changefeedID,
		Statuses:        make([]*heartbeatpb.TableSpanStatus, 0, len(entry.order)),
		Watermark:       entry.watermark,
		RedoWatermark:   entry.redoWatermark,
		Err:             entry.err,
		CompeleteStatus: entry.completeStatus,
	}
	for _, key := range entry.order {
		if status, ok := entry.statuses[key]; ok {
			req.Statuses = append(req.Statuses, status)
		}
	}
	if len(req.Statuses) > 0 || req.Watermark != nil || req.RedoWatermark != nil || req.Err != nil {
		events = append(events, newBufferedStatusEvent(changefeedID, from, messaging.TypeHeartBeatRequest, req))
	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't discard heartbeat batches whose only signal is CompeleteStatus.

mergeHeartbeat preserves entry.completeStatus, but drain only emits a heartbeat event when statuses, watermarks, or Err exist. A buffered heartbeat with just CompeleteStatus=true is dropped here, so that flag never reaches normal handling.

Suggested fix
-		if len(req.Statuses) > 0 || req.Watermark != nil || req.RedoWatermark != nil || req.Err != nil {
+		if len(req.Statuses) > 0 || req.Watermark != nil || req.RedoWatermark != nil || req.Err != nil || req.CompeleteStatus {
 			events = append(events, newBufferedStatusEvent(changefeedID, from, messaging.TypeHeartBeatRequest, req))
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maintainer/status_request_buffer.go` around lines 245 - 261, The drain logic
is dropping buffered heartbeats that only carry the completion flag; update the
condition that decides to emit an event so that a HeartBeatRequest with
entry.completeStatus set is also forwarded. Locate the loop over b.heartbeats in
drain that constructs heartbeatpb.HeartBeatRequest (fields: ChangefeedID,
Statuses, Watermark, RedoWatermark, Err, CompeleteStatus) and modify the final
if-check (currently testing len(req.Statuses), req.Watermark, req.RedoWatermark,
req.Err) to also include req.CompeleteStatus (or entry.completeStatus) so that
mergeHeartbeat-preserved completion signals are not dropped. Ensure
newBufferedStatusEvent still receives the request unchanged.


ti-chi-bot Bot commented May 12, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.


Labels

do-not-merge/needs-linked-issue, release-note-none (denotes a PR that doesn't merit a release note), size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant