Title: redo gating caches mixed DML batches and leaves stale rows after failover in fail_over_ddl_mix_random_delay
What did you do?
Observed a CI failure in the integration test fail_over_ddl_mix_random_delay.
Failed CI artifact:
https://prow.tidb.net/jenkins/job/pingcap/job/ticdc/job/pull_cdc_mysql_integration_heavy/660/artifact/log-G13.tar.gz
I analyzed the extracted logs under:
/tmp/ci-logs-failover-ddl-mix/tmp/tidb_cdc_test/fail_over_ddl_mix_random_delay
The key observation is:
- The failure is not caused by a partial MySQL sink commit or an incorrect sink ACK.
- The source update events for the mismatched rows definitely exist.
- The default dispatcher for test.table_2 does receive those events after failover.
- The failure happens because redo gating is applied to the whole dispatcherEvents batch instead of splitting the batch by redoGlobalTs.
Detailed analysis:
1. Failure symptom
sync_diff reports a mismatch only on test.table_2, and the row count is still equal on both sides (439 vs 439), so this is a stale-value problem rather than missing rows.
The mismatched rows are:
id=493
id=495
id=499
id=501
id=503
id=505
For all of them:
- upstream value: complex_1772943173_21640_2992
- downstream value: still the old insert_* value
Relevant evidence:
sync_diff/output/sync_diff.log:118
sync_diff/output/sync_diff.log:120
sync_diff/output/sync_diff.log:122
sync_diff/output/sync_diff.log:124
sync_diff/output/sync_diff.log:126
sync_diff/output/sync_diff.log:128
sync_diff/output/sync_diff.log:130
2. The source update event definitely exists
The update that changes those rows to complex_1772943173_21640_2992 is present in the log at cdc0-restart-9.log:9209.
That DMLEvent is for:
- dispatcher: 63625024539555550195872247653604682627
- table: test.table_2
- commitTs=464766415209758732
It explicitly contains these row updates:
493: insert_1772943170_30594_2960 -> complex_1772943173_21640_2992
495: insert_1772943170_7579_2962 -> complex_1772943173_21640_2992
499: insert_1772943170_15778_2966 -> complex_1772943173_21640_2992
501: insert_1772943170_29650_2969 -> complex_1772943173_21640_2992
503: insert_1772943171_19699_2977 -> complex_1772943173_21640_2992
505: insert_1772943172_17582_2979 -> complex_1772943173_21640_2992
So this is not an upstream event generation problem.
3. The default dispatcher receives the event after failover
For this case, mode=0 is the default dispatcher and mode=1 is the redo dispatcher (pkg/common/types.go:334-345).
The dispatcher that actually matters for downstream MySQL consistency is the default dispatcher: 63625024539555550195872247653604682627
After failover, this dispatcher is reset and starts replaying from cdc1-restart-8.log:20455.
It then receives the critical event sequence around the failing update (cdc1-restart-8.log:20814).
That segment shows the dispatcher receiving:
- seq 20 -> commitTs=464766415209758732
- seq 22 -> commitTs=464766415275032595
- seq 28 -> commitTs=464766415485009927
- seq 30 -> commitTs=464766415576498198
- seq 31 -> commitTs=464766415642034197
So this is not a case where failover dropped the event before it reached the new dispatcher.
4. The actual failure point: the whole mixed batch is cached
The key log is cdc1-restart-8.log:20958.
It shows:
- dispatcher: 63625024539555550195872247653604682627
- dispatcherResolvedTs=464766414816542768
- length=152
- last event commitTs=464766420767211526
- redoGlobalTs=464766417673650203
Current code:
downstreamadapter/dispatcher/event_dispatcher.go:135-140
if d.redoEnable && len(dispatcherEvents) > 0 &&
	d.redoGlobalTs.Load() < dispatcherEvents[len(dispatcherEvents)-1].Event.GetCommitTs() {
	d.cache(dispatcherEvents, wakeCallback)
	return true
}
This logic only checks the last event in the batch.
That means:
- if the last event in the batch is newer than redoGlobalTs,
- the whole batch is cached,
- even if the front part of the batch already has commitTs <= redoGlobalTs and is safe to process.
In this failure, the batch is clearly mixed:
- it already contains commitTs=464766415209758732
- but the last event in the same batch is commitTs=464766420767211526
- and redoGlobalTs is 464766417673650203
So the update at 464766415209758732 should have been eligible, but it was cached together with later ineligible events.
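To make the gating arithmetic concrete, here is a minimal standalone sketch using the values above (the condition mirrors the quoted check, but the plain uint64 slice is illustrative, not the real dispatcher event type):

package main

import "fmt"

func main() {
	redoGlobalTs := uint64(464766417673650203)
	// A few commitTs values from the cached replay batch described above
	// (the real batch has length=152).
	batch := []uint64{
		464766415209758732, // the failing update, already <= redoGlobalTs
		464766415275032595,
		464766415485009927,
		464766420767211526, // last event, ahead of redoGlobalTs
	}

	// Current gating only looks at the last event, so the whole batch is cached.
	fmt.Println("whole batch cached:", redoGlobalTs < batch[len(batch)-1]) // true

	// Yet most of the batch is already eligible.
	eligible := 0
	for _, ts := range batch {
		if ts <= redoGlobalTs {
			eligible++
		}
	}
	fmt.Println("eligible events cached along with the tail:", eligible) // 3
}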
5. Why we can tell the update never flushed to MySQL
The dispatcher exits at cdc1-restart-8.log:22610 with:
checkpointTs=464766414921138204
resolvedTs=464766414816542768
The important point is:
checkpointTs is still smaller than the failing update commitTs 464766415209758732
This means the default dispatcher never advanced its flushed progress past that update.
That is consistent with the dispatcher progress semantics:
- inflight DML is tracked in downstreamadapter/dispatcher/table_progress.go:96-112
- checkpoint is derived from the earliest unflushed event in downstreamadapter/dispatcher/table_progress.go:174-185
- BasicDispatcher.GetCheckpointTs() reads from that structure in downstreamadapter/dispatcher/basic_dispatcher.go:511-522
- PostFlush() is what removes an event from inflight progress in pkg/common/event/dml_event.go:643-654
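For readers unfamiliar with that structure, a simplified self-contained sketch of the semantics described above (the type, method names, and the exact "min minus 1" detail are illustrative assumptions, not the actual table_progress code):

package main

import "fmt"

// tableProgress is an illustrative stand-in: it only keeps the commitTs of
// DML events that have been sent to the sink but not yet PostFlush-ed.
type tableProgress struct {
	inflight []uint64
	resolved uint64 // watermark reported when nothing is inflight
}

// checkpointTs cannot move past the earliest unflushed event.
func (p *tableProgress) checkpointTs() uint64 {
	if len(p.inflight) == 0 {
		return p.resolved
	}
	earliest := p.inflight[0]
	for _, ts := range p.inflight[1:] {
		if ts < earliest {
			earliest = ts
		}
	}
	return earliest - 1
}

func main() {
	p := &tableProgress{inflight: []uint64{105, 102, 103}, resolved: 100}
	fmt.Println(p.checkpointTs()) // 101: pinned just below the earliest unflushed commitTs (102)
}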
6. Why this is not a partial MySQL sink commit or a false ACK
This path can be ruled out from the code:
pkg/sink/mysql/mysql_writer.go:208-255
pkg/sink/mysql/mysql_writer_dml_exec.go:48-90
pkg/sink/mysql/mysql_writer_dml_exec.go:178-215
Facts:
- Writer.Flush() only calls event.PostFlush() after execDMLWithMaxRetries() succeeds.
- The sequential execution path runs in a SQL transaction (BeginTx + Commit).
- The multi-statement path also wraps SQL in BEGIN; ...; COMMIT;, and rolls back on error.
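A simplified sketch of that ordering using database/sql directly (this is not the actual mysql_writer code, only the commit-before-PostFlush pattern the referenced lines implement):

package sketch

import (
	"context"
	"database/sql"
)

// flushBatch executes all SQLs of one batch inside a single transaction and
// only marks the corresponding events as flushed after a successful commit.
func flushBatch(ctx context.Context, db *sql.DB, sqls []string, args [][]any, postFlush []func()) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	for i, q := range sqls {
		if _, err := tx.ExecContext(ctx, q, args[i]...); err != nil {
			_ = tx.Rollback() // any failure rolls back the whole batch
			return err
		}
	}
	if err := tx.Commit(); err != nil {
		return err
	}
	// Progress (and therefore checkpointTs) advances only after the commit succeeded.
	for _, cb := range postFlush {
		cb()
	}
	return nil
}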
So there is no normal path where:
- some SQLs in the batch succeed,
- the whole batch is marked flushed,
- and checkpoint advances incorrectly.
The failure is earlier: eligible DMLs were never processed because the whole replay batch was cached.
7. Important nuance: redo dispatcher is ahead, but the default dispatcher is not
The logs also show that redo is further ahead:
cdc1-restart-8.log:20354
cdc1-restart-8.log:22577
For example, redo metadata reaches:
resolvedTs=464766417673650203
and the redo dispatcher for the same table exits at:
checkpointTs=464766420898021376
But that does not mean downstream MySQL is correct.
The stale rows are explained by the mode=0 default dispatcher still being blocked behind coarse redo gating.
What did you expect to see?
Even when redo is enabled, the default dispatcher should process all events in a batch whose commitTs <= redoGlobalTs, and only delay the tail portion that is still ahead of redo progress.
For this test, test.table_2 should have converged to the upstream state, and the update at commitTs=464766415209758732 should have reached downstream MySQL.
What did you see instead?
The whole replay batch was cached because its last event was newer than redoGlobalTs.
As a result:
- the eligible update at commitTs=464766415209758732 did not flush to MySQL,
- six rows in test.table_2 remained at their older insert_* values,
- sync_diff failed with stale values.
Versions of the cluster
Upstream TiDB cluster version (from CI logs):
Release Version: v9.0.0-beta.2.pre-1314-g999e8f4
Git Commit Hash: 999e8f4ce9bae23c720fcfb4a5f43d7bae15c3e8
Store: tikv
Upstream TiKV version:
Not captured directly in the CI artifact. The TiDB version output reports Store: tikv.
TiCDC version (from cdc1-restart-8.log:3 / cdc0-restart-9.log:3):
release-version=v8.5.4-nextgen.202510.5-109-gc86a33f5
git-hash=c86a33f5ab730444554920554847ac374e450633
git-branch=HEAD
go-version=go1.25.5
Suggested fix
Minimal fix
Change redo gating in downstreamadapter/dispatcher/event_dispatcher.go to split a mixed dispatcherEvents batch at the first event whose commitTs > redoGlobalTs.
Expected behavior:
- process the prefix where commitTs <= redoGlobalTs
- cache only the suffix where commitTs > redoGlobalTs
This should preserve ordering while avoiding starvation of already-eligible DMLs.
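A minimal sketch of that split as a standalone helper (the event struct and function name are placeholders; it assumes events inside one batch are ordered by commitTs, which is what the prefix/suffix wording above relies on):

package sketch

// event is a stand-in for the dispatcher event wrapper; only commitTs matters here.
type event struct {
	commitTs uint64
}

// splitByRedoGlobalTs returns the prefix already covered by redo
// (commitTs <= redoGlobalTs) and the suffix that still has to wait.
func splitByRedoGlobalTs(events []event, redoGlobalTs uint64) (eligible, blocked []event) {
	split := len(events)
	for i, e := range events {
		if e.commitTs > redoGlobalTs {
			split = i
			break
		}
	}
	return events[:split], events[split:]
}

In the real code this would replace the last-event check quoted above: flush the eligible prefix through the existing path, call d.cache only on the blocked suffix, and return early only when the eligible prefix is empty.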
Safer follow-up
Add targeted tests for:
- redo-enabled replay with a mixed batch containing both eligible and ineligible DML events
- failover/replay of a default dispatcher where the replay batch crosses redoGlobalTs
- regression coverage for stale-row outcomes similar to fail_over_ddl_mix_random_delay
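For the first bullet, a sketch of what such a unit test could assert, reusing the hypothetical splitByRedoGlobalTs helper from the sketch above:

package sketch

import "testing"

func TestMixedBatchSplit(t *testing.T) {
	redoGlobalTs := uint64(464766417673650203)
	batch := []event{
		{commitTs: 464766415209758732}, // eligible, must not be cached
		{commitTs: 464766415275032595}, // eligible
		{commitTs: 464766420767211526}, // ahead of redo, must be cached
	}
	eligible, blocked := splitByRedoGlobalTs(batch, redoGlobalTs)
	if len(eligible) != 2 || len(blocked) != 1 {
		t.Fatalf("unexpected split: eligible=%d blocked=%d", len(eligible), len(blocked))
	}
	for _, e := range eligible {
		if e.commitTs > redoGlobalTs {
			t.Fatalf("commitTs %d treated as eligible but is ahead of redoGlobalTs", e.commitTs)
		}
	}
}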
Optional observability improvement
When caching due to redo gating, log:
- the first blocked commitTs
- the number of eligible events skipped in the same batch
- whether the batch is mixed or fully blocked
That would make future diagnosis much easier.
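A sketch of what that log line could look like, assuming the zap-based logger already used in the codebase (the function and field names are illustrative):

package sketch

import (
	"github.com/pingcap/log"
	"go.uber.org/zap"
)

// logRedoGatingCache reports how much of a cached batch was actually blocked.
// commitTsList holds the ordered commitTs of the batch being cached.
func logRedoGatingCache(dispatcherID string, commitTsList []uint64, redoGlobalTs uint64) {
	eligible := 0
	firstBlocked := uint64(0)
	for _, ts := range commitTsList {
		if ts <= redoGlobalTs {
			eligible++
			continue
		}
		firstBlocked = ts
		break
	}
	log.Info("cache dispatcher events due to redo gating",
		zap.String("dispatcher", dispatcherID),
		zap.Uint64("redoGlobalTs", redoGlobalTs),
		zap.Uint64("firstBlockedCommitTs", firstBlocked),
		zap.Int("eligibleEventsSkipped", eligible),
		zap.Bool("mixedBatch", eligible > 0),
		zap.Int("batchLength", len(commitTsList)))
}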