redo gating caches mixed DML batches and leaves stale rows after failover in fail_over_ddl_mix_random_delay #4390

@3AceShowHand

Description

What did you do?

Observed a CI failure in the integration test fail_over_ddl_mix_random_delay.

Failed CI artifact:
https://prow.tidb.net/jenkins/job/pingcap/job/ticdc/job/pull_cdc_mysql_integration_heavy/660/artifact/log-G13.tar.gz

I analyzed the extracted logs under:

  • /tmp/ci-logs-failover-ddl-mix/tmp/tidb_cdc_test/fail_over_ddl_mix_random_delay

The key observations are:

  1. The failure is not caused by a partial MySQL sink commit or an incorrect sink ACK.
  2. The source update events for the mismatched rows definitely exist.
  3. The default dispatcher for test.table_2 does receive those events after failover.
  4. The failure happens because redo gating is applied to the whole dispatcherEvents batch instead of splitting the batch by redoGlobalTs.

Detailed analysis:

1. Failure symptom

sync_diff reports a mismatch only on test.table_2, and the row count is still equal on both sides (439 vs 439), so this is a stale-value problem rather than missing rows.

The mismatched rows are:

  • id=493
  • id=495
  • id=499
  • id=501
  • id=503
  • id=505

For all of them:

  • upstream value: complex_1772943173_21640_2992
  • downstream value: still the old insert_* value

Relevant evidence:

  • sync_diff/output/sync_diff.log:118
  • sync_diff/output/sync_diff.log:120
  • sync_diff/output/sync_diff.log:122
  • sync_diff/output/sync_diff.log:124
  • sync_diff/output/sync_diff.log:126
  • sync_diff/output/sync_diff.log:128
  • sync_diff/output/sync_diff.log:130

2. The source update event definitely exists

The update that changes those rows to complex_1772943173_21640_2992 is present in the log:

  • cdc0-restart-9.log:9209

That DMLEvent is for:

  • dispatcher: 63625024539555550195872247653604682627
  • table: test.table_2
  • commitTs=464766415209758732

It explicitly contains these row updates:

  • 493: insert_1772943170_30594_2960 -> complex_1772943173_21640_2992
  • 495: insert_1772943170_7579_2962 -> complex_1772943173_21640_2992
  • 499: insert_1772943170_15778_2966 -> complex_1772943173_21640_2992
  • 501: insert_1772943170_29650_2969 -> complex_1772943173_21640_2992
  • 503: insert_1772943171_19699_2977 -> complex_1772943173_21640_2992
  • 505: insert_1772943172_17582_2979 -> complex_1772943173_21640_2992

So this is not an upstream event generation problem.

3. The default dispatcher receives the event after failover

For this case, mode=0 is the default dispatcher and mode=1 is the redo dispatcher (pkg/common/types.go:334-345).

The dispatcher that actually matters for downstream MySQL consistency is the default dispatcher:

  • 63625024539555550195872247653604682627

After failover, this dispatcher is reset and starts replaying from:

  • cdc1-restart-8.log:20455

It then receives the critical event sequence around the failing update:

  • cdc1-restart-8.log:20814

That segment shows the dispatcher receiving:

  • seq 20 -> commitTs=464766415209758732
  • seq 22 -> commitTs=464766415275032595
  • seq 28 -> commitTs=464766415485009927
  • seq 30 -> commitTs=464766415576498198
  • seq 31 -> commitTs=464766415642034197

So this is not a case where failover dropped the event before it reached the new dispatcher.

4. The actual failure point: the whole mixed batch is cached

The key log is:

  • cdc1-restart-8.log:20958

It shows:

  • dispatcher: 63625024539555550195872247653604682627
  • dispatcherResolvedTs=464766414816542768
  • length=152
  • last event commitTs=464766420767211526
  • redoGlobalTs=464766417673650203

Current code:

  • downstreamadapter/dispatcher/event_dispatcher.go:135-140
// Note: only the commitTs of the last event in the batch is compared
// against redoGlobalTs; if it is ahead, the entire batch is cached.
if d.redoEnable && len(dispatcherEvents) > 0 &&
    d.redoGlobalTs.Load() < dispatcherEvents[len(dispatcherEvents)-1].Event.GetCommitTs() {
    d.cache(dispatcherEvents, wakeCallback)
    return true
}

This logic only checks the last event in the batch.

That means:

  • if the last event in the batch is newer than redoGlobalTs,
  • the whole batch is cached,
  • even if the front part of the batch already has commitTs <= redoGlobalTs and is safe to process.

In this failure, the batch is clearly mixed:

  • it already contains commitTs=464766415209758732
  • but the last event in the same batch is 464766420767211526
  • and redoGlobalTs is 464766417673650203

So the update at 464766415209758732 should have been eligible, but it was cached together with later ineligible events.
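
Plugging the batch's own numbers into the current gating condition makes this concrete (a standalone illustration, not code from the repo):

package main

import "fmt"

func main() {
    // Timestamps from cdc1-restart-8.log:20958.
    redoGlobalTs := uint64(464766417673650203)
    firstCommitTs := uint64(464766415209758732) // the failing update
    lastCommitTs := uint64(464766420767211526)  // last event in the batch

    // The current condition inspects only the last event, so the
    // whole 152-event batch is cached...
    fmt.Println(redoGlobalTs < lastCommitTs) // true
    // ...even though the front of the batch was already eligible.
    fmt.Println(firstCommitTs <= redoGlobalTs) // true
}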

5. Why we can tell the update never flushed to MySQL

The dispatcher exits with:

  • cdc1-restart-8.log:22610

Values:

  • checkpointTs=464766414921138204
  • resolvedTs=464766414816542768

The important point is:

  • checkpointTs is still smaller than the failing update commitTs 464766415209758732

This means the default dispatcher never advanced its flushed progress past that update.

That is consistent with the dispatcher progress semantics:

  • inflight DML is tracked in downstreamadapter/dispatcher/table_progress.go:96-112
  • checkpoint is derived from the earliest unflushed event in downstreamadapter/dispatcher/table_progress.go:174-185
  • BasicDispatcher.GetCheckpointTs() reads from that structure in downstreamadapter/dispatcher/basic_dispatcher.go:511-522
  • PostFlush() is what removes an event from inflight progress in pkg/common/event/dml_event.go:643-654
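
As a simplified model of those semantics (stand-in names, not the actual table_progress.go code), the checkpoint cannot pass an event that was never flushed:

package main

import "fmt"

// tableProgress is a simplified stand-in for the inflight tracking in
// table_progress.go: the checkpoint follows the earliest unflushed event.
type tableProgress struct {
    inflight   []uint64 // commitTs of unflushed DML events, in order
    resolvedTs uint64
}

func (p *tableProgress) addInflight(commitTs uint64) {
    p.inflight = append(p.inflight, commitTs)
}

// postFlush removes the earliest event, mirroring event.PostFlush().
func (p *tableProgress) postFlush() {
    p.inflight = p.inflight[1:]
}

// checkpointTs is derived from the earliest unflushed event; with an
// empty inflight set it follows resolvedTs. (The exact derivation in
// table_progress.go:174-185 may differ in detail.)
func (p *tableProgress) checkpointTs() uint64 {
    if len(p.inflight) > 0 {
        return p.inflight[0] - 1
    }
    return p.resolvedTs
}

func main() {
    // In this failure the update was cached before entering the sink,
    // so it never joined the inflight set and never hit postFlush; the
    // dispatcher checkpoint therefore stays behind its commitTs.
    p := &tableProgress{resolvedTs: 464766414816542768}
    fmt.Println(p.checkpointTs() < 464766415209758732) // true
}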

6. Why this is not a partial MySQL sink commit or a false ACK

This path can be ruled out from the code:

  • pkg/sink/mysql/mysql_writer.go:208-255
  • pkg/sink/mysql/mysql_writer_dml_exec.go:48-90
  • pkg/sink/mysql/mysql_writer_dml_exec.go:178-215

Facts:

  1. Writer.Flush() only calls event.PostFlush() after execDMLWithMaxRetries() succeeds.
  2. The sequential execution path runs in a SQL transaction (BeginTx + Commit).
  3. The multi-statement path also wraps SQL in BEGIN; ...; COMMIT;, and rolls back on error.

So there is no normal path where:

  • some SQLs in the batch succeed,
  • the whole batch is marked flushed,
  • and checkpoint advances incorrectly.

The failure is earlier: eligible DMLs were never processed because the whole replay batch was cached.
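
A condensed model of that contract (illustrative shape only; the real logic is in the mysql_writer files cited above):

package sinkmodel

import (
    "context"
    "database/sql"
)

// flushBatch models the contract ruled out above: every statement runs
// inside one transaction, and postFlush (which lets the checkpoint
// advance) is only invoked after Commit succeeds. There is no path
// where a partially applied batch is marked flushed.
func flushBatch(ctx context.Context, db *sql.DB, stmts []string, postFlush func()) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    for _, s := range stmts {
        if _, err := tx.ExecContext(ctx, s); err != nil {
            _ = tx.Rollback() // partial work rolled back, nothing marked flushed
            return err
        }
    }
    if err := tx.Commit(); err != nil {
        return err
    }
    postFlush() // only now does dispatcher progress move past these events
    return nil
}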

7. Important nuance: redo dispatcher is ahead, but the default dispatcher is not

The logs also show that redo is further ahead:

  • cdc1-restart-8.log:20354
  • cdc1-restart-8.log:22577

For example, redo metadata reaches:

  • resolvedTs=464766417673650203

and the redo dispatcher for the same table exits at:

  • checkpointTs=464766420898021376

But that does not mean downstream MySQL is correct.

The stale rows are explained by the mode=0 default dispatcher still being blocked behind coarse redo gating.

What did you expect to see?

Even when redo is enabled, the default dispatcher should process all events in a batch that have commitTs <= redoGlobalTs, and delay only the tail portion that is still ahead of redo progress.

For this test, test.table_2 should have converged to the upstream state, and the update at commitTs=464766415209758732 should have reached downstream MySQL.

What did you see instead?

The whole replay batch was cached because its last event was newer than redoGlobalTs.

As a result:

  • the eligible update at commitTs=464766415209758732 did not flush to MySQL,
  • six rows in test.table_2 remained at their older insert_* values,
  • sync_diff failed with stale values.

Versions of the cluster

Upstream TiDB cluster version (from CI logs):

Release Version: v9.0.0-beta.2.pre-1314-g999e8f4
Git Commit Hash: 999e8f4ce9bae23c720fcfb4a5f43d7bae15c3e8
Store: tikv

Upstream TiKV version:

Not captured directly in the CI artifact. The TiDB version output reports Store: tikv.

TiCDC version (from cdc1-restart-8.log:3 / cdc0-restart-9.log:3):

release-version=v8.5.4-nextgen.202510.5-109-gc86a33f5
git-hash=c86a33f5ab730444554920554847ac374e450633
git-branch=HEAD
go-version=go1.25.5

Suggested fix

Minimal fix

Change redo gating in downstreamadapter/dispatcher/event_dispatcher.go to split a mixed dispatcherEvents batch at the first event whose commitTs > redoGlobalTs.

Expected behavior:

  • process the prefix where commitTs <= redoGlobalTs
  • cache only the suffix where commitTs > redoGlobalTs

This should preserve ordering while avoiding starvation of already-eligible DMLs.
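
A minimal, self-contained sketch of the split, using stand-in types rather than the real DispatcherEvent (the actual change would operate on the dispatcherEvents slice inside event_dispatcher.go):

package main

import (
    "fmt"
    "sort"
)

// Stand-ins for the real event types; the actual elements expose
// Event.GetCommitTs() as in the snippet quoted above.
type event struct{ commitTs uint64 }

func (e event) GetCommitTs() uint64 { return e.commitTs }

type dispatcherEvent struct{ Event event }

// splitByRedoGlobalTs splits a commitTs-ordered batch at the first
// event ahead of redo progress: the prefix is safe to process now,
// the suffix is what redo gating should cache.
func splitByRedoGlobalTs(events []dispatcherEvent, redoGlobalTs uint64) (eligible, blocked []dispatcherEvent) {
    idx := sort.Search(len(events), func(i int) bool {
        return events[i].Event.GetCommitTs() > redoGlobalTs
    })
    return events[:idx], events[idx:]
}

func main() {
    // The mixed batch from this failure, reduced to three events.
    batch := []dispatcherEvent{
        {Event: event{464766415209758732}}, // the stale update: eligible
        {Event: event{464766415275032595}}, // eligible
        {Event: event{464766420767211526}}, // still ahead of redo: cache
    }
    eligible, blocked := splitByRedoGlobalTs(batch, 464766417673650203)
    fmt.Println(len(eligible), len(blocked)) // 2 1
}

The sketch only shows the boundary computation; how wakeCallback should interact with a partially cached batch needs care in the real patch.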

Safer follow-up

Add targeted tests for:

  1. redo-enabled replay with a mixed batch containing both eligible and ineligible DML events (a sketch follows this list)
  2. failover/replay of a default dispatcher where the replay batch crosses redoGlobalTs
  3. regression coverage for stale-row outcomes similar to fail_over_ddl_mix_random_delay
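
A sketch of test (1), assuming it lives alongside the stand-in types and splitByRedoGlobalTs helper from the fix sketch above (plus the standard testing import):

func TestRedoGatingSplitsMixedBatch(t *testing.T) {
    batch := []dispatcherEvent{
        {Event: event{commitTs: 100}}, // eligible
        {Event: event{commitTs: 200}}, // eligible: equal to redoGlobalTs
        {Event: event{commitTs: 300}}, // must remain cached
    }
    eligible, blocked := splitByRedoGlobalTs(batch, 200)
    if len(eligible) != 2 || len(blocked) != 1 {
        t.Fatalf("expected 2 eligible / 1 blocked, got %d / %d", len(eligible), len(blocked))
    }
}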

Optional observability improvement

When caching due to redo gating, log:

  • the first blocked commitTs
  • the number of eligible events skipped in the same batch
  • whether the batch is mixed or fully blocked
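
For illustration, that log could take a shape like the following, assuming the zap-style structured logging TiCDC already uses and the eligible/blocked split from the fix sketch (field names are suggestions, not existing code):

// Hypothetical helper, called only when blocked is non-empty;
// eligible/blocked come from a split such as splitByRedoGlobalTs
// above, and logger is a *zap.Logger (import go.uber.org/zap).
func logGatingDecision(logger *zap.Logger, eligible, blocked []dispatcherEvent) {
    logger.Info("redo gating cached dispatcher events",
        zap.Uint64("firstBlockedCommitTs", blocked[0].Event.GetCommitTs()),
        zap.Int("eligibleSkippedInBatch", len(eligible)),
        zap.Bool("mixedBatch", len(eligible) > 0))
}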

That would make future diagnosis much easier.
