redo gating caches mixed DML batches and leaves stale rows after failover in fail_over_ddl_mix_random_delay #4390

@3AceShowHand

Description

What did you do?

Observed a CI failure in the integration test fail_over_ddl_mix_random_delay.

Failed CI artifact:
https://prow.tidb.net/jenkins/job/pingcap/job/ticdc/job/pull_cdc_mysql_integration_heavy/660/artifact/log-G13.tar.gz

I analyzed the extracted logs under:

  • /tmp/ci-logs-failover-ddl-mix/tmp/tidb_cdc_test/fail_over_ddl_mix_random_delay

The key observations are:

  1. The failure is not caused by a partial MySQL sink commit or an incorrect sink ACK.
  2. The source update events for the mismatched rows definitely exist.
  3. The default dispatcher for test.table_2 does receive those events after failover.
  4. The failure happens because redo gating is applied to the whole dispatcherEvents batch instead of splitting the batch by redoGlobalTs.

Detailed analysis:

1. Failure symptom

sync_diff reports a mismatch only on test.table_2, and the row count is still equal on both sides (439 vs 439), so this is a stale-value problem rather than missing rows.

The mismatched rows are:

  • id=493
  • id=495
  • id=499
  • id=501
  • id=503
  • id=505

For all of them:

  • upstream value: complex_1772943173_21640_2992
  • downstream value: still the old insert_* value

Relevant evidence:

  • sync_diff/output/sync_diff.log:118
  • sync_diff/output/sync_diff.log:120
  • sync_diff/output/sync_diff.log:122
  • sync_diff/output/sync_diff.log:124
  • sync_diff/output/sync_diff.log:126
  • sync_diff/output/sync_diff.log:128
  • sync_diff/output/sync_diff.log:130

2. The source update event definitely exists

The update that changes those rows to complex_1772943173_21640_2992 is present in the log:

  • cdc0-restart-9.log:9209

That DMLEvent is for:

  • dispatcher: 63625024539555550195872247653604682627
  • table: test.table_2
  • commitTs=464766415209758732

It explicitly contains these row updates:

  • 493: insert_1772943170_30594_2960 -> complex_1772943173_21640_2992
  • 495: insert_1772943170_7579_2962 -> complex_1772943173_21640_2992
  • 499: insert_1772943170_15778_2966 -> complex_1772943173_21640_2992
  • 501: insert_1772943170_29650_2969 -> complex_1772943173_21640_2992
  • 503: insert_1772943171_19699_2977 -> complex_1772943173_21640_2992
  • 505: insert_1772943172_17582_2979 -> complex_1772943173_21640_2992

So this is not an upstream event generation problem.

3. The default dispatcher receives the event after failover

For this case, mode=0 is the default dispatcher and mode=1 is the redo dispatcher (pkg/common/types.go:334-345).

The dispatcher that actually matters for downstream MySQL consistency is the default dispatcher:

  • 63625024539555550195872247653604682627

After failover, this dispatcher is reset and starts replaying from:

  • cdc1-restart-8.log:20455

It then receives the critical event sequence around the failing update:

  • cdc1-restart-8.log:20814

That segment shows the dispatcher receiving:

  • seq 20 -> commitTs=464766415209758732
  • seq 22 -> commitTs=464766415275032595
  • seq 28 -> commitTs=464766415485009927
  • seq 30 -> commitTs=464766415576498198
  • seq 31 -> commitTs=464766415642034197

So this is not a case where failover dropped the event before it reached the new dispatcher.

4. The actual failure point: the whole mixed batch is cached

The key log is:

  • cdc1-restart-8.log:20958

It shows:

  • dispatcher: 63625024539555550195872247653604682627
  • dispatcherResolvedTs=464766414816542768
  • length=152
  • last event commitTs=464766420767211526
  • redoGlobalTs=464766417673650203

Current code:

  • downstreamadapter/dispatcher/event_dispatcher.go:135-140
// Note: only the commitTs of the last event in the batch is compared
// against redoGlobalTs; if it is ahead, the entire batch is cached.
if d.redoEnable && len(dispatcherEvents) > 0 &&
    d.redoGlobalTs.Load() < dispatcherEvents[len(dispatcherEvents)-1].Event.GetCommitTs() {
    d.cache(dispatcherEvents, wakeCallback)
    return true
}

This logic only checks the last event in the batch.

That means:

  • if the last event in the batch is newer than redoGlobalTs,
  • the whole batch is cached,
  • even if the front part of the batch already has commitTs <= redoGlobalTs and is safe to process.

In this failure, the batch is clearly mixed:

  • it already contains commitTs=464766415209758732
  • but the last event in the same batch is 464766420767211526
  • and redoGlobalTs is 464766417673650203

So the update at 464766415209758732 should have been eligible, but it was cached together with later ineligible events.
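
Plugging the batch's own numbers into the current gating condition makes this concrete (a standalone illustration, not code from the repo):

package main

import "fmt"

func main() {
    // Timestamps from cdc1-restart-8.log:20958.
    redoGlobalTs := uint64(464766417673650203)
    firstCommitTs := uint64(464766415209758732) // the failing update
    lastCommitTs := uint64(464766420767211526)  // last event in the batch

    // The current condition inspects only the last event, so the
    // whole 152-event batch is cached...
    fmt.Println(redoGlobalTs < lastCommitTs) // true
    // ...even though the front of the batch was already eligible.
    fmt.Println(firstCommitTs <= redoGlobalTs) // true
}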

5. Why we can tell the update never flushed to MySQL

The dispatcher exits with:

  • cdc1-restart-8.log:22610

Values:

  • checkpointTs=464766414921138204
  • resolvedTs=464766414816542768

The important point is:

  • checkpointTs is still smaller than the failing update commitTs 464766415209758732

This means the default dispatcher never advanced its flushed progress past that update.

That is consistent with the dispatcher progress semantics:

  • inflight DML is tracked in downstreamadapter/dispatcher/table_progress.go:96-112
  • checkpoint is derived from the earliest unflushed event in downstreamadapter/dispatcher/table_progress.go:174-185
  • BasicDispatcher.GetCheckpointTs() reads from that structure in downstreamadapter/dispatcher/basic_dispatcher.go:511-522
  • PostFlush() is what removes an event from inflight progress in pkg/common/event/dml_event.go:643-654
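
As a simplified model of those semantics (stand-in names, not the actual table_progress.go code), the checkpoint cannot pass an event that was never flushed:

package main

import "fmt"

// tableProgress is a simplified stand-in for the inflight tracking in
// table_progress.go: the checkpoint follows the earliest unflushed event.
type tableProgress struct {
    inflight   []uint64 // commitTs of unflushed DML events, in order
    resolvedTs uint64
}

func (p *tableProgress) addInflight(commitTs uint64) {
    p.inflight = append(p.inflight, commitTs)
}

// postFlush removes the earliest event, mirroring event.PostFlush().
func (p *tableProgress) postFlush() {
    p.inflight = p.inflight[1:]
}

// checkpointTs is derived from the earliest unflushed event; with an
// empty inflight set it follows resolvedTs. (The exact derivation in
// table_progress.go:174-185 may differ in detail.)
func (p *tableProgress) checkpointTs() uint64 {
    if len(p.inflight) > 0 {
        return p.inflight[0] - 1
    }
    return p.resolvedTs
}

func main() {
    // In this failure the update was cached before entering the sink,
    // so it never joined the inflight set and never hit postFlush; the
    // dispatcher checkpoint therefore stays behind its commitTs.
    p := &tableProgress{resolvedTs: 464766414816542768}
    fmt.Println(p.checkpointTs() < 464766415209758732) // true
}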

6. Why this is not a partial MySQL sink commit or a false ACK

This path can be ruled out from the code:

  • pkg/sink/mysql/mysql_writer.go:208-255
  • pkg/sink/mysql/mysql_writer_dml_exec.go:48-90
  • pkg/sink/mysql/mysql_writer_dml_exec.go:178-215

Facts:

  1. Writer.Flush() only calls event.PostFlush() after execDMLWithMaxRetries() succeeds.
  2. The sequential execution path runs in a SQL transaction (BeginTx + Commit).
  3. The multi-statement path also wraps SQL in BEGIN; ...; COMMIT;, and rolls back on error.

So there is no normal path where:

  • some SQLs in the batch succeed,
  • the whole batch is marked flushed,
  • and checkpoint advances incorrectly.

The failure is earlier: eligible DMLs were never processed because the whole replay batch was cached.
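
A condensed model of that contract (illustrative shape only; the real logic is in the mysql_writer files cited above):

package sinkmodel

import (
    "context"
    "database/sql"
)

// flushBatch models the contract ruled out above: every statement runs
// inside one transaction, and postFlush (which lets the checkpoint
// advance) is only invoked after Commit succeeds. There is no path
// where a partially applied batch is marked flushed.
func flushBatch(ctx context.Context, db *sql.DB, stmts []string, postFlush func()) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    for _, s := range stmts {
        if _, err := tx.ExecContext(ctx, s); err != nil {
            _ = tx.Rollback() // partial work rolled back, nothing marked flushed
            return err
        }
    }
    if err := tx.Commit(); err != nil {
        return err
    }
    postFlush() // only now does dispatcher progress move past these events
    return nil
}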

7. Important nuance: redo dispatcher is ahead, but the default dispatcher is not

The logs also show that redo is further ahead:

  • cdc1-restart-8.log:20354
  • cdc1-restart-8.log:22577

For example, redo metadata reaches:

  • resolvedTs=464766417673650203

and the redo dispatcher for the same table exits at:

  • checkpointTs=464766420898021376

But that does not mean downstream MySQL is correct.

The stale rows are explained by the mode=0 default dispatcher still being blocked behind coarse redo gating.

What did you expect to see?

Even when redo is enabled, the default dispatcher should process all events in a batch that have commitTs <= redoGlobalTs, and delay only the tail portion that is still ahead of redo progress.

For this test, test.table_2 should have converged to the upstream state, and the update at commitTs=464766415209758732 should have reached downstream MySQL.

What did you see instead?

The whole replay batch was cached because its last event was newer than redoGlobalTs.

As a result:

  • the eligible update at commitTs=464766415209758732 did not flush to MySQL,
  • six rows in test.table_2 remained at their older insert_* values,
  • sync_diff failed with stale values.

Versions of the cluster

Upstream TiDB cluster version (from CI logs):

Release Version: v9.0.0-beta.2.pre-1314-g999e8f4
Git Commit Hash: 999e8f4ce9bae23c720fcfb4a5f43d7bae15c3e8
Store: tikv

Upstream TiKV version:

Not captured directly in the CI artifact. The TiDB version output reports Store: tikv.

TiCDC version (from cdc1-restart-8.log:3 / cdc0-restart-9.log:3):

release-version=v8.5.4-nextgen.202510.5-109-gc86a33f5
git-hash=c86a33f5ab730444554920554847ac374e450633
git-branch=HEAD
go-version=go1.25.5

Suggested fix

Minimal fix

Change redo gating in downstreamadapter/dispatcher/event_dispatcher.go to split a mixed dispatcherEvents batch at the first event whose commitTs > redoGlobalTs.

Expected behavior:

  • process the prefix where commitTs <= redoGlobalTs
  • cache only the suffix where commitTs > redoGlobalTs

This should preserve ordering while avoiding starvation of already-eligible DMLs.
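
A minimal, self-contained sketch of the split, using stand-in types rather than the real DispatcherEvent (the actual change would operate on the dispatcherEvents slice inside event_dispatcher.go):

package main

import (
    "fmt"
    "sort"
)

// Stand-ins for the real event types; the actual elements expose
// Event.GetCommitTs() as in the snippet quoted above.
type event struct{ commitTs uint64 }

func (e event) GetCommitTs() uint64 { return e.commitTs }

type dispatcherEvent struct{ Event event }

// splitByRedoGlobalTs splits a commitTs-ordered batch at the first
// event ahead of redo progress: the prefix is safe to process now,
// the suffix is what redo gating should cache.
func splitByRedoGlobalTs(events []dispatcherEvent, redoGlobalTs uint64) (eligible, blocked []dispatcherEvent) {
    idx := sort.Search(len(events), func(i int) bool {
        return events[i].Event.GetCommitTs() > redoGlobalTs
    })
    return events[:idx], events[idx:]
}

func main() {
    // The mixed batch from this failure, reduced to three events.
    batch := []dispatcherEvent{
        {Event: event{464766415209758732}}, // the stale update: eligible
        {Event: event{464766415275032595}}, // eligible
        {Event: event{464766420767211526}}, // still ahead of redo: cache
    }
    eligible, blocked := splitByRedoGlobalTs(batch, 464766417673650203)
    fmt.Println(len(eligible), len(blocked)) // 2 1
}

The sketch only shows the boundary computation; how wakeCallback should interact with a partially cached batch needs care in the real patch.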

Safer follow-up

Add targeted tests for:

  1. redo-enabled replay with a mixed batch containing both eligible and ineligible DML events (a sketch follows this list)
  2. failover/replay of a default dispatcher where the replay batch crosses redoGlobalTs
  3. regression coverage for stale-row outcomes similar to fail_over_ddl_mix_random_delay
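
A sketch of test (1), assuming it lives alongside the stand-in types and splitByRedoGlobalTs helper from the fix sketch above (plus the standard testing import):

func TestRedoGatingSplitsMixedBatch(t *testing.T) {
    batch := []dispatcherEvent{
        {Event: event{commitTs: 100}}, // eligible
        {Event: event{commitTs: 200}}, // eligible: equal to redoGlobalTs
        {Event: event{commitTs: 300}}, // must remain cached
    }
    eligible, blocked := splitByRedoGlobalTs(batch, 200)
    if len(eligible) != 2 || len(blocked) != 1 {
        t.Fatalf("expected 2 eligible / 1 blocked, got %d / %d", len(eligible), len(blocked))
    }
}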

Optional observability improvement

When caching due to redo gating, log:

  • the first blocked commitTs
  • the number of eligible events skipped in the same batch
  • whether the batch is mixed or fully blocked
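
For illustration, that log could take a shape like the following, assuming the zap-style structured logging TiCDC already uses and the eligible/blocked split from the fix sketch (field names are suggestions, not existing code):

// Hypothetical helper, called only when blocked is non-empty;
// eligible/blocked come from a split such as splitByRedoGlobalTs
// above, and logger is a *zap.Logger (import go.uber.org/zap).
func logGatingDecision(logger *zap.Logger, eligible, blocked []dispatcherEvent) {
    logger.Info("redo gating cached dispatcher events",
        zap.Uint64("firstBlockedCommitTs", blocked[0].Event.GetCommitTs()),
        zap.Int("eligibleSkippedInBatch", len(eligible)),
        zap.Bool("mixedBatch", len(eligible) > 0))
}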

That would make future diagnosis much easier.
