
[Bug] When a master node switchover occurs on the Doris FE, CCR restarts fullsync #63269

@dzmxcyr

Description


Search before asking

  • I had searched in the issues and found no similar issues.

Version

Source Doris: 2.1.11-x64
Target Doris: 2.1.11-arm64
CCR: ccr-syncer-3.0.6-rc05-arm64

What's Wrong?

After a database-level CCR replication job is created, CCR completes the full synchronization normally and then enters the incremental synchronization phase, where everything works properly.
However, when the upstream (source) Doris FE master role switches over to another FE node, CCR triggers a new fullsync and copies all data again from scratch.
Because the data volume is large, this fullsync takes a long time and has a significant impact on the production environment.

CCR syncer log:
[2026-05-14 09:33:51.786] WARN call [:0] error: GetBinlog error: remote or network error: get connection error: dial tcp :0: connection has been closed by peer, req: TGetBinlogRequest({Cluster: User:0x40001a8378 Passwd:0x40001a8388 Db:0x40001a83a8 Table: TableId: UserIp: Token: PrevCommitSeq:0x400082e928 NumAcquired:0x400082e930}): [rpc] remote or network error: get connection error: dial tcp :0: connection has been closed by peer, try next addr job=CCR_PROD_ZHBB line=rpc/fe.go:259
...
[2026-05-14 09:33:52.149] WARN job sync failed, job: CCR_PROD_DW, err: [meta] index ids is empty
...
[2026-05-14 09:33:53.597] INFO fullsync status: create snapshot with prefix ccrs_CCR_PROD_DW_1778668141 job=CCR_PROD_DW line=ccr/job.go:973
[2026-05-14 09:33:53.694] INFO fullsync status: create snapshot ccrs_CCR_PROD_DW_1778668141_1778722433 job=CCR_PROD_DW line=ccr/job.go:1019
[2026-05-14 09:33:53.694] INFO create snapshot PROD_DW.ccrs_CCR_PROD_DW_1778668141_1778722433, backup snapshot sql: BACKUP SNAPSHOT PROD_DW.ccrs_CCR_PROD_DW_1778668141_1778722433 TO keep_on_local PROPERTIES ("type" = "full") job=CCR_PROD_DW line=base/spec.go:771

What You Expected?

CCR keeps running normally in the incremental synchronization phase after the Doris FE master fails over to another node, without triggering a new fullsync.

How to Reproduce?

While database-level CCR synchronization is running and the upstream cluster is receiving continuous writes to a large number of tables, take the FE master node down so that the master role switches over to another FE. CCR then triggers a fullsync again.

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
