
feat(replication-restore): operator-side orchestration for cross-region replication restore (PR 1/2)#6861

Open
RidRisR wants to merge 28 commits into pingcap:master from RidRisR:docs/replication-restore-design

Conversation

Contributor

RidRisR commented Apr 27, 2026

What problem does this PR solve?

Adds operator-side orchestration for replication restore, a cross-region disaster-recovery flow that defers log compaction from upstream to the downstream PITR phase, avoiding a double cross-region transfer of log data. The sharded `CompactBackup` mode shipped earlier provides parallel sharded compaction; this PR is the restore-side companion that gates BR phase-2 (log restore) on the CompactBackup reaching a terminal state.

This is PR 1 of 2. PR 2 (backup-manager CLI + BR --replication-storage-phase wiring) will follow once internal BR is ready.

Spec: docs/superpowers/specs/2026-04-23-replication-restore-design.md
Plan: docs/superpowers/plans/2026-04-23-replication-restore-pr1.md

What is changed and how does it work?

The user triggers a replication restore by setting `spec.mode: pitr` and a new `spec.replicationConfig` field on a Restore CR (referencing a CompactBackup CR by name). The controller dispatches a dedicated handler (`pkg/backup/restore/replication_handler.go`) that drives a two-phase state machine keyed on a new `status.replicationStep` string field (see the dispatch sketch after this list):

  • "" / Scheduled → cross-CR consistency check → create phase-1 BR Job (snapshot restore) → + "status.phase = SnapshotRestore" + , + "replicationStep = snapshot-restore" +
  • snapshot-restore → observe phase-1 Job + CompactBackup → write + "SnapshotRestored" + and + "CompactSettled" + markers (without touching Phase) → when both markers True, transition to + "LogRestore" + and create phase-2 Job
  • log-restore → backup-manager drives + "Running/Complete/Failed" + via the standard updater (PR 2 wires the actual + "--replicationPhase=2" + arg)
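
For orientation, a minimal sketch of the `replicationStep`-keyed dispatch. The types are simplified stand-ins, not the operator's API; `syncInitial` and `syncSnapshotRestore` appear in the diff, while `syncLogRestore` and the terminal-phase strings are assumed here.

```go
package main

import "fmt"

// Sketch only: simplified stand-ins for the Restore CR fields named above
// (status.phase, status.replicationStep); not the operator's real API types.
type Restore struct {
	Phase           string // e.g. SnapshotRestore, LogRestore, Complete, Failed
	ReplicationStep string // "", "snapshot-restore", "log-restore"
}

type replicationHandler struct{}

// Sync dispatches on ReplicationStep; terminal restores are no-ops.
func (h *replicationHandler) Sync(r *Restore) error {
	switch r.Phase {
	case "Complete", "Failed":
		return nil // terminal: nothing to do
	}
	switch r.ReplicationStep {
	case "":
		return h.syncInitial(r) // consistency check, create phase-1 Job
	case "snapshot-restore":
		return h.syncSnapshotRestore(r) // observe Job + CompactBackup, gate on markers
	case "log-restore":
		return h.syncLogRestore(r) // only guard against Job-level failures
	default:
		return fmt.Errorf("unknown replicationStep %q", r.ReplicationStep)
	}
}

func (h *replicationHandler) syncInitial(r *Restore) error         { return nil }
func (h *replicationHandler) syncSnapshotRestore(r *Restore) error { return nil }
func (h *replicationHandler) syncLogRestore(r *Restore) error      { return nil }

func main() {
	h := &replicationHandler{}
	fmt.Println(h.Sync(&Restore{}))
}
```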

Status write ownership uses Option B: the controller is the sole writer of `status.phase` during phase-1; backup-manager status writes are no-op'd via a wrapper (`NewReplicationRestoreStatusUpdater`, activated by PR 2). This eliminates a v2-design hole where backup-manager's `Running` write would clobber the controller-set Phase. The pattern mirrors `ShardedCompactStatusUpdater` from the sharded CompactBackup PR.
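
A minimal sketch of the Option B wrapper, assuming the single-method `Update(restore, condition, newStatus)` interface shape noted in the spec updates below; the parameter types here are placeholders, not the real ones.

```go
package backup

// Placeholder types standing in for the operator's Restore status types.
type Restore struct{}
type RestoreCondition struct{}
type RestoreUpdateStatus struct{}

type RestoreConditionUpdaterInterface interface {
	Update(restore *Restore, condition *RestoreCondition, newStatus *RestoreUpdateStatus) error
}

// replicationRestoreStatusUpdater swallows every status write: during
// phase-1 the controller is the sole writer of status.phase, so
// backup-manager's Running write can never clobber SnapshotRestore.
type replicationRestoreStatusUpdater struct {
	delegate RestoreConditionUpdaterInterface // real updater, unused in phase-1
}

func NewReplicationRestoreStatusUpdater(delegate RestoreConditionUpdaterInterface) RestoreConditionUpdaterInterface {
	return &replicationRestoreStatusUpdater{delegate: delegate}
}

func (u *replicationRestoreStatusUpdater) Update(*Restore, *RestoreCondition, *RestoreUpdateStatus) error {
	return nil // dropped on purpose; phase-2 uses the unwrapped updater
}
```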

Markers ( + "SnapshotRestored" + / + "CompactSettled" + ) bypass + "UpdateRestoreCondition" + (which unconditionally writes + "status.Phase = condition.Type" + ) via a private + "appendRestoreMarker" + helper. + "UpdateRestoreCondition" + itself is untouched.

A CompactBackup informer is wired into the Restore controller so that CompactBackup state changes wake referencing Restores (reverse lookup by `spec.replicationConfig.compactBackupName`).
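
A sketch of the reverse lookup the event handler performs, with placeholder types; in the controller the candidate Restores come from a namespace-scoped lister and the resulting keys go onto the workqueue.

```go
package controller

// Simplified stand-ins for the CRs involved in the reverse lookup.
type Restore struct {
	Namespace, Name   string
	CompactBackupName string // spec.replicationConfig.compactBackupName ("" if unset)
}

type CompactBackup struct {
	Namespace, Name string
}

// restoresToEnqueue returns namespace/name keys of Restores referencing the
// changed CompactBackup, so a late-binding Restore wakes as soon as its
// CompactBackup appears or reaches a terminal state.
func restoresToEnqueue(cb *CompactBackup, candidates []*Restore) []string {
	var keys []string
	for _, r := range candidates {
		if r.Namespace == cb.Namespace && r.CompactBackupName == cb.Name {
			keys = append(keys, r.Namespace+"/"+r.Name)
		}
	}
	return keys
}
```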

Backward compatibility

Existing PiTR / snapshot / volume-snapshot users are completely unaffected. The interception in `syncRestoreJob` triggers only when `spec.mode == pitr && spec.replicationConfig != nil`; a nil config falls through to the original code path. The new CRD fields are all optional, and their zero values preserve the old behavior.
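
A sketch of the interception guard under simplified types (the legacy path name below is illustrative):

```go
package restore

// Simplified stand-ins for the Restore spec fields involved in routing.
type ReplicationConfig struct{ CompactBackupName string }

type Restore struct {
	Mode              string // spec.mode
	ReplicationConfig *ReplicationConfig
}

type replicationHandler struct{}

func (h *replicationHandler) Sync(r *Restore) error { return nil }

type restoreManager struct {
	replicationHandler *replicationHandler
}

// syncRestoreJob routes only mode==pitr with a non-nil replicationConfig to
// the replication handler; everything else takes the pre-existing path.
func (rm *restoreManager) syncRestoreJob(r *Restore) error {
	if r.Mode == "pitr" && r.ReplicationConfig != nil {
		return rm.replicationHandler.Sync(r) // two-phase replication restore
	}
	// nil config / other modes: original single-Job machinery, unchanged
	return rm.syncLegacyRestoreJob(r)
}

func (rm *restoreManager) syncLegacyRestoreJob(r *Restore) error { return nil }
```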

Known follow-ups (out of scope for this PR)

  • + "buildPiTRBaseArgs" + is a minimal stub — the full PiTR arg set (env vars, TLS volumes, secrets, resources) lives in + "restoreManager.makeRestoreJobWithMode" + and is not replicated yet. PR 2 (or a follow-up) must integrate, either by calling + "rm.makeRestoreJobWithMode" + and post-processing or by extracting a shared helper.
  • backup-manager CLI flag + "--replicationPhase" + , BR command + "--replication-storage-phase=N" + , + "--replication-status-sub-prefix=ccr" + , and + "--pitr-concurrency=1024" + — PR 2.
  • End-to-end integration tests requiring real internal BR — PR 2.

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test (handler state machine, gate, consistency check, no-op wrapper, label-based Job lookup, idempotence, controller informer reverse lookup, restore manager interception)
  • E2E test (deferred to PR 2 — gated on internal BR readiness)
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
  • Other side effects

RidRisR added 15 commits April 23, 2026 14:03
Continuation of the CompactBackup sharded PR. Defines operator-side
orchestration for replication restore: two-phase BR (snapshot + log)
with CompactBackup terminal state as the gate.

Differs from 2026-04-15 v2 design in five localized decisions:
- Phase/Condition names are intent-based: SnapshotRestore / LogRestore
  (status.phase), SnapshotRestored / CompactSettled (markers)
- ReplicationStep is string ("snapshot-restore" / "log-restore") not int
- No intermediate ReadyForLogRestore phase; two markers True is the gate
- statusSubPrefix removed from CRD, hardcoded to "ccr" constant
- Status write ownership: Option B (controller owns all Phase writes;
  backup-manager in phase-1 wrapped by ReplicationRestoreStatusUpdater),
  mirroring the ShardedCompactStatusUpdater pattern from the CompactBackup
  sharded PR. Eliminates the v2 transformation-layer hole where backup-
  manager's Running write during phase-1 would overwrite controller's
  Phase value.

Includes PR breakdown (types -> operator logic -> backup-manager CLI)
aligned with the "single spec / multi PR" strategy.
The original 3-PR breakdown split operator-side work into a types-only
PR and a logic PR. Types without consumers make for awkward review
("where does this field get used?"), and the logic PR is tightly
coupled to the types. Merge them: PR 1 is now the full operator-side
implementation (types + label + wrapper + handler + informer wire +
unit tests); PR 2 stays as backup-manager CLI + BR flags, which must
remain independent because it depends on kernel BR readiness and
integration testing.
…e path

Two post-implementation-exploration corrections to the spec:

1. The Restore status updater interface has a single Update(restore,
   condition, newStatus) method, not the 6-method shape of the Compact
   status updater I originally mirrored. Rewrote the wrapper code block
   as a simple no-op updater and spelled out the tradeoffs of dropping
   TimeStarted/Progress/CommitTs writes during phase-1.

2. UpdateRestoreCondition at pkg/apis/pingcap/v1alpha1/restore.go:73
   unconditionally assigns status.Phase = condition.Type. So the
   controller cannot use it to write SnapshotRestored / CompactSettled
   markers without overwriting Phase. Documented that markers go through
   a handler-internal appendRestoreMarker helper which touches only
   status.Conditions[], preserving the "UpdateRestoreCondition is
   untouched" guarantee.
10 bite-sized tasks covering the operator-side work: CRD types and
constants, label additions, no-op status wrapper, handler skeleton
with ReplicationStep-keyed dispatch, helpers (appendRestoreMarker
that preserves Phase, CompactBackup lookup with WaitTimeout,
cross-CR consistency check), the three state-machine cases (initial /
snapshot-restore / log-restore), restoreManager interception, and
Restore controller CompactBackup informer wire-up. Each task follows
TDD: failing test -> implementation -> passing test -> commit.

PR 2 (backup-manager CLI + BR --replication-storage-phase) is
covered by a separate plan because it gates on internal BR
readiness and runs against a different test harness.
Adds ReplicationConfig struct (CompactBackupName, WaitTimeout) under
RestoreSpec.ReplicationConfig, RestoreStatus.ReplicationStep string,
and four new RestoreConditionType constants: SnapshotRestore and
LogRestore (phase values) plus SnapshotRestored and CompactSettled
(markers). No behavior change in this commit; consumed by subsequent
handler and wrapper commits.
Labels are used to identify phase-1 vs phase-2 BR Jobs by selector
rather than name pattern, so renaming the Job doesn't break the
handler's idempotence checks.
In replication restore phase-1, backup-manager must not write
status.phase or conditions — the controller is the sole writer.
This wrapper drops all Update calls. Activated from the
backup-manager CLI in PR 2 via --replicationPhase=1; phase-2 uses
the real updater, matching the wrap-pattern shipped for CompactBackup
sharded mode.
Skeleton dispatches by Status.ReplicationStep (empty / snapshot-restore /
log-restore). Terminal restores are no-ops. Per-step logic arrives in
subsequent commits.
appendRestoreMarker writes markers to Conditions without touching Phase
(preserves SnapshotRestore / LogRestore phase values during marker
observation). lookupCompactBackup encapsulates late-binding with
WaitTimeout. checkCrossCRConsistency validates br.cluster/namespace and
storage location, ignoring secretName differences.
On empty ReplicationStep, the handler validates cross-CR consistency,
writes CompactSettled if CompactBackup is already terminal, advances
Phase to SnapshotRestore, sets ReplicationStep=snapshot-restore, and
creates the phase-1 BR Job with replication-step label. Late-binding
and waitTimeout are handled via lookupCompactBackup.
Handler observes phase-1 Job (writes SnapshotRestored marker or fails
the Restore on JobFailed) and CompactBackup (writes CompactSettled on
terminal state). When both markers are True, transitions Phase to
LogRestore, sets ReplicationStep=log-restore, and creates the phase-2
BR Job.
In log-restore, backup-manager drives status.phase via the real
updater (no wrap in phase-2). Handler only guards against Job-level
failures that terminate before backup-manager writes a terminal
condition. makeReplicationBRJob builds phase-specific BR Jobs with
the correct labels and passes --replicationPhase={1,2} to
backup-manager (consumed in PR 2).

buildPiTRBaseArgs is intentionally a minimal stub: full PiTR arg
parity (env vars, TLS volumes, secrets) lives in
restoreManager.makeRestoreJobWithMode and integration with that
helper is deferred to a follow-up after Task 9 wires the handler
to the restoreManager.
When Spec.Mode==pitr and Spec.ReplicationConfig!=nil, syncRestoreJob
delegates to replicationHandler.Sync. All legacy single-Job machinery
below the interception point is skipped; standard PiTR, snapshot, and
volume-snapshot paths are unchanged.
Registers a CompactBackup EventHandler on the Restore controller that
enqueues Restores whose Spec.ReplicationConfig.CompactBackupName
matches the changed CompactBackup. Closes the observation loop so that
a late-binding Restore unblocks as soon as its referenced CompactBackup
is created, and so that a Restore in SnapshotRestore gates promptly
when CompactBackup reaches terminal state.
…nsistency check

- makeReplicationBRJob now uses label.NewRestore().Instance().RestoreJob().Restore()
  + merges restore.Labels, matching the standard restore Job builder pattern.
  External tooling that filters Jobs by app.kubernetes.io/managed-by=restore-operator
  or instance label will now find replication Jobs.
- Added a doc comment in syncSnapshotRestore explaining that cross-CR
  consistency is validated only once (at syncInitial), per spec section 5.
ti-chi-bot requested a review from shonge April 27, 2026 08:26
Contributor

ti-chi-bot Bot commented Apr 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jlerche for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot added the size/XXL label Apr 27, 2026

codecov-commenter commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 75.59682% with 92 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.77%. Comparing base (a16b2e8) to head (b6531ab).
⚠️ Report is 72 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6861      +/-   ##
==========================================
+ Coverage   57.19%   57.77%   +0.57%     
==========================================
  Files         259      263       +4     
  Lines       33233    38602    +5369     
==========================================
+ Hits        19008    22302    +3294     
- Misses      12291    14199    +1908     
- Partials     1934     2101     +167     
| Flag | Coverage Δ |
| --- | --- |
| unittest | 57.77% <75.59%> (+0.57%) ⬆️ |

RidRisR force-pushed the docs/replication-restore-design branch from f6be1ed to e00a8ce April 29, 2026 05:23
RidRisR added 7 commits April 29, 2026 07:53
…ally

Phase and ReplicationStep were written in two separate reconciles, leaving
a transient (Phase=LogRestore, Step=snapshot-restore) window observable to
external watchers and requiring two catch-up branches plus a bespoke
setReplicationStep with its own lister-staleness workaround.

Extend RestoreUpdateStatus with ReplicationStep so enterSnapshotRestore and
enterLogRestore can write Phase + condition + Step in a single Update.
Delete setReplicationStep and the two protocol-only catch-up branches; the
two step handlers now dispatch on Phase with no transient skew to absorb.

Align FakeRestoreConditionUpdater with the production path so future tests
that rely on RestoreUpdateStatus patches see the same semantics.
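
A rough sketch of the single-Update transition this commit describes, with a reduced updater signature and placeholder types:

```go
package restore

// Simplified stand-ins; the real Update also takes the Restore itself.
type RestoreCondition struct {
	Type   string
	Status string
}

type RestoreUpdateStatus struct {
	ReplicationStep *string // new field carried by this commit
}

type conditionUpdater interface {
	Update(condition *RestoreCondition, newStatus *RestoreUpdateStatus) error
}

// enterLogRestore writes Phase (via the condition), the marker state, and
// ReplicationStep in one status update, so watchers never observe
// Phase=LogRestore together with Step=snapshot-restore.
func enterLogRestore(u conditionUpdater) error {
	step := "log-restore"
	return u.Update(
		&RestoreCondition{Type: "LogRestore", Status: "True"}, // also drives status.phase
		&RestoreUpdateStatus{ReplicationStep: &step},
	)
}
```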
Job lookup in the replication handler used a label selector keyed on
restore.Name, which could mistake a leftover Job from a deleted same-name
predecessor (still GC'ing) for the current Restore's Job, and the lister-
List + Create pattern surfaced spurious AlreadyExists errors during fast
back-to-back reconciles.

makeReplicationBRJob now stamps the owning Restore's UID on each Job; both
ensureJobForStep and findJobForStep switch to Get-by-name and verify the
UID label. UID mismatch is treated as not-found by findJobForStep and as
an explicit "awaiting GC" error by ensureJobForStep, so Phase is never
advanced based on a stale predecessor's Job conditions.
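
A minimal sketch of the UID check, with a simplified Job type; the label key shown is illustrative, and the real code Gets the Job by name from a batch/v1 lister.

```go
package restore

import "fmt"

const labelRestoreUID = "tidb.pingcap.com/restore-uid" // illustrative key

type Job struct {
	Labels map[string]string
}

// findJobForStep treats a UID mismatch as not-found, so a leftover Job from
// a deleted same-name Restore can never advance the current Restore's Phase.
func findJobForStep(job *Job, restoreUID string) (*Job, bool) {
	if job == nil || job.Labels[labelRestoreUID] != restoreUID {
		return nil, false
	}
	return job, true
}

// ensureJobForStep surfaces the mismatch instead of creating a duplicate;
// the caller requeues until the predecessor's Job is garbage-collected.
func ensureJobForStep(existing *Job, restoreUID string) error {
	if existing != nil && existing.Labels[labelRestoreUID] != restoreUID {
		return fmt.Errorf("job owned by previous restore (uid %q) is awaiting GC",
			existing.Labels[labelRestoreUID])
	}
	return nil // safe to adopt the existing Job or create a new one
}
```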
The cross-CR consistency check between Restore and CompactBackup compared
only bucket/prefix/region for S3 (and bucket/prefix for GCS, container/
prefix for Azblob). Two CRs could share those fields while pointing at
different physical backends — e.g. same bucket name on AWS vs an internal
Ceph/MinIO cluster, or the same path under different GCS projects — which
would silently let phase-2 BR consume compacted metadata produced against
a different storage system.

Add the missing location-defining fields (S3 Provider/Endpoint/Path,
GCS ProjectId/Location/Path, Azblob StorageAccount/Path) to the compare,
keep credentials and upload-only attributes (SecretName, SasToken,
StorageClass, Acl, SSE, AccessTier) ignored.
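
A sketch of the S3 branch of the tightened comparison, using a simplified location struct rather than the real storage provider types:

```go
package restore

import "fmt"

// s3Location holds only the location-defining fields; credentials and
// upload-only attributes (SecretName, StorageClass, Acl, SSE) stay ignored.
type s3Location struct {
	Provider string
	Endpoint string
	Bucket   string
	Prefix   string
	Region   string
	Path     string
}

// checkS3Consistency rejects CRs whose storage fields differ in any
// location-defining dimension, not just bucket/prefix/region.
func checkS3Consistency(restoreLoc, compactLoc s3Location) error {
	if restoreLoc != compactLoc {
		return fmt.Errorf("Restore and CompactBackup reference different S3 locations: %+v vs %+v",
			restoreLoc, compactLoc)
	}
	return nil
}
```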
- compactIsTerminal: warn on unrecognized CompactBackup state and
  explicitly enumerate known non-terminal states, so a future kernel
  state surfaces in logs instead of silently stalling the gate.
- lookupCompactBackup: document that WaitTimeout is measured from
  Restore.CreationTimestamp, not from the first NotFound observation.
- failRestore: document that ReplicationStep is intentionally preserved
  on terminal failure so triage can tell phase-1 vs phase-2 failures
  apart at a glance.
- Drop the B<n>_ cascade-branch prefix from test names and the matching
  section headers; cross-references to spec branches inside test bodies
  remain as documentation anchors.
Sentinel 0 means "not a replication restore" (CLI flag unset);
non-zero values 1/2 are the only other legal values. Range
validation lives in the cmd package (Task 2).

Also adds the PR 2 implementation plan.
…dation

Sentinel 0 = "not a replication restore" (cobra default when flag is
absent). Reject {-N, >=3} early in runRestore so the operator gets a
clean error instead of silently mishandling the value downstream.
Symmetric to compact.go's ShardedCompactStatusUpdater wrap, but driven
by the CLI flag (not CR.Spec) because the same Restore CR runs through
backup-manager twice with different phases. selectRestoreStatusUpdater
returns the no-op wrapper for phase 1; phase 0 / 2 pass the real
updater through unchanged.
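
A sketch of the selection and validation, with a one-method stand-in for the real updater interface:

```go
package backup

import "fmt"

// statusUpdater is a simplified stand-in for the restore status updater.
type statusUpdater interface {
	Update() error
}

type noopRestoreStatusUpdater struct{}

func (noopRestoreStatusUpdater) Update() error { return nil } // controller owns status.phase in phase 1

// validateReplicationPhase mirrors the early check in runRestore:
// 0 = not a replication restore, 1 = snapshot phase, 2 = log phase.
func validateReplicationPhase(phase int) error {
	if phase < 0 || phase > 2 {
		return fmt.Errorf("invalid --replicationPhase %d (must be 0, 1 or 2)", phase)
	}
	return nil
}

// selectRestoreStatusUpdater wraps only phase 1; phase 0 and phase 2 pass
// the real updater through unchanged.
func selectRestoreStatusUpdater(phase int, real statusUpdater) statusUpdater {
	if phase == 1 {
		return noopRestoreStatusUpdater{}
	}
	return real
}
```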
RidRisR added 3 commits April 30, 2026 09:54
Three BR flags are appended when --replicationPhase > 0:
--replication-storage-phase, --replication-status-sub-prefix=ccr,
--pitr-concurrency=1024. Constants live in restore.go (BR call
details; not exposed via API per spec §6).

restoreData itself takes only a one-line edit to call the new
replicationBRFlags helper; phase 0 returns nil so append is a no-op
for standard PiTR. The 200-line restoreData body is otherwise
unchanged.
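
A sketch of the helper; the flag strings come from this commit message, while the constant names are illustrative:

```go
package backup

import "fmt"

const (
	replicationStatusSubPrefix = "ccr"  // hardcoded per spec §6, not exposed via API
	replicationPiTRConcurrency = 1024
)

// replicationBRFlags returns the extra BR flags for replication restore.
// Phase 0 returns nil, so appending to the standard PiTR arg list is a no-op.
func replicationBRFlags(phase int) []string {
	if phase <= 0 {
		return nil
	}
	return []string{
		fmt.Sprintf("--replication-storage-phase=%d", phase),
		"--replication-status-sub-prefix=" + replicationStatusSubPrefix,
		fmt.Sprintf("--pitr-concurrency=%d", replicationPiTRConcurrency),
	}
}
```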
Reorder so pm.Enable runs before the replication interception. Both
PiTR variants (standard PiTR and replication restore) now wait for
TiKV's gc.ratio-threshold = -1 override to propagate via ConfigMap
before any BR Job is launched.

The override itself is written by tikvMemberManager.applyPiTRConfigOverride
in the TiKV reconcile loop, which already covers replication restores
because they satisfy Spec.Mode == PiTR. No TiKV-side change is needed.

The replication routing test is updated to seed a -1 ConfigMap up
front; a new test exercises the gating behavior end-to-end (closed
gate => handler not called; open gate => handler called once).
Replace the stub buildPiTRBaseArgs with an injected job builder
(production = rm.makeRestoreJobWithMode) plus a small
applyReplicationPhase post-processor. Phase-1 and phase-2 Jobs now
inherit the full PiTR Job setup: TiDBBackupManagerImage main
container, BR-binary init container with br-bin shared volume, TLS
volume mounts, password / storage env, owner reference. Replication
adds three mutations on top: Job name override, replication-step +
restore-UID labels, --replicationPhase arg.

NewRestoreManager is two-stepped so it can pass a function reference
to rm.makeRestoreJobWithMode into newReplicationHandler. The handler
takes a jobBuilderFunc as a constructor param; tests inject a small
fake builder via newHandlerForTest.
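
A sketch of the builder injection and post-processing, with a simplified Job type; the label keys and name override below are illustrative:

```go
package restore

import "fmt"

// Simplified Job; in production this is a batch/v1 Job built by
// rm.makeRestoreJobWithMode, and tests inject a fake via newHandlerForTest.
type Job struct {
	Name   string
	Labels map[string]string
	Args   []string
}

type jobBuilderFunc func(restoreName string) (*Job, error)

type replicationHandler struct {
	buildJob jobBuilderFunc
}

// applyReplicationPhase layers the replication-specific mutations on top of
// the fully built PiTR Job: name override, step + UID labels, phase arg.
func applyReplicationPhase(job *Job, step, restoreUID string, phase int) {
	job.Name = fmt.Sprintf("%s-%s", job.Name, step)
	if job.Labels == nil {
		job.Labels = map[string]string{}
	}
	job.Labels["tidb.pingcap.com/replication-step"] = step
	job.Labels["tidb.pingcap.com/restore-uid"] = restoreUID
	job.Args = append(job.Args, fmt.Sprintf("--replicationPhase=%d", phase))
}

func (h *replicationHandler) makeReplicationBRJob(restoreName, step, uid string, phase int) (*Job, error) {
	job, err := h.buildJob(restoreName) // full PiTR setup: image, init container, TLS, env
	if err != nil {
		return nil, err
	}
	applyReplicationPhase(job, step, uid, phase)
	return job, nil
}
```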
