feat: add clean_broken_retention CLI command #1371

Open

Slach wants to merge 11 commits into master from clean_broken_retention

Conversation

Slach (Collaborator) commented May 13, 2026

Summary

New CLI command clean_broken_retention that walks the top level of the remote `path` and `object_disks_path` and batch-deletes (with retry, via the existing BatchDeleter pipeline) every entry that is not present in the live BackupList and not matched by any --keep=<glob>.

  • Whitelist = live remote backups ∪ user-provided --keep globs (path.Match syntax, repeatable; see the sketch after this list)
  • Dry-run by default; --commit performs the actual deletes
  • Reuses BackupDestination.RemoveBackupRemote (for path) and cleanBackupObjectDisks (for object_disks_path) — both already use batched delete with retry/exponential-backoff
  • Motivation: clean up orphans left behind by failed retention runs, e.g. the GCS 503 scenario from #1356 ("batching deletion failures need retry"); the retry fix prevents new orphans, while this command sweeps up existing ones
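
A minimal sketch of that keep-set check (illustrative names, assuming Go's path.Match for the globs; not the exact code in this PR):

package main

import (
	"fmt"
	"path"
	"strings"
)

// isProtected reports whether a top-level remote entry should survive:
// it is either a live backup known to BackupList or matches a --keep glob.
func isProtected(entry string, liveBackups map[string]struct{}, keepGlobs []string) bool {
	name := strings.Trim(entry, "/") // normalize slashes emitted by Walk
	if _, ok := liveBackups[name]; ok {
		return true
	}
	for _, glob := range keepGlobs {
		// --keep uses path.Match syntax and may be repeated
		if matched, err := path.Match(glob, name); err == nil && matched {
			return true
		}
	}
	return false
}

func main() {
	live := map[string]struct{}{"daily-2026-05-12": {}}
	fmt.Println(isProtected("/orphan-1", live, []string{"prod-*"}))    // false: candidate for batched deletion
	fmt.Println(isProtected("prod-weekly/", live, []string{"prod-*"})) // true: kept by --keep glob
	fmt.Println(isProtected("daily-2026-05-12", live, nil))            // true: live backup
}

Everything this check rejects would be handed to the batched, retrying deleter.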

Usage

# preview what would be deleted
clickhouse-backup clean_broken_retention

# preserve extra backups via globs, then actually delete
clickhouse-backup clean_broken_retention --keep='prod-*' --keep='snapshot-2026-??-*' --commit

Test plan

  • CI green (unit + integration)
  • Manual dry-run on a bucket containing a known orphan — verify it is listed without being deleted
  • Manual --commit run — verify the orphan is removed from both path and object_disks_path, batched, with retry on transient 5xx

Slach added 2 commits May 13, 2026 23:24
Walks the top level of remote `path` and `object_disks_path` and batch-deletes
(with retry, via the existing BatchDeleter pipeline) every entry that is
not present in the live BackupList and not matched by any --keep=<glob>.

Dry-run by default; --commit performs the deletes. Useful for cleaning
up orphans left behind by failed retention runs (e.g. the GCS 503
scenario from #1356).
Plant orphans directly under /minio/data/clickhouse for both `path`
and `object_disks_path`, then verify:
  - dry-run lists them without deleting,
  - --commit removes them from both locations,
  - --keep=<glob> preserves matching entries,
  - the live backup is never touched.
Slach modified the milestone: 2.7.0 May 13, 2026
Slach added 9 commits May 13, 2026 23:51
…ckends

Table-driven, with one sub-test per backend:
  - S3 (minio), SFTP (sshd), FTP (proftpd), GCS_EMULATOR (fake-gcs-server)
    use direct container-FS plant/assert,
  - AZBLOB plants via az-cli docker run against azurite,
  - real GCS and COS skip themselves unless GCS_TESTS / QA_TENCENT_SECRET_KEY
    is set (their plant helpers fail loudly if reached without infra).

For each backend: create a real backup that must survive, plant 3 orphans
in path/object_disks_path, verify dry-run lists them without deleting,
verify --commit + --keep glob deletes only unmatched orphans, then a
second --commit clears the rest.
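
The table layout, roughly (a sketch with assumed field names; a later commit below splits this into one top-level test per backend):

package integration

import "testing"

// backendCase is an assumed shape for the per-backend table entries.
type backendCase struct {
	name   string
	skipIf func() bool                        // real GCS/COS skip without credentials
	plant  func(t *testing.T, relPath string) // put an orphan into the backend
	exists func(t *testing.T, relPath string) bool
}

func TestCleanBrokenRetention(t *testing.T) {
	cases := []backendCase{
		{name: "S3"}, {name: "SFTP"}, {name: "FTP"}, {name: "GCS_EMULATOR"},
		{name: "AZBLOB"}, {name: "GCS"}, {name: "COS"},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if tc.skipIf != nil && tc.skipIf() {
				t.Skip("required credentials are not set")
			}
			// create a real backup, plant 3 orphans in path/object_disks_path,
			// verify dry-run lists them, then --commit with a --keep glob,
			// then a final --commit clears the rest
		})
	}
}
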
…roken_retention

GCS: spins up google/cloud-sdk:slim with --volumes-from clickhouse-backup
to access /etc/clickhouse-backup/credentials.json, authenticates the
service account, and uses gsutil cp/ls.

COS: spins up amazon/aws-cli:latest with --endpoint-url pointing at the
regional COS endpoint and QA_TENCENT_SECRET_ID/KEY as AWS credentials,
since COS supports the S3 API. Setup hook renders config.yml from the
copied template via envsubst.

Both run only when their credential env var is set.
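
For the GCS side, the plant helper might look roughly like this (the helper name and exact shell line are assumptions; the container image, --volumes-from target and credentials path are as described above):

package integration

import (
	"fmt"
	"os/exec"
)

// plantGCSOrphan copies a local file into the bucket by running gsutil inside
// a google/cloud-sdk container that shares volumes with the backup container,
// so it can read /etc/clickhouse-backup/credentials.json.
func plantGCSOrphan(bucket, key, localFile string) error {
	script := fmt.Sprintf(
		"gcloud auth activate-service-account --key-file=/etc/clickhouse-backup/credentials.json && gsutil cp %s gs://%s/%s",
		localFile, bucket, key)
	out, err := exec.Command("docker", "run", "--rm",
		"--volumes-from", "clickhouse-backup",
		"google/cloud-sdk:slim", "sh", "-c", script).CombinedOutput()
	if err != nil {
		return fmt.Errorf("plant via gsutil failed: %w\n%s", err, out)
	}
	return nil
}
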
- Collapse the three container-FS factories (S3/SFTP/FTP/GCS_EMULATOR)
  into a single containerFSCase helper.
- Extract dockerRunSh and an azList closure to remove repetition in the
  AZBLOB case; reuse a single gsutil/awsRun wrapper for GCS and COS.
- Drop the redundant storageType field; derive the table name from `name`.
- Loop the plant+assertExists step over a small table instead of
  open-coding four calls.
- Pull repeated keep-glob string into a const, drop fmt.Sprintf with no
  format args.
Signed-off-by: slach <bloodjazman@gmail.com>
…ash in Walk names

The command previously left orphans untouched because:
1. `BackupList(ctx, false, ...)` uses the on-disk metadata cache; on the
   second invocation (e.g. dry-run → --commit) the cache returns orphan
   directories with Broken="", so my Broken!="" filter let them into the
   keep-set and zero orphans were detected. Now passes parseMetadata=true
   so every top-level entry is stat'd for metadata.json on each call.
2. `bd.Walk("/", false, …)` emits names with a leading slash on S3 (from
   TrimPrefix mismatch); the previous `strings.Contains(name, "/")` filter
   rejected them as nested. Switched to strings.Trim("/") so leading and
   trailing slashes are both stripped before the top-level check.
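
A minimal illustration of the slash handling in point 2 (simplified, not the actual filter code):

package main

import (
	"fmt"
	"strings"
)

func main() {
	// On S3, Walk can emit "/orphan-1"; the old check saw the leading slash
	// as nesting and silently skipped the entry.
	name := "/orphan-1"

	oldTopLevel := !strings.Contains(name, "/") // false: orphan wrongly ignored

	// New behaviour: strip leading/trailing slashes first, then check for nesting.
	trimmed := strings.Trim(name, "/")
	newTopLevel := !strings.Contains(trimmed, "/") // true: orphan is considered

	fmt.Println(oldTopLevel, newTopLevel) // false true
}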

test/integration:
- Split TestCleanBrokenRetention into one top-level test per backend so
  each can be run independently via RUN_TESTS=TestCleanBrokenRetention<Backend>.
- S3 case now plants orphans via 'mc cp' instead of raw file writes —
  MinIO only sees objects placed through its S3 API; direct files on disk
  are invisible to ListObjectsV2 and the cleanup batch finds nothing.
Signed-off-by: slach <bloodjazman@gmail.com>
…lean_broken_retention

# Conflicts:
#	test/testflows/clickhouse_backup/tests/snapshots/cli.py.cli.snapshot
Signed-off-by: slach <bloodjazman@gmail.com>