Skip to content

[flink-action][server][client] add orphan files cleanup action for remote storage #3403

@platinumhamburg

Description

@platinumhamburg

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Fluss persists log segments and KV snapshots to remote storage (S3, OSS, HDFS, etc.). Over time, orphan files accumulate — files that consume storage costs but are no longer referenced by any live metadata. These originate from:

  1. Failed commits. Files written to remote storage but never registered in coordinator metadata (process crash, network partition).
  2. Interrupted deletions. Metadata removed but corresponding remote files not yet cleaned up due to transient failures or leader failover.
  3. Superseded manifests. Remote log manifests use upsert semantic — old manifest files are left behind after replacement.

Without a dedicated cleanup mechanism, orphan files grow monotonically. In production clusters with high write throughput, this can reach terabytes of wasted capacity within weeks.

Solution

Introduce a Flink-based orphan_files_clean action that safely identifies and deletes orphan remote files via a 3-stage DAG:

  • ScopeEnumerator (p=1): Queries coordinator via two new read-only RPCs (ListRemoteLogManifests, ListKvSnapshots) to build the active reference set. Detects orphan table/partition directories via ID guards. Emits per-bucket CleanTask items.
  • ScanAndClean (p=N): Walks log/kv directories on remote storage, applies rule-based file classification against the active set, and deletes orphan files with rate limiting.
  • StatsAggregate (p=1): Collects cleanup stats and performs a final empty-directory sweep.

CLI usage:

bin/flink run fluss-flink-action.jar orphan_files_clean \
    --bootstrap-server host:port \
    --database mydb \
    --dry-run \
    --older-than "2025-05-26 00:00:00" \
    --delete-rate-limit-per-second 100

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels
    No fields configured for Feature.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions