Description
Version: 26.1.4.20001 (altinity build)
Introduced by: PR #1402
How to run the test:
```
./regression.py --clickhouse https://altinity-build-artifacts.s3.amazonaws.com/REFs/antalya-26.1/9b978b90baa1fd20e917a17784f8ec8c22265fd3/build_amd_release/clickhouse-common-static_26.1.4.20001.altinityantalya_amd64.deb --clickhouse-version 26.1.4.20001 -l test.log --storage minio --only "/s3/minio/export tests/export part/concurrent alter/during minio interruption/*" --as-binary
```
Summary
When ALTER TABLE ... EXPORT PART ... TO TABLE targets an S3-backed table and the S3 endpoint becomes unreachable, the background export tasks retry internally for an extremely long time (~50 minutes each), consuming all background executor slots. No new export operations can be scheduled until the stuck tasks complete. Additionally, DROP TABLE on the source table hangs while tasks are in flight.
What we were testing
The test validates that concurrent ALTER TABLE ... EXPORT PART operations behave correctly when the S3 destination (MinIO) is interrupted mid-operation. This is a network resilience scenario — the expectation is that export operations should either fail promptly or recover once the destination comes back, not permanently block the executor.
Test procedure
- Create a partitioned MergeTree source table with 5 partitions (10 parts total)
- Create an S3-backed destination table pointing to MinIO
- Kill the MinIO container (`docker kill --signal=KILL`)
- Export all 10 parts sequentially via `ALTER TABLE ... EXPORT PART`
- Start MinIO back up and verify data
What happened
Phase 1 — Parts accepted, then rejected (18:07:07)
9 parts were accepted into the background executor. The 10th was immediately rejected:
```
# clickhouse-server.log — 9 parts accepted in rapid succession (~300ms window)
18:07:07.172 <Debug> executeQuery: ALTER TABLE source_... EXPORT PART '1_1_1_0' TO TABLE s3_... (stage: Complete)
18:07:07.207 <Debug> executeQuery: ALTER TABLE source_... EXPORT PART '1_2_2_0' TO TABLE s3_... (stage: Complete)
18:07:07.248 <Debug> executeQuery: ALTER TABLE source_... EXPORT PART '2_3_3_0' TO TABLE s3_... (stage: Complete)
18:07:07.276 <Debug> executeQuery: ALTER TABLE source_... EXPORT PART '2_4_4_0' TO TABLE s3_... (stage: Complete)
18:07:07.310 <Debug> executeQuery: ALTER TABLE source_... EXPORT PART '3_5_5_0' TO TABLE s3_... (stage: Complete)
18:07:07.349 <Debug> executeQuery: ALTER TABLE source_... EXPORT PART '3_6_6_0' TO TABLE s3_... (stage: Complete)
18:07:07.382 <Debug> executeQuery: ALTER TABLE source_... EXPORT PART '4_7_7_0' TO TABLE s3_... (stage: Complete)
18:07:07.412 <Debug> executeQuery: ALTER TABLE source_... EXPORT PART '4_8_8_0' TO TABLE s3_... (stage: Complete)
# ^^^ 8 background threads now occupied
# 10th part — rejected immediately (1ms later):
18:07:07.468 <Error> executeQuery: Code: 236. DB::Exception: Failed to schedule export part task
for data part '5_10_10_0'. Background executor is busy. (ABORTED)
```
From this point on, every retry of 5_10_10_0 was rejected — 1960+ times over 5 minutes.
Phase 2 — Background threads stuck in S3 retries
All 8 executor threads entered ExportPartTask::executeStep() and began retrying S3 uploads against the dead MinIO. The S3 client is configured for 501 retries, each timing out after ~6 seconds:
```
# clickhouse-server.log — S3 retries begin immediately
18:07:07.673 [ 2188 ] S3ClientRetryStrategy: Attempt 1/501 failed with retryable error: ...,
Timeout: connect timed out: 172.21.0.8:9001
18:07:07.708 [ 2180 ] S3ClientRetryStrategy: Attempt 1/501 failed with retryable error: ...,
Timeout: connect timed out: 172.21.0.8:9001
18:07:07.750 [ 2185 ] S3ClientRetryStrategy: Attempt 1/501 failed with retryable error: ...,
Timeout: connect timed out: 172.21.0.8:9001
# ~4 minutes later, still retrying:
18:11:28.110 [ 2188 ] S3ClientRetryStrategy: Attempt 50/501 failed ...
18:11:28.160 [ 2180 ] S3ClientRetryStrategy: Attempt 50/501 failed ...
# ~10 minutes later, server is killed while tasks are only at attempt 108/501:
18:17:15.290 [ 2180 ] S3ClientRetryStrategy: Attempt 108/501 failed ...
18:17:15.290 [ 2188 ] S3ClientRetryStrategy: Attempt 108/501 failed ...
```
8 threads, 8 parts, all stuck in parallel — each would need 501 retries * ~6s = ~50 minutes to drain:
| Part | Thread | Stuck in |
|---|---|---|
| `1_1_1_0` | 2188 | `ExportPartTask::executeStep()` → S3 PutObject retry loop |
| `1_2_2_0` | 2180 | same |
| `2_3_3_0` | 2185 | same |
| `2_4_4_0` | 2187 | same |
| `3_5_5_0` | 2183 | same |
| `3_6_6_0` | 2184 | same |
| `4_7_7_0` | 2186 | same |
| `4_8_8_0` | 2182 | same |
Phase 3 — Server shutdown, DROP TABLE hung
The server was shut down at 18:17:19 while tasks were still at attempt ~108/501. Prior to that, DROP TABLE on the source table hung for 300s because the background tasks held references to it:
```
# clickhouse-server.log — shutdown sequence while tasks still active
18:17:19.266 <Debug> Context: Shutting down merges executor
18:17:19.266 <Debug> Context: Shutting down fetches executor
18:17:19.266 <Debug> Context: Shutting down moves executor
18:17:19.266 <Debug> Context: Shutting down common executor
```
Root cause
Three issues combine to produce this failure:
1. **No cancellation mechanism for in-flight export tasks.** Once an `ExportPartTask` is scheduled, it cannot be cancelled — not by the client, not by `DROP TABLE`, and not by any timeout. PR #1402 added an `isCancelled()` check before `exec.execute()`, but the S3 retry loop runs inside `exec.execute()` and does not check the cancellation flag between retries.
2. **Excessive S3 retry budget.** The `S3ClientRetryStrategy` allows 501 retries with ~6-second connect timeouts per retry, meaning a single stuck task blocks a background thread for ~50 minutes.
3. **Hard rejection when the executor is full.** New export requests get `Code: 236 (ABORTED)` with no option to queue or wait, so the subsystem is completely unavailable until the stuck tasks drain.
Impact
- The entire `EXPORT PART` subsystem becomes unavailable for up to ~50 minutes after a transient S3 outage
- `DROP TABLE` on affected tables hangs while tasks are in flight (300s in our runs)
- No user-facing way to cancel stuck export tasks or reclaim executor slots
Reproducibility
Deterministic: reproduced on two consecutive runs with identical behavior.
PR #1402 — "Improvements to partition export" (merged 2026-03-07)
This PR introduced the bug. The critical change is in `MergeTreeData::exportPartToTable()`, which rewrote how export part tasks are scheduled:
**Before #1402** — export parts used a lazy, trigger-based model. `exportPartToTable()` added a manifest to the `export_manifests` set and called `background_moves_assignee.trigger()`. The background assignee would later pick up unprocessed manifests in `scheduleDataMovingJob()`, one at a time, interleaved with regular data-move jobs. This prevented executor saturation.

```cpp
// OLD: just store manifest + trigger
export_manifests.emplace(std::move(manifest));
background_moves_assignee.trigger();
```

**After #1402** — `exportPartToTable()` creates the task eagerly and schedules it directly via `scheduleMoveTask()`. If the executor is full, it throws `Code: 236 (ABORTED)`. Every `ALTER TABLE ... EXPORT PART` call immediately occupies a background executor slot, and rapid sequential calls saturate all slots before any task completes.
```cpp
// NEW: create task + schedule immediately, throw if full
manifest.task = std::make_shared<ExportPartTask>(*this, manifest);
if (!background_moves_assignee.scheduleMoveTask(manifest.task))
{
    export_manifests.erase(manifest);
    throw Exception(ErrorCodes::ABORTED,
        "Failed to schedule export part task for data part '{}'. Background executor is busy",
        part_name);
}
```

PR #1402 also removed the export-part scheduling from `scheduleDataMovingJob()` entirely — the old fallback loop that iterated `export_manifests` and scheduled idle ones was deleted. The only path to schedule an export task is now the inline path, with no backpressure.