[WIP] Feature: Support TRUNCATE TABLE for Iceberg engine #1529
il9ue wants to merge 11 commits into antalya-26.1 from
Conversation
This commit introduces native support for the TRUNCATE TABLE command for the Iceberg database engine. Execution no longer throws a NOT_IMPLEMENTED exception for DataLake engines.

To align with Iceberg's architectural standards, this is a metadata-only operation. It creates a new snapshot with an explicitly generated, strictly typed empty Avro manifest list, increments the metadata version, and performs an atomic catalog update.

File changes:
- StorageObjectStorage.cpp: Remove the hardcoded exception and delegate to data_lake_metadata->truncate().
- IDataLakeMetadata.h: Introduce supportsTruncate() and truncate() virtual methods.
- IcebergMetadata.h/cpp: Implement the Iceberg-specific metadata truncation, empty manifest list generation via MetadataGenerator, and the atomic catalog swap.
- tests/integration/: Add PyIceberg integration tests.
- tests/queries/0_stateless/: Add stateless SQL tests.
arthurpassos
left a comment
I haven't implemented anything for Iceberg yet, though it is on my todo list. I left two small comments for now.
I also looked at the transactional model and it looks ok (assuming I understood it correctly).
My understanding of the Iceberg + catalog transactional model is that updating the catalog is the commit marker, and if it fails, the transaction isn't complete even if the new metadata files were already uploaded. Those become orphan and must be ignored. This also implies an Iceberg table should always be read through a catalog if one exists, otherwise it becomes hard to determine the latest metadata snapshot.
I'll read the code more carefully later.
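The commit-marker model described in this comment can be sketched in a few lines. This is only an illustration of the transactional contract, not the actual ClickHouse API; all names (commit_truncate, update_metadata, OrphanedCommitError) are made up for the sketch:

```python
class OrphanedCommitError(RuntimeError):
    """Raised when the catalog swap fails after files were already uploaded."""


def commit_truncate(catalog, table_id, new_metadata_path, upload):
    # Step 1: the new metadata/manifest files land in object storage first.
    upload(new_metadata_path)
    # Step 2: the atomic catalog pointer swap IS the commit marker.
    if not catalog.update_metadata(table_id, new_metadata_path):
        # The files uploaded in step 1 are now orphans; readers must
        # ignore them, which is why reads should go through the catalog.
        raise OrphanedCommitError(f"commit of {new_metadata_path} failed")
```

If the swap succeeds, the catalog points at the new metadata; if it fails, nothing observable changed even though files exist in storage.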
{
    throw Exception(ErrorCodes::NOT_IMPLEMENTED,
        "Truncate is not supported for data lake engine");
if (isDataLake())
Isn't isDataLake() the same as the above configuration->isDataLakeConfiguration()?
virtual bool supportsParallelInsert() const { return false; }

virtual void modifyFormatSettings(FormatSettings &, const Context &) const {}
virtual void modifyFormatSettings(FormatSettings & /*format_settings*/, const Context & /*local_context*/) const {}
I would not change this method, simply to avoid merge conflicts with upstream later on.
41eff19 to c1d252e
- Addressed Arthur's review comments (removed the redundant isDataLake check, reverted the IDataLakeMetadata.h signature).
- Removed the stateless SQL test entirely. Iceberg table bootstrapping requires external catalog initialization, which is fully covered by the PyIceberg integration tests.
c1d252e to 601250f
@arthurpassos Looking at the latest CI pipeline, there are a couple of red checks remaining, but they appear to be unrelated infrastructure flakes/upstream regressions.
Since the core logical truncation architecture is in place and the specific Iceberg integration tests are passing, I believe this is ready for another look whenever you have the time. Let me know if you need any further adjustments!

@codex review
    persistent_components.metadata_compression_method,
    persistent_components.table_uuid);

Int64 parent_snapshot_id = actual_table_state_snapshot.snapshot_id.value_or(0);
Why 0? I have seen "-1" in a few places https://github.com/ClickHouse/ClickHouse/blob/e3064cae1e27cafb79d37215a8d1d13a24c7801f/src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergWrites.cpp#L879 and https://github.com/ClickHouse/ClickHouse/blob/e3064cae1e27cafb79d37215a8d1d13a24c7801f/src/Databases/DataLake/RestCatalog.cpp#L1055.
Perhaps it is part of a specification?
const auto & [namespace_name, table_name] = DataLake::parseTableName(storage_id.getTableName());
bool success = catalog->updateMetadata(namespace_name, table_name, storage_metadata_name, metadata_object);
if (!success)
    throw Exception(ErrorCodes::LOGICAL_ERROR, "Failed to commit Iceberg truncate update to catalog.");
I don't think it should be a LOGICAL_ERROR. This will crash ClickHouse.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5a7e014507
if (catalog)
{
    const auto & [namespace_name, table_name] = DataLake::parseTableName(storage_id.getTableName());
    bool success = catalog->updateMetadata(namespace_name, table_name, storage_metadata_name, metadata_object);
Pass the generated snapshot to updateMetadata
truncate() builds new_snapshot but calls catalog->updateMetadata(..., metadata_object). For REST catalogs, RestCatalog::updateMetadata treats this argument as a snapshot and reads snapshot-id; table metadata objects do not have that field, so truncate fails before the request is sent. This makes TRUNCATE TABLE fail for Iceberg tables backed by REST catalogs.
if (catalog)
{
    const auto & [namespace_name, table_name] = DataLake::parseTableName(storage_id.getTableName());
    bool success = catalog->updateMetadata(namespace_name, table_name, storage_metadata_name, metadata_object);
Commit truncate with catalog-visible metadata path
The truncate commit passes storage_metadata_name to updateMetadata, but catalog updates need the catalog-facing metadata URI (as done in normal Iceberg writes). With Glue, this value is written to metadata_location; using an internal storage path instead of an s3://... URI leaves the catalog pointing at a non-resolvable location after truncate.
    persistent_components.metadata_compression_method,
    persistent_components.table_uuid);

Int64 parent_snapshot_id = actual_table_state_snapshot.snapshot_id.value_or(0);
Keep no-parent sentinel when table has no snapshot
Using value_or(0) fabricates parent snapshot id 0 for tables that currently have no snapshot. Catalog commits then assert main is at snapshot 0 instead of using the existing no-parent sentinel (-1 used in other Iceberg write paths), so truncating a freshly created empty table can fail with optimistic-lock checks.
# Assert PyIceberg reads the empty snapshot successfully
assert len(table.scan().to_arrow()) == 0
Perhaps it is a good idea to insert data again and check it can be read just to make sure we haven't broken anything?
…log support)
## Overview
Implements metadata-only TRUNCATE TABLE for the Iceberg database engine,
targeting REST catalog (transactional) backends. Leaves physical file
garbage collection to standard Iceberg maintenance operations.
## Root Cause Analysis & Fixes (7 bugs)
### Bug 1: Premature isTransactional() guard blocked REST catalog
RCA: A guard was added that threw NOT_IMPLEMENTED for any transactional
catalog (i.e. REST), which is the primary target of this feature.
Fix: Removed the guard entirely. REST catalogs are fully supported.
### Bug 2: Catalog commit block was deleted
RCA: The catalog->updateMetadata() call was removed alongside the guard,
leaving object storage updated but the REST catalog pointer never atomically
swapped. The table appeared truncated locally but the catalog still pointed
to stale metadata.
Fix: Restored the catalog commit block, building the catalog-visible URI
(blob_type://namespace/metadata_name for transactional catalogs) consistent
with the pattern in IcebergWrites.cpp and Mutations.cpp.
### Bug 3: FileNamesGenerator hardcoded is_transactional=false
RCA: The refactored truncate path hardcoded false for the isTransactional
flag, causing FileNamesGenerator to produce bare /path/ style filenames
while the REST catalog expected full s3://... URIs. This triggered a
FileNamesGenerator::convertMetadataPathToStoragePath() consistency check
(BAD_ARGUMENTS: 'Paths in Iceberg must use a consistent format').
Fix: Force full URI base (from f_location) when catalog->isTransactional()
is true, regardless of write_full_path_in_iceberg_metadata setting.
### Bug 4: value_or(0) used wrong no-parent sentinel
RCA: Using 0 as the no-parent snapshot ID fabricates a fake parent snapshot
ID for tables with no existing snapshot. REST catalog optimistic-lock checks
assert the current snapshot matches; using 0 instead of -1 (the Iceberg
spec sentinel) causes lock check failures on empty tables.
Fix: Changed value_or(0) to value_or(-1).
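In Python terms, the fixed sentinel logic behaves like this hypothetical helper (mirroring the C++ value_or call, not actual project code):

```python
def parent_snapshot_id(current_snapshot_id):
    """current_snapshot_id is an int, or None when the table has no snapshot.

    -1 is the no-parent sentinel used in the other Iceberg write paths;
    value_or(0) would instead fabricate a parent snapshot with id 0 and
    trip the catalog's optimistic-lock check on freshly created tables.
    """
    return current_snapshot_id if current_snapshot_id is not None else -1
```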
### Bug 5: updateMetadata passed metadata_object instead of new_snapshot
RCA: RestCatalog::updateMetadata() reads snapshot-id (a long) from its 4th
argument to build the commit request. The truncate path passed metadata_object
(full table metadata JSON) which has no top-level snapshot-id field. Poco
threw InvalidAccessException: 'Can not convert empty value' (POCO_EXCEPTION).
Fix: Pass new_snapshot (the generated snapshot object) to updateMetadata,
consistent with how IcebergWrites.cpp and Mutations.cpp call it.
### Bug 6: Empty manifest list wrote field-id-less Avro schema header
RCA: avro::DataFileWriter calls writeHeader() eagerly in its constructor,
committing the binary encoder state. Attempting writer.setMetadata() after
construction to inject the full Iceberg schema JSON (with field-id on each
field) corrupted the encoder's internal StreamWriter::next_ pointer to NULL,
causing a segfault in avro::StreamWriter::flush() on close(). Without the
override, the Avro C++ library strips unknown field properties (like field-id)
when reconstructing schema JSON from its internal node representation, causing
PyIceberg to reject the manifest list with:
ValueError: Cannot convert field, missing field-id: {'name': 'manifest_path'}
Fix: For empty manifest lists (TRUNCATE path), bypass DataFileWriter entirely
and write a minimal valid Avro Object Container File manually. This embeds
the original schema_representation string (with all field-ids intact) directly
into the avro.schema metadata entry. The Avro container format is:
[magic(4)] [meta_map] [sync_marker(16)]
with no data blocks for an empty manifest list. This avoids all contrib
library issues without modifying vendored code.
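The manual container write can be sketched as a stand-alone Python function. This is a simplified illustration of the layout the fix describes, not the actual C++ code; the encoding follows the Avro Object Container File specification (zig-zag varint longs, a string-to-bytes metadata map, then a 16-byte sync marker and zero data blocks):

```python
import io
import os


def zigzag_varint(n: int) -> bytes:
    """Avro long encoding: zig-zag, then 7-bit little-endian varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def avro_bytes(payload: bytes) -> bytes:
    """Avro 'bytes'/'string' value: length as a long, then the raw bytes."""
    return zigzag_varint(len(payload)) + payload


def empty_manifest_list(schema_json: str) -> bytes:
    """Minimal valid Avro Object Container File with zero data blocks.

    Embedding schema_json verbatim in avro.schema keeps Iceberg's
    field-id properties intact, which a round-trip through the library's
    internal schema nodes would otherwise strip.
    """
    buf = io.BytesIO()
    buf.write(b"Obj\x01")                        # OCF magic
    buf.write(zigzag_varint(2))                  # metadata map: 2 entries
    buf.write(avro_bytes(b"avro.schema"))
    buf.write(avro_bytes(schema_json.encode("utf-8")))
    buf.write(avro_bytes(b"avro.codec"))
    buf.write(avro_bytes(b"null"))
    buf.write(zigzag_varint(0))                  # end of metadata map
    buf.write(os.urandom(16))                    # 16-byte sync marker
    return buf.getvalue()                        # no data blocks follow
```

An empty manifest list is just the header: readers see a well-formed file whose schema carries every field-id and whose block stream is empty.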
### Bug 7: use_previous_snapshots=is_v2 copied old data into truncate snapshot
RCA: The generateManifestList() call in truncate passed is_v2 as the
use_previous_snapshots argument (true for v2 tables). This caused the
function to read and copy all manifest entries from the parent snapshot
into the new manifest list, defeating the truncation. PyIceberg then
returned 3 rows instead of 0.
Fix: Pass false explicitly for use_previous_snapshots in the truncate path.
Truncation must always produce an empty manifest list.
## Changes
- src/Databases/DataLake/RestCatalog.cpp
Improve error propagation: re-throw HTTP exceptions with detail text
instead of silently returning false from updateMetadata().
- src/Storages/ObjectStorage/DataLakes/Iceberg/Constant.h
Add field aliases: deleted_records (deleted-records) and
deleted_data_files (deleted-data-files) for truncate summary fields.
- src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergMetadata.cpp
Core truncate implementation: correct FileNamesGenerator construction,
correct parent snapshot sentinel (-1), restored catalog commit, manual
Avro container write for empty manifest list, correct use_previous_snapshots.
- src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergWrites.cpp
Fix updateMetadata() call to pass new_snapshot instead of metadata_object.
Add empty manifest list fast path with manual Avro container serialization.
- src/Storages/ObjectStorage/DataLakes/Iceberg/MetadataGenerator.cpp/.h
Add is_truncate parameter to generateNextMetadata(). When true, sets
operation=overwrite, zeroes out cumulative totals, and populates
deleted_records/deleted_data_files from parent snapshot summary for
spec-compliant truncate snapshot summary.
- src/Storages/ObjectStorage/DataLakes/Mutations.cpp
Fix updateMetadata() call to pass new_snapshot instead of metadata_object.
- tests/integration/test_storage_iceberg_no_spark/test_iceberg_truncate.py
New integration test: PyIceberg creates REST catalog table with 3 rows,
ClickHouse truncates via REST catalog, validates count=0 from both
ClickHouse and PyIceberg (cross-engine validation), then verifies table
remains writable by appending a new row.
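The snapshot summary shape produced by the MetadataGenerator change can be illustrated with a small sketch. truncate_summary is a hypothetical helper, but the field names follow the Iceberg snapshot summary convention used in the change description (operation=overwrite, zeroed totals, deleted-* populated from the parent's cumulative totals):

```python
def truncate_summary(parent_summary: dict) -> dict:
    # Iceberg summary values are strings; totals are zeroed and the
    # deleted-* fields reflect what the parent snapshot contained.
    return {
        "operation": "overwrite",
        "deleted-records": parent_summary.get("total-records", "0"),
        "deleted-data-files": parent_summary.get("total-data-files", "0"),
        "total-records": "0",
        "total-data-files": "0",
        "total-delete-files": "0",
        "total-files-size": "0",
    }
```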
## Test Results
19/19 passed across all backends (s3, azure, local) and all test cases
including the new test_iceberg_truncate.
RCA: IcebergStorageSink::initializeMetadata() was passing the full table
metadata object to catalog->updateMetadata(), but RestCatalog::updateMetadata()
expects a snapshot object with a top-level snapshot-id field. This caused
a Poco::InvalidAccessException ('Can not convert empty value') on every
INSERT into a REST catalog table.
Fix: Pass new_snapshot instead of metadata as the 4th argument, consistent
with the truncate path and Mutations.cpp.
[Feature] Support TRUNCATE TABLE for Iceberg Engine
Overview
As part of the Antalya release, v26.1 needs to natively support the
TRUNCATE TABLE command for the Iceberg database engine. Currently, upstream ClickHouse explicitly rejects this operation. As of PR ClickHouse#91713, executing TRUNCATE down-casts to StorageObjectStorage, where it immediately throws an ErrorCodes::NOT_IMPLEMENTED exception for Data Lake engines.
To support standard analytics workflows and testing pipelines without requiring users to DROP and recreate tables (which breaks catalog bindings), implementing a metadata-only truncation is essential.
Proposed Architecture
Unlike a standard MergeTree truncation that physically drops parts from the local disk, Iceberg truncation must be entirely logical. The implementation will leave physical file garbage collection to standard Iceberg maintenance operations and focus strictly on metadata manipulation.
Core Workflow:
- Override StorageObjectStorage::truncate to check data_lake_metadata->supportsTruncate().
- Generate the next metadata version file (v<N+1>.metadata.json).
- Append the new empty snapshot to the snapshots array, and update the snapshot-log and current-snapshot-id.
- Commit via the ICatalog interface (e.g., REST Catalog) to point the table to the newly generated metadata JSON.
Implementation Details
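The metadata transformation in that workflow can be sketched with plain dicts standing in for the real metadata objects (next_truncate_metadata and the field set are illustrative, not the actual implementation):

```python
def next_truncate_metadata(meta: dict, new_snapshot_id: int, manifest_list: str) -> dict:
    """Append an empty snapshot and advance current-snapshot-id."""
    parent = meta.get("current-snapshot-id", -1)  # -1: no parent snapshot yet
    snapshot = {
        "snapshot-id": new_snapshot_id,
        "parent-snapshot-id": parent,
        "manifest-list": manifest_list,  # points at the empty Avro manifest list
        "summary": {"operation": "overwrite", "total-records": "0"},
    }
    out = dict(meta)
    out["snapshots"] = list(meta.get("snapshots", [])) + [snapshot]
    out["snapshot-log"] = list(meta.get("snapshot-log", [])) + [
        {"snapshot-id": new_snapshot_id}
    ]
    out["current-snapshot-id"] = new_snapshot_id
    return out
```

The result is what gets serialized as v<N+1>.metadata.json before the catalog commit.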
The required changes span the following internal abstractions:
- src/Storages/ObjectStorage/StorageObjectStorage.h/cpp: Override truncate and remove the hardcoded throw Exception. Delegate to IDataLakeMetadata.
- src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.h: Introduce supportsTruncate() and truncate(ContextPtr, ICatalog) virtual methods.
- src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergMetadata.h/cpp: Implement the core truncation logic. Must safely obtain an IObjectStorage write buffer via context->getWriteSettings() to serialize the empty Avro file before committing the JSON metadata.
- tests/integration/.../test_iceberg_truncate.py: Added Python integration tests. (Note: stateless SQL tests are intentionally omitted, as ClickHouse ENGINE=Iceberg requires an externally initialized catalog to bootstrap, which is handled via PyIceberg in the integration suite.)
- TRUNCATE TABLE ice_db.my_table succeeds without throwing NOT_IMPLEMENTED.
- SELECT count() FROM ice_db.my_table returns 0 immediately after truncation.
- v<N+1>.metadata.json is successfully written to the object storage warehouse.