[feature](iceberg) Support reading Iceberg variant from Parquet #63192
eldenmoon wants to merge 1 commit into
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
/review
cab85b8 to e9e3bfd (Compare)
run buildall
/review
OpenCode automated review failed and did not complete. Error: the review step was skipped (possibly a timeout or cancellation). Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
/review
Summary:
I found one blocking issue in the added regression test: the local single-BE copy path does not match the file_path used by the local TVF, so the test can fail in the exact environment that branch is intended to support.
Critical checkpoint conclusions:
- Goal: add Iceberg/Parquet VARIANT read support, including shredded projection. The implementation and regression coverage mostly target that goal, but the new regression test has a path setup bug.
- Scope: the production changes are focused on Parquet schema parsing, variant reconstruction, column pruning, and Iceberg type mapping.
- Concurrency/lifecycle: no new shared mutable concurrent state or non-obvious lifecycle ownership issue found in the reviewed PR diff.
- Configuration/compatibility: no new config items or persisted storage-format changes found; FE/BE type mapping paths for Iceberg VARIANT are updated.
- Parallel paths: Hive, Iceberg, and local Parquet pruning paths were considered; the test issue is distinct from production pruning logic.
- Tests: regression coverage was added, but the local-file staging logic can make the new test fail before validating the feature.
- Observability/performance: added ParquetReadColumnPaths profile string is useful for validating pruning; no blocking observability or hot-path issue found beyond the test blocker.
User focus: no additional user-provided review focus was present.
I found a correctness blocker in the new shredded VARIANT pruning logic. The implementation prunes the unshredded value leaf whenever a matching typed_value path exists, but Iceberg shredded VARIANT can still carry residual/unrepresentable values for that field in value, so queries can silently return NULL or partial objects for those rows.
Critical checkpoint conclusions:
- Goal/test: the PR adds Iceberg v3 VARIANT reading and pruning tests, but the tests only cover fully typed shredded fields and do not prove residual fallback correctness.
- Scope/focus: the change is mostly focused on Parquet/Iceberg VARIANT support.
- Concurrency/lifecycle/config: no new concurrency, non-trivial lifecycle, or config behavior found in the reviewed PR diff.
- Compatibility: adds new type mapping; no storage-format persistence changes found.
- Parallel paths: the same pruning issue exists in both standalone/Hive Parquet and Iceberg Parquet helper paths.
- Tests: missing mixed shredded/residual cases where a selected typed path is absent or has an incompatible type in `typed_value` but exists in `value`.
- Observability/performance: profile string helps inspect selected leaves; no additional blocking observability issue found.
- Data correctness: blocking issue below can cause incorrect query results after column pruning.
No additional user-provided focus points were present.
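The residual-fallback rule behind this finding can be sketched in Python. This is an illustrative model of the Parquet Variant Shredding semantics, not Doris code; the function and row names are hypothetical:

```python
def reconstruct_shredded_field(typed_value, residual_value):
    # Per the Parquet Variant Shredding spec, at most one of
    # typed_value / value is non-null for a given row of a shredded field.
    if typed_value is not None:
        return typed_value      # row was representable as the typed type
    return residual_value       # residual row: requires the 'value' leaf

rows = [
    {"typed_value": 42, "value": None},     # fully typed row
    {"typed_value": None, "value": "n/a"},  # unrepresentable value kept in residual
]

# Correct read: keep both leaves and reconstruct per row.
correct = [reconstruct_shredded_field(r["typed_value"], r["value"]) for r in rows]

# The pruning described above drops the 'value' leaf whenever typed_value
# exists in the schema, silently turning residual rows into NULL.
pruned = [reconstruct_shredded_field(r["typed_value"], None) for r in rows]
```

The second list shows the silent-NULL failure mode: the typed row survives, but the residual row degrades.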
e9e3bfd to 5fe9ca5 (Compare)
run buildall
/review
1 similar comment
/review
Reviewed the full PR with the Doris code-review checklist. I found a blocking correctness issue in nested VARIANT pruning for deeper accesses under shredded fields. Goal/test: the PR adds Iceberg/Parquet VARIANT read support and includes a regression for top-level shredded field reads, but it does not cover deeper access through a shredded field residual. Scope: the change is focused, but the duplicated Hive/Iceberg pruning helper needs the same fix. Concurrency/lifecycle/config/transaction/persistence: no new concurrency, lifecycle, config, transaction, storage-format write, or persistence concerns found. Parallel paths: the same issue exists in both Hive/local Parquet and Iceberg Parquet helpers. Tests: existing tests cover unshredded reads and top-level shredded field pruning, but should add a mixed residual case such as v['metric']['x'] where metric is shredded and some rows store an object in typed_value.metric.value. Observability: the new profile string is useful for validating selected leaves. User focus: no additional user-provided review focus was supplied.
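The deeper-access case such as v['metric']['x'] can be sketched as follows; this is an illustrative model with hypothetical names, under the assumption that a shredded field's residual `value` may itself hold a decoded object:

```python
def read_subfield(typed_value_fields, residual_value, key):
    # A shredded field like 'metric' may expose sub-keys through its own
    # shredded typed_value group, or store a whole object in its residual
    # 'value'. Deeper access such as v['metric']['x'] must consult both,
    # so the residual leaf cannot be pruned away.
    if typed_value_fields is not None and key in typed_value_fields:
        return typed_value_fields[key]
    if isinstance(residual_value, dict):
        return residual_value.get(key)
    return None
```

Rows whose 'metric' object landed in the residual still resolve 'x' through the second branch.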
Review result: request changes.
Critical checkpoint conclusions:
- Goal/test coverage: the PR adds Iceberg VARIANT Parquet reading, type mapping, pruning observability, and regression/unit coverage. The main scenario is covered, but a case-sensitive key path is not covered and currently regresses correctness.
- Scope/focus: the change is focused on Iceberg/Parquet VARIANT support, though duplicated Hive/Iceberg pruning helpers carry the same issue.
- Concurrency/lifecycle: no new shared mutable state, threads, locks, or static initialization hazards found in the reviewed paths.
- Configuration/compatibility: no new configs or storage-format writes; this is a reader/type-mapping change. Mixed files with non-VARIANT types continue through existing paths.
- Parallel paths: the Hive/local and Iceberg Parquet pruning paths were both reviewed; both have the same case-sensitivity bug and are commented separately.
- Error handling/memory: Status returns in the new reader path are generally propagated; no ignored Status or untracked large persistent allocation issue found beyond the correctness issue raised.
- Data correctness: blocking issue found: shredded VARIANT field lookup lowercases user path components, so distinct keys such as `a` and `A` can be pruned/read as the same field.
- Observability/performance: the profile leaf-path observable is useful; no additional blocker found.
User focus: no additional user-provided review focus was specified.
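The case-sensitivity bug can be reduced to a small sketch. The helper below is hypothetical and only models the lookup semantics, assuming variant object keys are byte-wise case-sensitive:

```python
def select_shredded_leaves(shredded_keys, path_key, case_insensitive):
    # Variant object keys are case-sensitive; normalizing case during
    # field lookup conflates distinct keys such as 'a' and 'A'.
    if case_insensitive:
        return [k for k in shredded_keys if k.lower() == path_key.lower()]
    return [k for k in shredded_keys if k == path_key]

buggy = select_shredded_leaves(["a", "A"], "A", case_insensitive=True)
exact = select_shredded_leaves(["a", "A"], "A", case_insensitive=False)
```

The lowercasing variant matches both keys for a lookup of `A`, while the exact match selects only the intended one.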
5fe9ca5 to fa098c0 (Compare)
run buildall
/review
TPC-H: Total hot run time: 29804 ms
TPC-H: Total hot run time: 29785 ms
FE UT Coverage Report: Increment line coverage
I found two additional correctness issues in the Parquet VARIANT reconstruction path.
Critical checkpoint conclusions:
- Goal/test: The PR adds Iceberg v3 Parquet VARIANT reading and shredded column pruning, with regression and FE unit coverage for the common unshredded/shredded object paths. However, valid shredded array layouts and non-finite shredded floats are not covered and can decode incorrectly or fail.
- Scope/focus: The implementation is mostly focused on the new reader/pruning behavior, though the duplicated Hive/Iceberg pruning helper logic remains a maintainability risk rather than a blocker.
- Concurrency/lifecycle: The reviewed changes are per-reader/per-query state and do not introduce new shared mutable state, locks, background threads, or special static lifecycle dependencies.
- Config/compatibility: No new config items or storage-format writes are introduced. The change reads a standard Parquet/Iceberg format; mixed-version compatibility concerns are limited to reader capability.
- Parallel paths: Hive/local and Iceberg pruning paths are both updated. Standalone Parquet uses the Hive-style name pruning helper.
- Conditional checks: The main conditional logic around shredded `value`/`typed_value` follows the Parquet Variant Shredding spec for object fields, but the array element check currently misses a spec-valid layout.
- Tests/results: Existing tests cover top-level object shredding, deeper residual paths, case-sensitive keys, and profile observability. Missing coverage remains for arrays whose element group omits `value`, and for NaN/Inf in shredded float/double typed values.
- Observability: The added `ParquetReadColumnPaths` profile string is useful for pruning verification and appears lightweight.
- Transactions/persistence/data writes: Not applicable; this is read-path only.
- FE/BE variable passing: Iceberg type mapping and access-path rewriting are updated for VARIANT; no additional thrift variable propagation issue found.
- Performance: The JSON reconstruction path is inherently allocation-heavy but limited to VARIANT decoding. No additional performance blocker found beyond the correctness issues below.
No additional user-provided review focus was specified.
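The non-finite float concern comes from JSON itself: RFC 8259 JSON has no NaN/Infinity tokens, so a JSON-string reconstruction of a shredded float/double typed_value must special-case them. The sketch below is illustrative; the choice to emit null for non-finite values is one possible policy, not the PR's actual behavior:

```python
import json
import math

def shredded_float_to_json(x):
    # Standard JSON cannot represent NaN or +/-Infinity, so the
    # reconstruction must handle them explicitly instead of emitting
    # non-standard 'NaN'/'Infinity' literals or failing mid-scan.
    if math.isnan(x) or math.isinf(x):
        return "null"  # assumed policy for this sketch
    return json.dumps(x)
```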
TPC-DS: Total hot run time: 170353 ms
TPC-DS: Total hot run time: 169901 ms
fa098c0 to bf0c548 (Compare)
/review
run buildall
I found additional blocking compatibility issues for valid Parquet VARIANT shredding layouts that omit value when data is fully represented by typed_value. Existing review threads already cover non-finite typed floats, typed array elements, case-sensitive keys, and value/residual pruning, so I did not duplicate those.
Critical checkpoints:
- Goal/test: the PR aims to read Iceberg/Parquet VARIANT, including shredded layouts, and adds local TVF regression coverage, but coverage does not include typed-value-only top-level or nested shredded field groups.
- Scope/focus: the change is focused, but the schema/pruning logic is stricter than the Parquet shredding layout it is trying to support.
- Concurrency/lifecycle/config/transactions/persistence: no new concurrency, lifecycle, config, transaction, or persistence concerns found in the reviewed paths.
- Parallel paths: Hive/local and Iceberg pruning have duplicated logic; both need the typed-value-only fix.
- Compatibility/data correctness: current code rejects or prunes away valid typed-value-only shredded data, causing scan failure or null/missing results.
- Tests: existing tests cover unshredded/shredded happy paths and several pruning observables, but miss the typed-value-only layouts described in the inline comments.
- Observability/performance: no additional observability or performance blocker found beyond the added profile string being used by tests.
- User focus: no additional user-provided review focus was present.
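The typed-value-only compatibility point reduces to a shape check. A minimal sketch, assuming the Variant Shredding rule that a shredded field group may contain `value`, `typed_value`, or both:

```python
def shredded_group_shape_ok(has_value, has_typed_value):
    # A shredded field group is valid with 'value' only, 'typed_value'
    # only, or both. Requiring 'value' to be present (as the stricter
    # logic described above does) rejects valid typed-value-only files.
    return has_value or has_typed_value
```

A validator that instead demands `has_value and has_typed_value` is the failure mode this comment flags.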
TPC-H: Total hot run time: 29314 ms
run buildall
BE UT Coverage Report: Increment line coverage / Increment coverage report
I completed a full review of the PR diff and the existing review context. I did not find additional distinct inline issues beyond the already-open review threads, so I am not duplicating those comments. However, the existing threads still describe multiple blocking correctness risks in the VARIANT Parquet/Iceberg read and pruning paths, including missed whole-root VARIANT demands, false wrapper classification, residual/typed-value reconstruction gaps, MAP/ARRAY offset/null access handling, Iceberg field-id/path conversion edge cases, and test coverage gaps for reachable SQL/Iceberg table paths.
Critical checkpoint conclusions:
- Goal/test: the PR aims to support reading Iceberg VARIANT from Parquet with nested pruning and to reject unsupported Iceberg VARIANT writes. The implementation has broad tests, but existing review threads identify missing or unreachable coverage for several legal runtime paths.
- Scope/focus: the change is large and spans FE access-path collection, BE Parquet readers, Iceberg/Hive column-id pruning, write-path validation, and regressions; the size is understandable for the feature but increases risk.
- Concurrency/lifecycle: I did not identify a new distinct concurrency or lifecycle issue in the reviewed paths.
- Config/compatibility: no new config issue found. Compatibility concerns are mainly around Iceberg/Parquet path interpretation and write rejection behavior already called out.
- Parallel paths: existing threads cover several parallel-path gaps between local/Hive/Iceberg Parquet and FE/BE access path handling.
- Error handling/data correctness: existing threads include concrete data correctness and failure scenarios that can return pruned/missing VARIANT data or fail valid files.
- Tests/results: tests were added, but existing threads identify important missing coverage and at least some tests that do not exercise the intended reachable path.
- Observability/performance: no additional observability blocker found beyond the profile/test-state issue already raised.
- User focus: no additional user-provided review focus was supplied.
Given the unresolved correctness blockers already recorded inline, I cannot approve this PR yet.
FE UT Coverage Report: Increment line coverage
TPC-H: Total hot run time: 31250 ms
TPC-H: Total hot run time: 31098 ms
/review
run buildall
BE UT Coverage Report: Increment line coverage / Increment coverage report
TPC-DS: Total hot run time: 169133 ms
I found one remaining correctness issue in the FE access-path collection for VARIANT scalar consumers.
Critical checkpoint conclusions:
- Goal/test: the PR aims to support Iceberg/Parquet VARIANT reads and nested pruning; the new tests cover many projection/filter/pruning cases, but not scalar functions that consume a whole VARIANT while returning a non-VARIANT type.
- Scope/focus: the changes are broad across FE pruning and BE readers; the issue below is localized to access-path collection.
- Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, config, or persisted-format concern was identified for this finding.
- Parallel paths: output VARIANT expressions, casts, generators, dynamic subscripts, and aliases have special handling, but ordinary non-VARIANT scalar functions over VARIANT inputs do not.
- Test coverage: missing regression coverage combining a whole-root scalar consumer such as variant_type(v) with a sibling nested predicate.
- Data correctness/performance/observability: the issue is data correctness; it can make scans read only a selected subpath even though the scalar function needs the root value. No extra observability concern beyond fixing/covering this path.
User focus: no additional user-provided review focus was specified.
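The whole-VARIANT consumer issue can be modeled as an access-path collection sketch. The expression encoding below is hypothetical, not the Nereids representation; it only shows why a non-VARIANT-returning scalar function over a VARIANT input must demand the root path:

```python
def collect_access_paths(exprs):
    # A subpath access contributes its key path; a scalar function that
    # consumes the whole VARIANT, even when it returns a non-VARIANT type
    # (e.g. variant_type(v)), must demand the root, otherwise pruning
    # keeps only the sibling subpaths used by predicates.
    paths = set()
    for e in exprs:
        if e["kind"] == "subpath":
            paths.add(tuple(e["path"]))
        elif e["kind"] == "whole_variant_consumer":
            paths.add(())  # empty tuple stands for root access
    return paths

demanded = collect_access_paths([
    {"kind": "subpath", "path": ["a", "b"]},  # sibling predicate on v['a']['b']
    {"kind": "whole_variant_consumer"},       # e.g. variant_type(v)
])
```

If the second expression kind is not handled, the scan reads only `v['a']['b']` even though the scalar function needs the full root value.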
TPC-DS: Total hot run time: 168985 ms
run buildall
/review
TPC-H: Total hot run time: 30718 ms
I completed another full pass over the PR with the Doris code-review checklist and the existing review threads as prior context. I did not add new inline comments because the blocking issues I found are already covered by existing threads, and duplicating them would not add value.
Summary opinion: request changes until the existing correctness threads are resolved. The current change is a large cross-FE/BE feature around Iceberg/Parquet VARIANT reading, nested access-path propagation, and Iceberg write restrictions; the existing review context still identifies multiple data-correctness and runtime-failure risks in those paths.
Critical checkpoint conclusions:
- Goal/test proof: the PR aims to support Iceberg VARIANT reads from Parquet with pruning and to block unsupported VARIANT writes. The added tests cover important local/TVF/table-path cases, but existing threads identify several reachable SQL and Parquet-layout cases that are not yet proven.
- Scope/focus: the implementation is broad across FE access collection, BE Parquet pruning/reconstruction, Iceberg type mapping, and merge sinks; the size is understandable for the feature but increases integration risk.
- Concurrency/lifecycle: no distinct new concurrency, lock-order, or static-lifecycle issue found in this pass.
- Configuration/compatibility: no new config item issue found. The feature touches FE-BE thrift/schema contracts and Iceberg/Parquet format behavior; existing threads cover compatibility/contract gaps that should be resolved.
- Parallel code paths: existing threads already flag several parallel-path gaps across local/Hive/Iceberg Parquet, OLAP vs external scans, generator/project/alias propagation, and Iceberg table vs local TVF coverage.
- Special conditions: existing threads cover non-obvious conditions around VARIANT root reads, dynamic subscripts, NULL/OFFSET pseudo paths, residual value fallback, typed-only wrappers, and field-id conversion.
- Test coverage: coverage improved, but existing threads identify missing legal SQL-level, Iceberg table-path, residual/typed mixed, MAP/ARRAY offset/null, and generator/scalar-function cases.
- Results correctness: no new result-file issue found beyond existing test-state/profile and coverage concerns.
- Observability: profile assertions were added and are useful; no additional observability issue found in this pass.
- Transaction/persistence/data writes: no Doris transaction persistence issue found. Existing review context covers Iceberg delete-only MERGE/VARIANT write semantics.
- FE-BE variable passing: existing threads already cover access-path serialization/field-id and merge-sink contract risks; no distinct new instance found.
- Performance: existing OFFSET/null pruning issues cover the main performance/correctness risks I saw; no additional separate performance issue found.
User focus: no additional user-provided review focus was supplied.
/review
run buildall
TPC-DS: Total hot run time: 168514 ms
run buildall
/review
I completed another pass over the full PR at head 87a7378, including the existing review threads and the requested focus file. I did not add duplicate inline comments because the distinct blocking issues I could validate are already captured in the existing inline threads.
Summary opinion: request changes. The PR goal is clear: support reading Iceberg VARIANT data from Parquet and extend nested-column/access-path pruning across FE and BE paths. The implementation is broad and has useful tests, but the existing threads identify multiple remaining correctness blockers in access-path propagation and Parquet VARIANT reconstruction/pruning. In particular, several scenarios can prune away data that is still required for a projected VARIANT value, dynamic access, generator output, scalar whole-VARIANT consumers, MAP/ARRAY level-only reads, or residual shredded VARIANT values. These are data-correctness issues rather than minor style problems.
Critical checkpoint conclusions:
- Goal and tests: the feature goal is partially implemented, but current coverage is not sufficient for the failing end-to-end paths already called out, especially Iceberg table-path behavior and mixed projected/nested VARIANT access.
- Scope and clarity: the change is large and crosses FE access collection, BE Parquet reader/pruning, Iceberg/Hive table paths, and regression data. Some fixes remain too generic and miss expression-specific path semantics.
- Concurrency and lifecycle: I did not identify a new concurrency or lifecycle-specific issue in this pass.
- Configuration and compatibility: no new config was added. Iceberg/Parquet format compatibility remains the relevant concern; existing comments cover cases where accepted layouts are misread or fail at runtime.
- Parallel paths: several parallel paths still need consistent handling, including local/Hive/Iceberg Parquet, OLAP vs file scans, map keys vs values, and generator/scalar expression access propagation.
- Conditional checks and error handling: existing comments include cases where invalid physical column indexes or unsupported typed values can lead to crashes/failures instead of safe behavior.
- Test coverage: more coverage is required for the already-commented edge cases. The focus file had no additional user-provided focus points, so there were no extra focus-specific findings.
- Observability/performance: the intended pruning optimization is useful, but existing comments show some OFFSET/NULL-only paths still read unnecessary payloads or can crash. I did not find a separate observability issue.
Please address the active inline threads before this can be considered safe to merge.
run buildall
/review
I found one additional blocking issue. The PR's main goal is Iceberg/Parquet VARIANT read support with nested access-path pruning while rejecting unsupported paths; the current code mostly targets that goal, but the ORC rejection is incomplete for nested VARIANT types.
Critical checkpoint conclusions:
- Goal/test coverage: top-level Iceberg ORC VARIANT rejection is tested, but nested VARIANT-in-complex ORC coverage is missing and the implementation does not reject it.
- Scope/focus: the change is broad but generally focused on VARIANT Parquet/Iceberg read/pruning and Iceberg write restrictions.
- Concurrency/lifecycle/config: I did not find new concurrency, lifecycle, or configuration risks in the reviewed paths.
- Compatibility: Iceberg VARIANT is newly exposed recursively; unsupported ORC read compatibility must fail early for all exposed shapes.
- Parallel paths: write validation recursively checks Iceberg VARIANT, but read validation only checks top-level Doris VARIANT slots.
- Data correctness/error handling: unsupported nested VARIANT ORC scans can reach BE instead of producing the intended planning error.
- Tests: add a nested complex-type ORC rejection test, for example `STRUCT<VARIANT>`, `ARRAY<VARIANT>`, or a map with VARIANT values.
- Observability/performance/transactions: no additional issue found beyond the unsupported-path validation gap.
User focus: no additional user-provided review focus was supplied.
```java
private void validateVariantReadSupported(String icebergFormat) throws DdlException {
    for (SlotDescriptor slot : desc.getSlots()) {
        if (slot.getColumn().getType().isVariantType()) {
```
This check only rejects slots whose top-level Doris type is VARIANT, but this PR maps Iceberg VARIANT recursively inside complex types. For an ORC Iceberg table with a column like STRUCT<v: VARIANT> (or ARRAY<VARIANT> / map value VARIANT), the slot type is struct/array/map, so isVariantType() is false and the scan reaches BE instead of failing with the Parquet-only error. Please make the read-side validation recursive over complex Doris types (or reuse the Iceberg schema traversal used for write validation) and add ORC coverage for nested VARIANT inside a complex column.
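The recursive check this comment asks for can be sketched over a simplified type tree. The dict-based type encoding is hypothetical, standing in for the Doris type hierarchy:

```python
def contains_variant(dtype):
    # Walk struct/array/map types recursively so STRUCT<VARIANT>,
    # ARRAY<VARIANT>, and maps with VARIANT values are caught, not just
    # a slot whose top-level type is VARIANT.
    kind = dtype["kind"]
    if kind == "variant":
        return True
    if kind == "struct":
        return any(contains_variant(f) for f in dtype["fields"])
    if kind == "array":
        return contains_variant(dtype["element"])
    if kind == "map":
        return contains_variant(dtype["key"]) or contains_variant(dtype["value"])
    return False
```

With this shape, a `struct<v: variant>` ORC column fails validation at planning time rather than reaching BE.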
FE UT Coverage Report: Increment line coverage
Automated review result: request changes.
Critical checkpoint conclusions:
- Goal/test coverage: The PR adds Iceberg/Parquet VARIANT read support, nested pruning, and write rejection. Several paths are covered, but two supported-looking cases remain unguarded or incorrect: ORC nested VARIANT validation and typed VARIANT arrays with null complex elements.
- Scope/focus: The implementation is generally focused on VARIANT Parquet/Iceberg support, but the read validation and reconstruction semantics need tightening before merge.
- Concurrency/lifecycle/config: I did not find new concurrency, lifecycle, or configuration hazards in the reviewed diff.
- Compatibility: Iceberg ORC must fail before reaching unsupported nested VARIANT reads; currently only top-level VARIANT is rejected.
- Parallel code paths: Parquet row-wise typed map reconstruction preserves null values, but typed array reconstruction treats missing/null complex elements as corruption.
- Tests: Additional coverage is needed for ORC Iceberg complex columns containing VARIANT and for typed VARIANT arrays containing null complex elements.
- Observability/transactions/persistence: Not applicable to the identified issues.
- Performance: No separate blocking performance issue found beyond existing review threads.
User focus points: No additional user-provided review focus was specified.
```java
private void validateVariantReadSupported(String icebergFormat) throws DdlException {
    for (SlotDescriptor slot : desc.getSlots()) {
        if (slot.getColumn().getType().isVariantType()) {
```
This validation only checks whether the top-level selected column is VARIANT, but this PR maps Iceberg VARIANT recursively inside complex types. An ORC Iceberg schema such as struct<s: variant> or array<variant> will pass here because the slot column type is STRUCT/ARRAY, then proceed into an unsupported non-Parquet VARIANT read path. Please recursively inspect the selected column type for nested VARIANTs and add ORC Iceberg coverage for complex columns containing VARIANT.
```cpp
PathInDataBuilder element_path;
RETURN_IF_ERROR(shredded_field_to_variant_map(element_schema, element, metadata,
                                              &element_path, &element_values,
                                              &element_present, string_values));
```
This treats a null complex element in a typed VARIANT array as file corruption. For a valid value like [null, {"a":1}] shredded as typed_value: list<optional struct<a:int>>, the null element reaches shredded_field_to_variant_map() with present=false and fails here instead of preserving a null array element. The map path below already materializes absent values as Field(), so arrays need equivalent null-element handling rather than rejecting the row. Please add coverage for typed VARIANT arrays containing null complex elements.
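The requested null-element handling can be sketched with (present, value) pairs standing in for the decoded typed_value array levels; the names are illustrative, not the BE implementation:

```python
def reconstruct_typed_array(elements):
    # elements: (present, value) pairs decoded from a typed_value of
    # list<optional struct<...>>. A non-present element is a genuine
    # null array slot (e.g. the first slot of [null, {"a": 1}]), not
    # file corruption, mirroring how the map path materializes absent
    # values instead of rejecting the row.
    return [value if present else None for present, value in elements]

decoded = reconstruct_typed_array([(False, None), (True, {"a": 1})])
```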
TPC-H: Total hot run time: 31220 ms
TPC-DS: Total hot run time: 169362 ms
### What problem does this PR solve?
Issue Number: N/A
Related PR: #63192
Problem Summary: Doris could not read Iceberg v3 VARIANT columns from Parquet files. This change maps Iceberg VARIANT to Doris VARIANT, validates the Parquet VARIANT wrapper shape from the VariantShredding spec, decodes unshredded metadata/value encoding, reads shredded typed_value columns, and prunes shredded Parquet leaf columns for accessed variant paths with profile observability. Typed-only shredded projections stay on native Parquet typed columns when residual value columns are not selected, while selected residual or complex layouts fall back to row-wise reconstruction. This also keeps VARIANT pruning independent from unsupported Doris nested VARIANT types, preserves ordinary non-VARIANT constructed struct pruning, treats missing shredded array elements as corruption per VariantShredding, keeps explicit Variant null array elements readable through a present wrapper, makes whole-VARIANT scalar consumers such as variant_type(v) force root access when sibling predicates only read subpaths, and rejects non-Parquet Iceberg VARIANT reads during scan planning.
### Release note
Support reading Iceberg v3 VARIANT Parquet columns, including shredded typed_value column pruning and binary/UUID/primitive residual VARIANT values. Writing Iceberg VARIANT columns is rejected with an explicit unsupported error.
### Check List (For Author)
- Test: Regression test / Unit Test / Manual test
- Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.DirectTypedOnlyReaderCountersUseNativePath:ParquetVariantReaderTest.VariantReaderCountersUseRowWiseWhenResidualValueSelected:ParquetVariantReaderTest.RowWisePreservesExplicitVariantNullShreddedArrayElement:ParquetVariantReaderTest.RowWiseRejectsMissingShreddedArrayElement' (4 tests passed)
- Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.*' (85 tests passed on rerun; the first attempt failed before tests in OpenBLAS CMake getarch bootstrap)
- Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.*:NestedColumnAccessHelperTest.*' (127 tests passed)
- Unit Test: ./run-be-ut.sh --run --filter='IcebergReaderCreateColumnIdsTest.*' (9 tests passed)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest (66 tests passed; Maven reactor succeeded before the final variant_type review fix)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest#testVariantTypeWholeExpressionWithPredicateAccessPath (1 test passed; Maven reactor succeeded. A prior full-class rerun after the final fix hit a local NoClassDefFoundError for TimeUtils before test assertions; the targeted rerun succeeded after classes were regenerated.)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.datasource.iceberg.source.IcebergScanNodeTest (5 tests passed; Maven reactor succeeded)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.VariantPruningLogicTest (11 tests passed; Maven reactor succeeded)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.datasource.iceberg.IcebergUtilsTest (passed)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.SlotTypeReplacerTest (5 tests passed)
- Regression test: performance regression coverage is included in regression-test/suites/external_table_p0/tvf/test_local_tvf_iceberg_variant.groovy, including profile assertions that typed-only projections increment VariantDirectTypedValueReadRows and keep VariantRowWiseReadRows at 0. Not run locally in this worktree because no local Doris cluster/output BE+FE runtime is available.
- Regression test: Added regression-test/suites/external_table_p0/iceberg/test_iceberg_variant_table_path.groovy to exercise the Iceberg REST catalog table path with nested VARIANT access and profile read-column assertions. Not run locally because Docker access to spark-iceberg is unavailable in this worktree.
- Manual test: PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:$PATH build-support/clang-format.sh
- Manual test: PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:$PATH build-support/check-format.sh
- Manual test: git diff --check
- Manual test: cd fe && mvn -pl fe-core checkstyle:check -DskipTests
- Static analysis: build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN was attempted on the current C++ diff. The default PATH had no clang-tidy; clang-tidy-16 could not analyze the changed BE files because the included be/src/util/jni-util.h triggers pre-existing static_assert(false) analyzer errors; clang-tidy-20 with an explicit resource-dir was blocked by a pre-existing clang-tidy-nolint error in be/src/core/types.h.
- Behavior changed: Yes. Doris can read Iceberg v3 VARIANT Parquet columns, supports typed-only shredded projection pruning on native typed columns, reconstructs selected residual or complex layouts row-wise, rejects malformed shredded array elements with missing payload wrappers, preserves explicit Variant null array elements, forces root access for whole-VARIANT scalar consumers such as variant_type(v), rejects Iceberg VARIANT reads from non-Parquet file formats during scan planning, and rejects Iceberg VARIANT data-file writes explicitly.
- Does this need documentation: No