[feature](iceberg) Support reading Iceberg variant from Parquet #63192
eldenmoon wants to merge 1 commit into
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
/review
cab85b8 to e9e3bfd (Compare)
run buildall
/review
OpenCode automated review failed and did not complete. Error: the review step was skipped (possibly a timeout or cancellation). Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
/review
Summary:
I found one blocking issue in the added regression test: the local single-BE copy path does not match the file_path used by the local TVF, so the test can fail in the exact environment that branch is intended to support.
Critical checkpoint conclusions:
- Goal: add Iceberg/Parquet VARIANT read support, including shredded projection. The implementation and regression coverage mostly target that goal, but the new regression test has a path setup bug.
- Scope: the production changes are focused on Parquet schema parsing, variant reconstruction, column pruning, and Iceberg type mapping.
- Concurrency/lifecycle: no new shared mutable concurrent state or non-obvious lifecycle ownership issue found in the reviewed PR diff.
- Configuration/compatibility: no new config items or persisted storage-format changes found; FE/BE type mapping paths for Iceberg VARIANT are updated.
- Parallel paths: Hive, Iceberg, and local Parquet pruning paths were considered; the test issue is distinct from production pruning logic.
- Tests: regression coverage was added, but the local-file staging logic can make the new test fail before validating the feature.
- Observability/performance: added ParquetReadColumnPaths profile string is useful for validating pruning; no blocking observability or hot-path issue found beyond the test blocker.
User focus: no additional user-provided review focus was present.
I found a correctness blocker in the new shredded VARIANT pruning logic. The implementation prunes the unshredded value leaf whenever a matching typed_value path exists, but Iceberg shredded VARIANT can still carry residual/unrepresentable values for that field in value, so queries can silently return NULL or partial objects for those rows.
Critical checkpoint conclusions:
- Goal/test: the PR adds Iceberg v3 VARIANT reading and pruning tests, but the tests only cover fully typed shredded fields and do not prove residual fallback correctness.
- Scope/focus: the change is mostly focused on Parquet/Iceberg VARIANT support.
- Concurrency/lifecycle/config: no new concurrency, non-trivial lifecycle, or config behavior found in the reviewed PR diff.
- Compatibility: adds new type mapping; no storage-format persistence changes found.
- Parallel paths: the same pruning issue exists in both standalone/Hive Parquet and Iceberg Parquet helper paths.
- Tests: missing mixed shredded/residual cases where a selected typed path is absent or has an incompatible type in `typed_value` but exists in `value`.
- Observability/performance: profile string helps inspect selected leaves; no additional blocking observability issue found.
- Data correctness: blocking issue below can cause incorrect query results after column pruning.
No additional user-provided focus points were present.
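The residual-fallback rule behind this finding can be sketched in Python. This is an illustrative model of the Parquet Variant Shredding semantics, not Doris code; the function and row names are hypothetical:

```python
def reconstruct_shredded_field(typed_value, residual_value):
    # Per the Parquet Variant Shredding spec, at most one of
    # typed_value / value is non-null for a given row of a shredded field.
    if typed_value is not None:
        return typed_value      # row was representable as the typed type
    return residual_value       # residual row: requires the 'value' leaf

rows = [
    {"typed_value": 42, "value": None},     # fully typed row
    {"typed_value": None, "value": "n/a"},  # unrepresentable value kept in residual
]

# Correct read: keep both leaves and reconstruct per row.
correct = [reconstruct_shredded_field(r["typed_value"], r["value"]) for r in rows]

# The pruning described above drops the 'value' leaf whenever typed_value
# exists in the schema, silently turning residual rows into NULL.
pruned = [reconstruct_shredded_field(r["typed_value"], None) for r in rows]
```

The second list shows the silent-NULL failure mode: the typed row survives, but the residual row degrades.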
e9e3bfd to 5fe9ca5 (Compare)
run buildall
/review
1 similar comment
/review
Reviewed the full PR with the Doris code-review checklist. I found a blocking correctness issue in nested VARIANT pruning for deeper accesses under shredded fields. Goal/test: the PR adds Iceberg/Parquet VARIANT read support and includes a regression for top-level shredded field reads, but it does not cover deeper access through a shredded field residual. Scope: the change is focused, but the duplicated Hive/Iceberg pruning helper needs the same fix. Concurrency/lifecycle/config/transaction/persistence: no new concurrency, lifecycle, config, transaction, storage-format write, or persistence concerns found. Parallel paths: the same issue exists in both Hive/local Parquet and Iceberg Parquet helpers. Tests: existing tests cover unshredded reads and top-level shredded field pruning, but should add a mixed residual case such as v['metric']['x'] where metric is shredded and some rows store an object in typed_value.metric.value. Observability: the new profile string is useful for validating selected leaves. User focus: no additional user-provided review focus was supplied.
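The deeper-access case such as v['metric']['x'] can be sketched as follows; this is an illustrative model with hypothetical names, under the assumption that a shredded field's residual `value` may itself hold a decoded object:

```python
def read_subfield(typed_value_fields, residual_value, key):
    # A shredded field like 'metric' may expose sub-keys through its own
    # shredded typed_value group, or store a whole object in its residual
    # 'value'. Deeper access such as v['metric']['x'] must consult both,
    # so the residual leaf cannot be pruned away.
    if typed_value_fields is not None and key in typed_value_fields:
        return typed_value_fields[key]
    if isinstance(residual_value, dict):
        return residual_value.get(key)
    return None
```

Rows whose 'metric' object landed in the residual still resolve 'x' through the second branch.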
Review result: request changes.
Critical checkpoint conclusions:
- Goal/test coverage: the PR adds Iceberg VARIANT Parquet reading, type mapping, pruning observability, and regression/unit coverage. The main scenario is covered, but a case-sensitive key path is not covered and currently regresses correctness.
- Scope/focus: the change is focused on Iceberg/Parquet VARIANT support, though duplicated Hive/Iceberg pruning helpers carry the same issue.
- Concurrency/lifecycle: no new shared mutable state, threads, locks, or static initialization hazards found in the reviewed paths.
- Configuration/compatibility: no new configs or storage-format writes; this is a reader/type-mapping change. Mixed files with non-VARIANT types continue through existing paths.
- Parallel paths: the Hive/local and Iceberg Parquet pruning paths were both reviewed; both have the same case-sensitivity bug and are commented separately.
- Error handling/memory: Status returns in the new reader path are generally propagated; no ignored Status or untracked large persistent allocation issue found beyond the correctness issue raised.
- Data correctness: blocking issue found: shredded VARIANT field lookup lowercases user path components, so distinct keys such as `a` and `A` can be pruned/read as the same field.
- Observability/performance: the profile leaf-path observable is useful; no additional blocker found.
User focus: no additional user-provided review focus was specified.
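The case-sensitivity bug can be reduced to a small sketch. The helper below is hypothetical and only models the lookup semantics, assuming variant object keys are byte-wise case-sensitive:

```python
def select_shredded_leaves(shredded_keys, path_key, case_insensitive):
    # Variant object keys are case-sensitive; normalizing case during
    # field lookup conflates distinct keys such as 'a' and 'A'.
    if case_insensitive:
        return [k for k in shredded_keys if k.lower() == path_key.lower()]
    return [k for k in shredded_keys if k == path_key]

buggy = select_shredded_leaves(["a", "A"], "A", case_insensitive=True)
exact = select_shredded_leaves(["a", "A"], "A", case_insensitive=False)
```

The lowercasing variant matches both keys for a lookup of `A`, while the exact match selects only the intended one.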
5fe9ca5 to fa098c0 (Compare)
run buildall
/review
TPC-H: Total hot run time: 29804 ms
TPC-H: Total hot run time: 29785 ms
FE UT Coverage Report: Increment line coverage
I found two additional correctness issues in the Parquet VARIANT reconstruction path.
Critical checkpoint conclusions:
- Goal/test: The PR adds Iceberg v3 Parquet VARIANT reading and shredded column pruning, with regression and FE unit coverage for the common unshredded/shredded object paths. However, valid shredded array layouts and non-finite shredded floats are not covered and can decode incorrectly or fail.
- Scope/focus: The implementation is mostly focused on the new reader/pruning behavior, though the duplicated Hive/Iceberg pruning helper logic remains a maintainability risk rather than a blocker.
- Concurrency/lifecycle: The reviewed changes are per-reader/per-query state and do not introduce new shared mutable state, locks, background threads, or special static lifecycle dependencies.
- Config/compatibility: No new config items or storage-format writes are introduced. The change reads a standard Parquet/Iceberg format; mixed-version compatibility concerns are limited to reader capability.
- Parallel paths: Hive/local and Iceberg pruning paths are both updated. Standalone Parquet uses the Hive-style name pruning helper.
- Conditional checks: The main conditional logic around shredded `value`/`typed_value` follows the Parquet Variant Shredding spec for object fields, but the array element check currently misses a spec-valid layout.
- Tests/results: Existing tests cover top-level object shredding, deeper residual paths, case-sensitive keys, and profile observability. Missing coverage remains for arrays whose element group omits `value`, and for NaN/Inf in shredded float/double typed values.
- Observability: The added `ParquetReadColumnPaths` profile string is useful for pruning verification and appears lightweight.
- Transactions/persistence/data writes: Not applicable; this is read-path only.
- FE/BE variable passing: Iceberg type mapping and access-path rewriting are updated for VARIANT; no additional thrift variable propagation issue found.
- Performance: The JSON reconstruction path is inherently allocation-heavy but limited to VARIANT decoding. No additional performance blocker found beyond the correctness issues below.
No additional user-provided review focus was specified.
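The non-finite float concern comes from JSON itself: RFC 8259 JSON has no NaN/Infinity tokens, so a JSON-string reconstruction of a shredded float/double typed_value must special-case them. The sketch below is illustrative; the choice to emit null for non-finite values is one possible policy, not the PR's actual behavior:

```python
import json
import math

def shredded_float_to_json(x):
    # Standard JSON cannot represent NaN or +/-Infinity, so the
    # reconstruction must handle them explicitly instead of emitting
    # non-standard 'NaN'/'Infinity' literals or failing mid-scan.
    if math.isnan(x) or math.isinf(x):
        return "null"  # assumed policy for this sketch
    return json.dumps(x)
```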
TPC-DS: Total hot run time: 170353 ms
TPC-DS: Total hot run time: 169901 ms
fa098c0 to bf0c548 (Compare)
/review
run buildall
I found additional blocking compatibility issues for valid Parquet VARIANT shredding layouts that omit value when data is fully represented by typed_value. Existing review threads already cover non-finite typed floats, typed array elements, case-sensitive keys, and value/residual pruning, so I did not duplicate those.
Critical checkpoints:
- Goal/test: the PR aims to read Iceberg/Parquet VARIANT, including shredded layouts, and adds local TVF regression coverage, but coverage does not include typed-value-only top-level or nested shredded field groups.
- Scope/focus: the change is focused, but the schema/pruning logic is stricter than the Parquet shredding layout it is trying to support.
- Concurrency/lifecycle/config/transactions/persistence: no new concurrency, lifecycle, config, transaction, or persistence concerns found in the reviewed paths.
- Parallel paths: Hive/local and Iceberg pruning have duplicated logic; both need the typed-value-only fix.
- Compatibility/data correctness: current code rejects or prunes away valid typed-value-only shredded data, causing scan failure or null/missing results.
- Tests: existing tests cover unshredded/shredded happy paths and several pruning observables, but miss the typed-value-only layouts described in the inline comments.
- Observability/performance: no additional observability or performance blocker found beyond the added profile string being used by tests.
- User focus: no additional user-provided review focus was present.
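The typed-value-only compatibility point reduces to a shape check. A minimal sketch, assuming the Variant Shredding rule that a shredded field group may contain `value`, `typed_value`, or both:

```python
def shredded_group_shape_ok(has_value, has_typed_value):
    # A shredded field group is valid with 'value' only, 'typed_value'
    # only, or both. Requiring 'value' to be present (as the stricter
    # logic described above does) rejects valid typed-value-only files.
    return has_value or has_typed_value
```

A validator that instead demands `has_value and has_typed_value` is the failure mode this comment flags.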
TPC-H: Total hot run time: 29314 ms
run buildall
BE UT Coverage Report: Increment line coverage / Increment coverage report
I completed a full review of the PR diff and the existing review context. I did not find additional distinct inline issues beyond the already-open review threads, so I am not duplicating those comments. However, the existing threads still describe multiple blocking correctness risks in the VARIANT Parquet/Iceberg read and pruning paths, including missed whole-root VARIANT demands, false wrapper classification, residual/typed-value reconstruction gaps, MAP/ARRAY offset/null access handling, Iceberg field-id/path conversion edge cases, and test coverage gaps for reachable SQL/Iceberg table paths.
Critical checkpoint conclusions:
- Goal/test: the PR aims to support reading Iceberg VARIANT from Parquet with nested pruning and to reject unsupported Iceberg VARIANT writes. The implementation has broad tests, but existing review threads identify missing or unreachable coverage for several legal runtime paths.
- Scope/focus: the change is large and spans FE access-path collection, BE Parquet readers, Iceberg/Hive column-id pruning, write-path validation, and regressions; the size is understandable for the feature but increases risk.
- Concurrency/lifecycle: I did not identify a new distinct concurrency or lifecycle issue in the reviewed paths.
- Config/compatibility: no new config issue found. Compatibility concerns are mainly around Iceberg/Parquet path interpretation and write rejection behavior already called out.
- Parallel paths: existing threads cover several parallel-path gaps between local/Hive/Iceberg Parquet and FE/BE access path handling.
- Error handling/data correctness: existing threads include concrete data correctness and failure scenarios that can return pruned/missing VARIANT data or fail valid files.
- Tests/results: tests were added, but existing threads identify important missing coverage and at least some tests that do not exercise the intended reachable path.
- Observability/performance: no additional observability blocker found beyond the profile/test-state issue already raised.
- User focus: no additional user-provided review focus was supplied.
Given the unresolved correctness blockers already recorded inline, I cannot approve this PR yet.
FE UT Coverage Report: Increment line coverage
TPC-H: Total hot run time: 31250 ms
TPC-H: Total hot run time: 31098 ms
/review
run buildall
BE UT Coverage Report: Increment line coverage / Increment coverage report
TPC-DS: Total hot run time: 169133 ms
I found one remaining correctness issue in the FE access-path collection for VARIANT scalar consumers.
Critical checkpoint conclusions:
- Goal/test: the PR aims to support Iceberg/Parquet VARIANT reads and nested pruning; the new tests cover many projection/filter/pruning cases, but not scalar functions that consume a whole VARIANT while returning a non-VARIANT type.
- Scope/focus: the changes are broad across FE pruning and BE readers; the issue below is localized to access-path collection.
- Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, config, or persisted-format concern was identified for this finding.
- Parallel paths: output VARIANT expressions, casts, generators, dynamic subscripts, and aliases have special handling, but ordinary non-VARIANT scalar functions over VARIANT inputs do not.
- Test coverage: missing regression coverage combining a whole-root scalar consumer such as variant_type(v) with a sibling nested predicate.
- Data correctness/performance/observability: the issue is data correctness; it can make scans read only a selected subpath even though the scalar function needs the root value. No extra observability concern beyond fixing/covering this path.
User focus: no additional user-provided review focus was specified.
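The whole-VARIANT consumer issue can be modeled as an access-path collection sketch. The expression encoding below is hypothetical, not the Nereids representation; it only shows why a non-VARIANT-returning scalar function over a VARIANT input must demand the root path:

```python
def collect_access_paths(exprs):
    # A subpath access contributes its key path; a scalar function that
    # consumes the whole VARIANT, even when it returns a non-VARIANT type
    # (e.g. variant_type(v)), must demand the root, otherwise pruning
    # keeps only the sibling subpaths used by predicates.
    paths = set()
    for e in exprs:
        if e["kind"] == "subpath":
            paths.add(tuple(e["path"]))
        elif e["kind"] == "whole_variant_consumer":
            paths.add(())  # empty tuple stands for root access
    return paths

demanded = collect_access_paths([
    {"kind": "subpath", "path": ["a", "b"]},  # sibling predicate on v['a']['b']
    {"kind": "whole_variant_consumer"},       # e.g. variant_type(v)
])
```

If the second expression kind is not handled, the scan reads only `v['a']['b']` even though the scalar function needs the full root value.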
TPC-DS: Total hot run time: 168985 ms
run buildall
/review
TPC-H: Total hot run time: 30718 ms
I completed another full pass over the PR with the Doris code-review checklist and the existing review threads as prior context. I did not add new inline comments because the blocking issues I found are already covered by existing threads, and duplicating them would not add value.
Summary opinion: request changes until the existing correctness threads are resolved. The current change is a large cross-FE/BE feature around Iceberg/Parquet VARIANT reading, nested access-path propagation, and Iceberg write restrictions; the existing review context still identifies multiple data-correctness and runtime-failure risks in those paths.
Critical checkpoint conclusions:
- Goal/test proof: the PR aims to support Iceberg VARIANT reads from Parquet with pruning and to block unsupported VARIANT writes. The added tests cover important local/TVF/table-path cases, but existing threads identify several reachable SQL and Parquet-layout cases that are not yet proven.
- Scope/focus: the implementation is broad across FE access collection, BE Parquet pruning/reconstruction, Iceberg type mapping, and merge sinks; the size is understandable for the feature but increases integration risk.
- Concurrency/lifecycle: no distinct new concurrency, lock-order, or static-lifecycle issue found in this pass.
- Configuration/compatibility: no new config item issue found. The feature touches FE-BE thrift/schema contracts and Iceberg/Parquet format behavior; existing threads cover compatibility/contract gaps that should be resolved.
- Parallel code paths: existing threads already flag several parallel-path gaps across local/Hive/Iceberg Parquet, OLAP vs external scans, generator/project/alias propagation, and Iceberg table vs local TVF coverage.
- Special conditions: existing threads cover non-obvious conditions around VARIANT root reads, dynamic subscripts, NULL/OFFSET pseudo paths, residual value fallback, typed-only wrappers, and field-id conversion.
- Test coverage: coverage improved, but existing threads identify missing legal SQL-level, Iceberg table-path, residual/typed mixed, MAP/ARRAY offset/null, and generator/scalar-function cases.
- Results correctness: no new result-file issue found beyond existing test-state/profile and coverage concerns.
- Observability: profile assertions were added and are useful; no additional observability issue found in this pass.
- Transaction/persistence/data writes: no Doris transaction persistence issue found. Existing review context covers Iceberg delete-only MERGE/VARIANT write semantics.
- FE-BE variable passing: existing threads already cover access-path serialization/field-id and merge-sink contract risks; no distinct new instance found.
- Performance: existing OFFSET/null pruning issues cover the main performance/correctness risks I saw; no additional separate performance issue found.
User focus: no additional user-provided review focus was supplied.
/review
run buildall
TPC-DS: Total hot run time: 168514 ms
run buildall
/review
I completed another pass over the full PR at head 87a7378, including the existing review threads and the requested focus file. I did not add duplicate inline comments because the distinct blocking issues I could validate are already captured in the existing inline threads.
Summary opinion: request changes. The PR goal is clear: support reading Iceberg VARIANT data from Parquet and extend nested-column/access-path pruning across FE and BE paths. The implementation is broad and has useful tests, but the existing threads identify multiple remaining correctness blockers in access-path propagation and Parquet VARIANT reconstruction/pruning. In particular, several scenarios can prune away data that is still required for a projected VARIANT value, dynamic access, generator output, scalar whole-VARIANT consumers, MAP/ARRAY level-only reads, or residual shredded VARIANT values. These are data-correctness issues rather than minor style problems.
Critical checkpoint conclusions:
- Goal and tests: the feature goal is partially implemented, but current coverage is not sufficient for the failing end-to-end paths already called out, especially Iceberg table-path behavior and mixed projected/nested VARIANT access.
- Scope and clarity: the change is large and crosses FE access collection, BE Parquet reader/pruning, Iceberg/Hive table paths, and regression data. Some fixes remain too generic and miss expression-specific path semantics.
- Concurrency and lifecycle: I did not identify a new concurrency or lifecycle-specific issue in this pass.
- Configuration and compatibility: no new config was added. Iceberg/Parquet format compatibility remains the relevant concern; existing comments cover cases where accepted layouts are misread or fail at runtime.
- Parallel paths: several parallel paths still need consistent handling, including local/Hive/Iceberg Parquet, OLAP vs file scans, map keys vs values, and generator/scalar expression access propagation.
- Conditional checks and error handling: existing comments include cases where invalid physical column indexes or unsupported typed values can lead to crashes/failures instead of safe behavior.
- Test coverage: more coverage is required for the already-commented edge cases. The focus file had no additional user-provided focus points, so there were no extra focus-specific findings.
- Observability/performance: the intended pruning optimization is useful, but existing comments show some OFFSET/NULL-only paths still read unnecessary payloads or can crash. I did not find a separate observability issue.
Please address the active inline threads before this can be considered safe to merge.
run buildall
/review
I found one additional blocking issue. The PR's main goal is Iceberg/Parquet VARIANT read support with nested access-path pruning while rejecting unsupported paths; the current code mostly targets that goal, but the ORC rejection is incomplete for nested VARIANT types.
Critical checkpoint conclusions:
- Goal/test coverage: top-level Iceberg ORC VARIANT rejection is tested, but nested VARIANT-in-complex ORC coverage is missing and the implementation does not reject it.
- Scope/focus: the change is broad but generally focused on VARIANT Parquet/Iceberg read/pruning and Iceberg write restrictions.
- Concurrency/lifecycle/config: I did not find new concurrency, lifecycle, or configuration risks in the reviewed paths.
- Compatibility: Iceberg VARIANT is newly exposed recursively; unsupported ORC read compatibility must fail early for all exposed shapes.
- Parallel paths: write validation recursively checks Iceberg VARIANT, but read validation only checks top-level Doris VARIANT slots.
- Data correctness/error handling: unsupported nested VARIANT ORC scans can reach BE instead of producing the intended planning error.
- Tests: add a nested complex-type ORC rejection test, for example `STRUCT<VARIANT>`, `ARRAY<VARIANT>`, or a map with VARIANT values.
- Observability/performance/transactions: no additional issue found beyond the unsupported-path validation gap.
User focus: no additional user-provided review focus was supplied.
```java
private void validateVariantReadSupported(String icebergFormat) throws DdlException {
    for (SlotDescriptor slot : desc.getSlots()) {
        if (slot.getColumn().getType().isVariantType()) {
```
This check only rejects slots whose top-level Doris type is VARIANT, but this PR maps Iceberg VARIANT recursively inside complex types. For an ORC Iceberg table with a column like STRUCT<v: VARIANT> (or ARRAY<VARIANT> / map value VARIANT), the slot type is struct/array/map, so isVariantType() is false and the scan reaches BE instead of failing with the Parquet-only error. Please make the read-side validation recursive over complex Doris types (or reuse the Iceberg schema traversal used for write validation) and add ORC coverage for nested VARIANT inside a complex column.
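The recursive check this comment asks for can be sketched over a simplified type tree. The dict-based type encoding is hypothetical, standing in for the Doris type hierarchy:

```python
def contains_variant(dtype):
    # Walk struct/array/map types recursively so STRUCT<VARIANT>,
    # ARRAY<VARIANT>, and maps with VARIANT values are caught, not just
    # a slot whose top-level type is VARIANT.
    kind = dtype["kind"]
    if kind == "variant":
        return True
    if kind == "struct":
        return any(contains_variant(f) for f in dtype["fields"])
    if kind == "array":
        return contains_variant(dtype["element"])
    if kind == "map":
        return contains_variant(dtype["key"]) or contains_variant(dtype["value"])
    return False
```

With this shape, a `struct<v: variant>` ORC column fails validation at planning time rather than reaching BE.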
FE UT Coverage Report: Increment line coverage
Automated review result: request changes.
Critical checkpoint conclusions:
- Goal/test coverage: The PR adds Iceberg/Parquet VARIANT read support, nested pruning, and write rejection. Several paths are covered, but two supported-looking cases remain unguarded or incorrect: ORC nested VARIANT validation and typed VARIANT arrays with null complex elements.
- Scope/focus: The implementation is generally focused on VARIANT Parquet/Iceberg support, but the read validation and reconstruction semantics need tightening before merge.
- Concurrency/lifecycle/config: I did not find new concurrency, lifecycle, or configuration hazards in the reviewed diff.
- Compatibility: Iceberg ORC must fail before reaching unsupported nested VARIANT reads; currently only top-level VARIANT is rejected.
- Parallel code paths: Parquet row-wise typed map reconstruction preserves null values, but typed array reconstruction treats missing/null complex elements as corruption.
- Tests: Additional coverage is needed for ORC Iceberg complex columns containing VARIANT and for typed VARIANT arrays containing null complex elements.
- Observability/transactions/persistence: Not applicable to the identified issues.
- Performance: No separate blocking performance issue found beyond existing review threads.
User focus points: No additional user-provided review focus was specified.
```java
private void validateVariantReadSupported(String icebergFormat) throws DdlException {
    for (SlotDescriptor slot : desc.getSlots()) {
        if (slot.getColumn().getType().isVariantType()) {
```
This validation only checks whether the top-level selected column is VARIANT, but this PR maps Iceberg VARIANT recursively inside complex types. An ORC Iceberg schema such as struct<s: variant> or array<variant> will pass here because the slot column type is STRUCT/ARRAY, then proceed into an unsupported non-Parquet VARIANT read path. Please recursively inspect the selected column type for nested VARIANTs and add ORC Iceberg coverage for complex columns containing VARIANT.
```cpp
PathInDataBuilder element_path;
RETURN_IF_ERROR(shredded_field_to_variant_map(element_schema, element, metadata,
                                              &element_path, &element_values,
                                              &element_present, string_values));
```
This treats a null complex element in a typed VARIANT array as file corruption. For a valid value like [null, {"a":1}] shredded as typed_value: list<optional struct<a:int>>, the null element reaches shredded_field_to_variant_map() with present=false and fails here instead of preserving a null array element. The map path below already materializes absent values as Field(), so arrays need equivalent null-element handling rather than rejecting the row. Please add coverage for typed VARIANT arrays containing null complex elements.
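The requested null-element handling can be sketched with (present, value) pairs standing in for the decoded typed_value array levels; the names are illustrative, not the BE implementation:

```python
def reconstruct_typed_array(elements):
    # elements: (present, value) pairs decoded from a typed_value of
    # list<optional struct<...>>. A non-present element is a genuine
    # null array slot (e.g. the first slot of [null, {"a": 1}]), not
    # file corruption, mirroring how the map path materializes absent
    # values instead of rejecting the row.
    return [value if present else None for present, value in elements]

decoded = reconstruct_typed_array([(False, None), (True, {"a": 1})])
```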
TPC-H: Total hot run time: 31220 ms
TPC-DS: Total hot run time: 169362 ms
### What problem does this PR solve?
Issue Number: N/A
Related PR: #63192
Problem Summary: Doris could not read Iceberg v3 VARIANT columns from Parquet files. This change maps Iceberg VARIANT to Doris VARIANT, validates the Parquet VARIANT wrapper shape from the VariantShredding spec, decodes unshredded metadata/value encoding, reads shredded typed_value columns, and prunes shredded Parquet leaf columns for accessed variant paths with profile observability. Typed-only shredded projections stay on native Parquet typed columns when residual value columns are not selected, while selected residual or complex layouts fall back to row-wise reconstruction. This also keeps VARIANT pruning independent from unsupported Doris nested VARIANT types, preserves ordinary non-VARIANT constructed struct pruning, treats missing shredded array elements as corruption per VariantShredding, keeps explicit Variant null array elements readable through a present wrapper, makes whole-VARIANT scalar consumers such as variant_type(v) force root access when sibling predicates only read subpaths, and rejects non-Parquet Iceberg VARIANT reads during scan planning.
### Release note
Support reading Iceberg v3 VARIANT Parquet columns, including shredded typed_value column pruning and binary/UUID/primitive residual VARIANT values. Writing Iceberg VARIANT columns is rejected with an explicit unsupported error.
### Check List (For Author)
- Test: Regression test / Unit Test / Manual test
- Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.DirectTypedOnlyReaderCountersUseNativePath:ParquetVariantReaderTest.VariantReaderCountersUseRowWiseWhenResidualValueSelected:ParquetVariantReaderTest.RowWisePreservesExplicitVariantNullShreddedArrayElement:ParquetVariantReaderTest.RowWiseRejectsMissingShreddedArrayElement' (4 tests passed)
- Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.*' (85 tests passed on rerun; the first attempt failed before tests in OpenBLAS CMake getarch bootstrap)
- Unit Test: ./run-be-ut.sh --run --filter='ParquetVariantReaderTest.*:NestedColumnAccessHelperTest.*' (127 tests passed)
- Unit Test: ./run-be-ut.sh --run --filter='IcebergReaderCreateColumnIdsTest.*' (9 tests passed)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest (66 tests passed; Maven reactor succeeded before the final variant_type review fix)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest#testVariantTypeWholeExpressionWithPredicateAccessPath (1 test passed; Maven reactor succeeded. A prior full-class rerun after the final fix hit a local NoClassDefFoundError for TimeUtils before test assertions; the targeted rerun succeeded after classes were regenerated.)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.datasource.iceberg.source.IcebergScanNodeTest (5 tests passed; Maven reactor succeeded)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.VariantPruningLogicTest (11 tests passed; Maven reactor succeeded)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.datasource.iceberg.IcebergUtilsTest (passed)
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.SlotTypeReplacerTest (5 tests passed)
- Regression test: performance regression coverage is included in regression-test/suites/external_table_p0/tvf/test_local_tvf_iceberg_variant.groovy, including profile assertions that typed-only projections increment VariantDirectTypedValueReadRows and keep VariantRowWiseReadRows at 0. Not run locally in this worktree because no local Doris cluster/output BE+FE runtime is available.
- Regression test: Added regression-test/suites/external_table_p0/iceberg/test_iceberg_variant_table_path.groovy to exercise the Iceberg REST catalog table path with nested VARIANT access and profile read-column assertions. Not run locally because Docker access to spark-iceberg is unavailable in this worktree.
- Manual test: PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:$PATH build-support/clang-format.sh
- Manual test: PATH=/mnt/disk6/common/ldb_toolchain_toucan/bin:$PATH build-support/check-format.sh
- Manual test: git diff --check
- Manual test: cd fe && mvn -pl fe-core checkstyle:check -DskipTests
- Static analysis: build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN was attempted on the current C++ diff. The default PATH had no clang-tidy; clang-tidy-16 could not analyze the changed BE files because the included be/src/util/jni-util.h triggers pre-existing static_assert(false) analyzer errors; clang-tidy-20 with an explicit resource-dir was blocked by a pre-existing clang-tidy-nolint error in be/src/core/types.h.
- Behavior changed: Yes. Doris can read Iceberg v3 VARIANT Parquet columns, supports typed-only shredded projection pruning on native typed columns, reconstructs selected residual or complex layouts row-wise, rejects malformed shredded array elements with missing payload wrappers, preserves explicit Variant null array elements, forces root access for whole-VARIANT scalar consumers such as variant_type(v), rejects Iceberg VARIANT reads from non-Parquet file formats during scan planning, and rejects Iceberg VARIANT data-file writes explicitly.
- Does this need documentation: No