[fix](be) Preserve Variant decimal precision #63322

Open
eldenmoon wants to merge 1 commit into apache:master from eldenmoon:codex/cir-20240-variant-precision-pr

@eldenmoon
Member

What problem does this PR solve?

Issue Number: CIR-20240

Related PR: None

Problem Summary: Variant JSON parsing materialized high precision decimal-like numbers as double before typed decimal paths were cast, which lost precision. Preserve raw numeric tokens for decimal-sensitive Variant path extraction and cast typed decimal paths from exact text while keeping Variant nullable array structure.
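
To illustrate the kind of loss described above, here is a minimal sketch in standalone C++. The helper name `collapses_to_same_double` is hypothetical and is not part of Doris; it only shows that two different numeric tokens can become indistinguishable once they are parsed as double.

```cpp
#include <cstdlib>
#include <string>

// Returns true when two JSON number tokens parse to the same double, meaning
// the digits that distinguish them cannot survive a trip through double.
bool collapses_to_same_double(const std::string& a, const std::string& b) {
    return std::strtod(a.c_str(), nullptr) == std::strtod(b.c_str(), nullptr);
}
```

For example, `collapses_to_same_double("1.00000000000000000001", "1")` is true, so a decimal cast that starts from the double can no longer recover the trailing digits, while a cast that starts from the raw text can.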

Release note

Fix precision loss on Variant typed decimal paths when parsing high-precision JSON numbers.

Check List (For Author)

  • Test:
    • Build: BUILD_TYPE=ASAN USE_MEM_TRACKER=ON ./build.sh --be --fe
    • Unit Test: ./run-be-ut.sh --run --filter='JsonParserTest.*:ColumnVariantTest.parse_high_precision_decimal_for_typed_path'
    • Regression test: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY NO_PROXY=127.0.0.1,localhost,10.26.20.3 no_proxy=127.0.0.1,localhost,10.26.20.3 ./run-regression-test.sh --conf tmp/regression-conf.auto.groovy --run -d variant_p0/predefine -s test_variant_high_precision_decimal
    • Format Check: build-support/clang-format.sh; build-support/check-format.sh
    • Static Check: build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (attempted; failed due to existing/environment diagnostics, including a missing stddef.h and unrelated existing readability-function-size warnings)
  • Behavior changed: Yes (Variant typed decimal paths are now cast from the preserved raw JSON number text instead of a rounded double for precision-sensitive numeric tokens)
  • Does this need documentation: No

Copilot AI review requested due to automatic review settings May 17, 2026 09:38
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon
Member Author

run buildall

@eldenmoon
Member Author

/review

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes precision loss for Variant typed decimal extraction by preserving high-precision JSON numeric tokens as exact text during parsing, and ensuring typed-path casts keep Variant’s nullable/array nullability semantics. It also adds targeted unit/regression coverage for high-precision decimals in both scalar and array paths.

Changes:

  • Preserve raw JSON numeric tokens (via simdjson on-demand parsing) and materialize precision-sensitive numbers as strings for later exact decimal casts.
  • Preserve Variant nullability (including nested array nullability) when converting typed paths to storage types.
  • Add unit + regression tests covering high-precision decimals in Variant scalar and array typed paths.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

File | Description
--- | ---
regression-test/suites/variant_p0/predefine/test_variant_high_precision_decimal.groovy | New regression test validating high-precision decimal preservation for Variant typed paths (scalar + array).
be/test/core/jsonb/json_parser_test.cpp | Unit tests asserting high-precision numeric tokens are preserved as string fields (including inside arrays).
be/test/core/column/column_variant_test.cpp | Unit test validating typed-path conversion + serialization preserves decimal text/precision for scalars and arrays.
be/src/util/json/simd_json_parser.h | Refactors SimdJSONParser to build an internal node tree from simdjson on-demand and exposes raw number tokens.
be/src/util/json/json_parser.h | Adds heuristic to preserve precision-sensitive doubles as strings based on raw token digits.
be/src/exprs/function/cast/cast_to_decimal.h | Enhances string→decimal cast to accept nullable string columns and propagate nulls.
be/src/core/column/column_variant.cpp | Preserves Variant nullability when casting typed paths to declared storage types (incl. arrays).
Comments suppressed due to low confidence (1)

be/src/util/json/json_parser.h:30

  • json_parser.h introduces should_preserve_json_number_as_string(std::string_view ...) but the header does not include <string_view>. Relying on transitive includes is brittle; add #include <string_view> to ensure this header compiles standalone.
#include <parallel_hashmap/phmap.h>

#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>
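
As context for the digit-based heuristic mentioned in this review, a minimal sketch of such a check might look like the following. The helper names `significant_digits` and `needs_exact_text` are hypothetical, and the threshold is an assumption; the actual body of `should_preserve_json_number_as_string` is not shown in this thread.

```cpp
#include <cctype>
#include <cstddef>
#include <limits>
#include <string_view>

// Hypothetical helper: counts significant digits in the mantissa of a raw JSON
// number token, skipping the sign, the decimal point, and leading zeros.
// Trailing zeros are counted, which errs on the side of preserving text.
std::size_t significant_digits(std::string_view token) {
    std::size_t count = 0;
    for (char c : token) {
        if (c == 'e' || c == 'E') break;  // exponent digits are not significant
        if (!std::isdigit(static_cast<unsigned char>(c))) continue;
        if (count == 0 && c == '0') continue;  // leading zero
        ++count;
    }
    return count;
}

// Assumption: a token is precision-sensitive when it carries more significant
// digits than a double is guaranteed to preserve (digits10, 15 for IEEE 754).
bool needs_exact_text(std::string_view token) {
    return significant_digits(token) >
           static_cast<std::size_t>(std::numeric_limits<double>::digits10);
}
```

Because the count stops at the exponent marker, a token such as `1e100` is not flagged, while a long mantissa is flagged whether or not an exponent follows it.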


Comment thread be/src/util/json/simd_json_parser.h
Contributor

@github-actions github-actions Bot left a comment

I found one blocking correctness regression. The PR addresses typed Variant decimal precision for the tested path, but the parser change currently applies to all Variant subcolumn extraction and changes some JSON numbers into strings when the path is not declared as a typed decimal path.

Critical checkpoints:

  • Goal/test: The PR fixes the covered typed decimal path case and adds BE/regression coverage for that case, but misses untyped Variant numeric semantics.
  • Scope/focus: The implementation is broader than the stated goal because generic JSON parsing now changes inferred Field type based on token precision.
  • Concurrency/lifecycle: No new shared concurrency or non-obvious lifecycle issue identified in the reviewed paths.
  • Compatibility/behavior: Behavior changes for existing Variant data: high-precision JSON numeric tokens on untyped paths can become strings. This is user-visible and not covered by compatibility/tests.
  • Parallel paths: Doc-value/jsonb writing still treats doubles as doubles; the subcolumn path is the one affected by the string conversion.
  • Tests: Added tests cover typed decimal conversion, but no negative/regression coverage proves ordinary untyped numbers remain numeric.
  • Performance/memory/observability: No additional blocking issue found beyond the correctness problem above.
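
One way to address the scope concern above is to gate the string representation on the declared path type. The sketch below is only an illustration under that assumption: `PreserveConfig`, `NumberRepr`, and `choose_repr` are hypothetical names, not Doris APIs.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical parse-time configuration: the set of Variant paths that the
// schema declares as DECIMAL. Names are illustrative, not Doris APIs.
struct PreserveConfig {
    std::unordered_set<std::string> decimal_paths;
};

enum class NumberRepr { Double, ExactText };

// Only declared decimal paths keep the raw token; every other path keeps the
// usual double representation, so untyped numbers stay numeric.
NumberRepr choose_repr(const PreserveConfig& config, const std::string& path) {
    return config.decimal_paths.count(path) > 0 ? NumberRepr::ExactText
                                                : NumberRepr::Double;
}
```

With this shape, a high-precision token on an undeclared path is still inferred as a number, which is the behavior the review asks to keep.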

User focus: No additional user-provided review focus was specified.

Comment thread be/src/util/json/json_parser.h
@eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from ecba866 to 9ccf680 on May 17, 2026 10:19
@eldenmoon
Member Author

run buildall

@eldenmoon
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment

I found one additional correctness issue beyond the existing review threads.

Critical checkpoint conclusions:

  • Goal/test coverage: The PR aims to preserve high-precision JSON numbers for Variant decimal typed paths and adds BE/regression coverage, but scientific-notation decimal tokens remain uncovered and can still lose precision.
  • Scope/focus: The change is mostly focused on Variant JSON parsing and typed-path decimal conversion, though the simdjson wrapper rewrite is broader than the local fix.
  • Concurrency/lifecycle: No new shared mutable runtime state or lock-order concern found in the changed path. Parser-owned node lifetimes are local to each parse call.
  • Configuration/compatibility: No new config or storage-format change found. Existing typed path schema metadata is used to drive parsing.
  • Parallel paths: Typed decimal preservation is wired into normal Variant ingest, but the heuristic misses exponent notation on the same path.
  • Error handling/memory safety: No ignored Status issue found in the reviewed changes. The parser rewrite adds extra allocations, but I did not find a separate blocking memory-safety issue.
  • Data correctness: Blocking issue below: some valid decimal typed-path inputs still round through double before decimal conversion.
  • Observability: No additional observability requirement identified for this local parsing/casting fix.
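
To make the exponent-notation concern concrete, the sketch below expands a token such as `1.5e3` into plain decimal text without going through double. The function name `expand_exponent` is hypothetical, the input is assumed to be a valid JSON number, and this is not the approach confirmed by the PR.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <string_view>

// Expands a JSON number in exponent form into plain decimal text using only
// string operations. Assumes `token` is already a valid JSON number.
// Leading zeros in the result (e.g. "05" for "0.5e1") are not trimmed.
std::string expand_exponent(std::string_view token) {
    std::string sign;
    if (!token.empty() && token.front() == '-') {
        sign = "-";
        token.remove_prefix(1);
    }
    // Split the mantissa from the exponent, if there is one.
    const std::size_t e_pos = token.find_first_of("eE");
    const std::string_view mantissa = token.substr(0, e_pos);
    long exponent = 0;
    if (e_pos != std::string_view::npos) {
        exponent = std::strtol(std::string(token.substr(e_pos + 1)).c_str(), nullptr, 10);
    }
    // Collect the digits and count how many sit after the decimal point.
    std::string digits;
    long scale = 0;
    bool seen_point = false;
    for (char c : mantissa) {
        if (c == '.') {
            seen_point = true;
            continue;
        }
        digits.push_back(c);
        if (seen_point) ++scale;
    }
    scale -= exponent;  // the value equals digits * 10^(-scale)
    if (scale <= 0) {
        digits.append(static_cast<std::size_t>(-scale), '0');
        return sign + digits;
    }
    const std::size_t frac = static_cast<std::size_t>(scale);
    if (frac >= digits.size()) {
        // Pad so that exactly one digit remains before the decimal point.
        digits = std::string(frac - digits.size() + 1, '0') + digits;
    }
    digits.insert(digits.size() - frac, 1, '.');
    return sign + digits;
}
```

Every mantissa digit is carried through unchanged, so a token like `1.234567890123456789e2` keeps all of its digits when it is later cast to a decimal type.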

User focus: No additional user-provided review focus was specified.

Comment thread be/src/util/json/json_parser.h Outdated
@eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 9ccf680 to 832ca14 on May 17, 2026 10:37
@eldenmoon
Member Author

run buildall

@eldenmoon
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment

I found one additional correctness issue. The existing review threads already cover the raw-number heuristic/scientific-notation precision cases and the missing self-contained <cassert> include, so I did not duplicate those.

Critical checkpoint conclusions:

  • Goal/test: the PR aims to preserve high-precision decimal JSON tokens for Variant predefined decimal paths; it adds non-doc-mode unit/regression coverage, but doc-mode predefined typed paths are not covered and still lose precision.
  • Scope: the change is focused, but the fix is incomplete across the parallel doc-mode write path.
  • Concurrency/lifecycle: no new shared mutable concurrency or special lifecycle issue found in the reviewed change.
  • Config/compatibility: no new external config or storage-format compatibility change found.
  • Parallel paths: doc-mode Variant materialization is a parallel path and needs the same decimal-preservation setup.
  • Tests: add a doc-mode predefined decimal regression case, preferably under variant_p0/doc_mode/predefine, to prove the fix.
  • Observability/transactions/data writes: no additional logging or transaction/persistence concern found beyond the data-correctness issue below.
  • Performance: the parser now materializes/copies the JSON tree to retain raw tokens; this is worth watching on insert hot paths, but I did not mark it blocking without benchmark evidence.

User focus: no additional user-provided review focus was supplied.

Comment thread be/src/exec/common/variant_util.cpp
@eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 832ca14 to 43b1a14 on May 17, 2026 10:47
@eldenmoon
Member Author

/review

@eldenmoon
Member Author

run buildall

Contributor

@github-actions github-actions Bot left a comment

I completed a full pass over the PR diff and the existing inline review context. I am requesting changes based on the already-open inline threads rather than adding duplicate comments.

Summary opinion: the PR is targeted at preserving high-precision JSON decimal tokens for predefined Variant decimal paths, and the current non-doc-mode path is closer to the goal. However, the implementation is still incomplete for doc-mode predefined Variant columns, where the parser is configured as OnlyDocValueColumn and the decimal raw-number preserve paths are still not installed. be/src/util/json/simd_json_parser.h also still uses assert() without including <cassert>, so the header remains non-self-contained. I did not find additional distinct issues beyond the existing threads.

Critical checkpoint conclusions:

  • Goal and tests: the non-doc-mode typed decimal case has regression/unit coverage, but doc-mode typed decimal coverage is still missing and the doc-mode path does not accomplish the precision-preservation goal.
  • Scope/focus: the functional change is mostly focused, though the simdjson wrapper refactor is larger than the minimal parser hook; no separate blocking issue found beyond the existing header/self-containment concern.
  • Concurrency: no new shared mutable cross-thread state or lock interaction was introduced in the reviewed paths.
  • Lifecycle/static initialization: no new static initialization dependency was found. Parser element lifetimes are local to parsing/traversal and copied into the new tree before the padded input is destroyed.
  • Configuration: no new configuration items were added.
  • Compatibility: no storage-format or FE-BE protocol change was found. The remaining compatibility risk is behavioral inconsistency for doc-mode predefined decimal Variant paths.
  • Parallel paths: doc-mode is the parallel insert/materialization path that still needs the same decimal preserve-path setup.
  • Conditional checks: the typed-path preservation condition is now path-gated; no additional distinct condition issue found.
  • Test coverage/results: tests cover non-doc mode, arrays, glob matching, and scientific notation, but not doc-mode predefined decimals.
  • Observability: no new observability requirement identified for this parser/cast fix.
  • Transactions/persistence/data writes: no new transaction or persistence metadata path was introduced.
  • Performance: the JSON parser wrapper now materializes a copied node tree, but I did not identify a concrete correctness blocker from this beyond the existing issues.
  • User focus: no additional user-provided review focus was specified.
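
The checkpoints above mention that preservation is path-gated and that tests cover glob matching. The glob semantics Doris uses are not shown in this thread, so the following is only a sketch under the assumption that `*` matches any run of characters; `glob_match` is a hypothetical name.

```cpp
#include <cstddef>
#include <string_view>

// Minimal wildcard matcher: '*' matches any (possibly empty) run of
// characters, and every other character matches itself.
bool glob_match(std::string_view pattern, std::string_view path) {
    std::size_t p = 0;
    std::size_t s = 0;
    std::size_t star = std::string_view::npos;  // position of the last '*'
    std::size_t mark = 0;                       // path position tied to it
    while (s < path.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;
            mark = s;
        } else if (p < pattern.size() && pattern[p] == path[s]) {
            ++p;
            ++s;
        } else if (star != std::string_view::npos) {
            p = star + 1;  // backtrack: let '*' absorb one more character
            s = ++mark;
        } else {
            return false;
        }
    }
    // Any trailing '*' in the pattern can match the empty string.
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}
```

A decimal-preservation gate could then treat a path as typed when it equals a declared path or matches a declared pattern such as `metrics.*.price`.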

@eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 43b1a14 to 7885a6c on May 17, 2026 11:08
@eldenmoon
Member Author

run buildall

@eldenmoon
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment

I found one additional correctness issue in the nullable string-to-decimal cast path. Existing inline threads already cover the remaining blockers around doc-mode typed decimal preservation and parser/header concerns, so I did not duplicate those comments.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to preserve exact JSON numeric text for predefined decimal Variant typed paths and adds unit/regression coverage, but doc-mode typed paths remain covered by an existing unresolved thread and nullable decimal conversion needs more coverage.
  • Scope: The change is mostly focused, but the parser replacement adds broad allocation/copying behavior; no separate blocker found beyond the submitted issue and existing comments.
  • Concurrency/lifecycle: No new shared concurrent state or non-obvious lifecycle risk identified; parser result references are copied into the owned tree before traversal.
  • Config/compatibility: No new config or storage format incompatibility identified.
  • Parallel paths: Non-doc-mode path is configured; doc-mode path remains an existing unresolved review concern.
  • Error handling/data correctness: The new strict nullable cast path can misparse rows because skipped null rows do not advance string offsets in the decimal serde.
  • Tests: Added tests cover high-precision predefined decimals in non-doc mode, arrays, globs, and exponent cases; missing coverage remains for doc mode (already raised) and nullable/missing typed decimal rows.
  • Observability/performance: No new observability requirement identified; parser memory overhead is higher but not raised as a blocking issue.
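
The offset issue flagged in the checkpoints above can be illustrated with a simplified column layout. This is a sketch under assumptions: `StringColumn` and `read_rows` are hypothetical and only resemble an offsets-based string column; they are not the Doris serde code.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical columnar string layout: all row bytes are concatenated in
// `chars`, and offsets[i] is the end position of row i.
struct StringColumn {
    std::string chars;
    std::vector<std::size_t> offsets;
};

// Reads every row, returning std::nullopt for rows flagged in `null_map`.
// The start position advances for null rows too; skipping that update would
// make the next non-null row read from the wrong byte range.
std::vector<std::optional<std::string>> read_rows(const StringColumn& column,
                                                  const std::vector<bool>& null_map) {
    assert(null_map.size() == column.offsets.size());
    std::vector<std::optional<std::string>> rows;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < column.offsets.size(); ++i) {
        const std::size_t end = column.offsets[i];
        if (null_map[i]) {
            rows.emplace_back(std::nullopt);
        } else {
            rows.emplace_back(column.chars.substr(begin, end - begin));
        }
        begin = end;  // always advance, even when the row is null
    }
    return rows;
}
```

If a null row still occupies bytes in `chars` (a placeholder value, for instance) and `begin` is not advanced past it, the following row is parsed with those stale bytes prepended, which matches the misparse described in the review.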

User focus: No additional user-provided review focus was specified.

Comment thread be/src/exprs/function/cast/cast_to_decimal.h
@hello-stephen
Contributor

TPC-H: Total hot run time: 31288 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9ccf680bcc7622c4472bb012526de8e07acd5055, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17710	3972	3884	3884
q2	q3	10805	1389	829	829
q4	4686	479	344	344
q5	7561	2298	2145	2145
q6	362	178	139	139
q7	971	774	652	652
q8	9641	1721	1729	1721
q9	6702	4974	4944	4944
q10	6470	2150	1828	1828
q11	454	273	247	247
q12	700	420	295	295
q13	18248	3545	2804	2804
q14	264	255	237	237
q15	q16	823	776	705	705
q17	934	877	947	877
q18	6949	5687	5721	5687
q19	1250	1270	1064	1064
q20	496	418	255	255
q21	5749	2580	2326	2326
q22	444	377	305	305
Total cold run time: 101219 ms
Total hot run time: 31288 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4201	4114	4127	4114
q2	q3	4514	4886	4366	4366
q4	2138	2196	1421	1421
q5	4426	4305	4617	4305
q6	249	193	146	146
q7	2061	1850	1599	1599
q8	2493	2099	2144	2099
q9	7794	7881	7766	7766
q10	4558	4497	4326	4326
q11	617	456	385	385
q12	758	732	535	535
q13	3258	3683	3044	3044
q14	292	318	294	294
q15	q16	720	768	658	658
q17	1308	1322	1317	1317
q18	7812	7413	6904	6904
q19	1127	1086	1102	1086
q20	2233	2233	1938	1938
q21	5373	4660	4564	4564
q22	529	474	406	406
Total cold run time: 56461 ms
Total hot run time: 51273 ms

@hello-stephen
Contributor

TPC-H: Total hot run time: 31294 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 43b1a1432eb96745bdb80b377765e638551b4e56, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17684	3930	3838	3838
q2	q3	10778	1358	779	779
q4	4686	474	339	339
q5	7570	2248	2114	2114
q6	389	176	138	138
q7	938	783	619	619
q8	9482	1809	1632	1632
q9	7044	4914	4867	4867
q10	6442	2094	1812	1812
q11	441	276	250	250
q12	693	422	296	296
q13	18244	3462	2833	2833
q14	271	255	235	235
q15	q16	823	776	706	706
q17	977	938	927	927
q18	6849	5646	5541	5541
q19	1187	1270	1238	1238
q20	533	418	278	278
q21	6104	2833	2542	2542
q22	456	370	310	310
Total cold run time: 101591 ms
Total hot run time: 31294 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4695	4548	4689	4548
q2	q3	4887	5278	4672	4672
q4	2120	2217	1431	1431
q5	4873	4738	4655	4655
q6	235	183	129	129
q7	1888	1691	1578	1578
q8	2404	1935	1920	1920
q9	7305	7282	7213	7213
q10	4521	4430	3956	3956
q11	544	383	383	383
q12	709	721	507	507
q13	2993	3495	2787	2787
q14	270	287	247	247
q15	q16	677	709	624	624
q17	1272	1229	1251	1229
q18	7358	6840	6842	6840
q19	1152	1085	1127	1085
q20	2219	2219	1937	1937
q21	5365	4717	4535	4535
q22	529	472	411	411
Total cold run time: 56016 ms
Total hot run time: 50687 ms

@hello-stephen
Contributor

TPC-H: Total hot run time: 31205 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 832ca14461bc0faf4fbf9b0d2a91c7df1111f393, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17647	3855	3817	3817
q2	q3	10758	1360	803	803
q4	4705	479	344	344
q5	8069	2303	2126	2126
q6	386	176	138	138
q7	955	770	647	647
q8	9675	1690	1594	1594
q9	7088	4964	4934	4934
q10	6460	2120	1810	1810
q11	441	280	248	248
q12	697	429	298	298
q13	18232	3396	2793	2793
q14	264	262	237	237
q15	q16	813	772	701	701
q17	920	886	1021	886
q18	6905	5624	5516	5516
q19	1194	1440	1122	1122
q20	498	409	275	275
q21	6230	2743	2613	2613
q22	462	380	303	303
Total cold run time: 102399 ms
Total hot run time: 31205 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4623	4679	4591	4591
q2	q3	4865	5289	4606	4606
q4	2148	2317	1468	1468
q5	4825	4698	4662	4662
q6	233	173	127	127
q7	1866	1744	1380	1380
q8	2208	1925	1905	1905
q9	7222	7185	7156	7156
q10	4557	4422	3997	3997
q11	536	384	352	352
q12	716	721	512	512
q13	3012	3393	2779	2779
q14	266	276	251	251
q15	q16	676	703	625	625
q17	1264	1234	1229	1229
q18	7313	6907	6764	6764
q19	1145	1134	1105	1105
q20	2205	2211	1938	1938
q21	5329	4666	4504	4504
q22	518	454	420	420
Total cold run time: 55527 ms
Total hot run time: 50371 ms

@eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 7885a6c to 85fcf56 on May 17, 2026 11:41
@eldenmoon
Member Author

run buildall

@eldenmoon
Member Author

/review

@hello-stephen
Contributor

TPC-H: Total hot run time: 30897 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7885a6ce76862cca4cd5fd9cc18bd2d7185fb161, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17624	3801	3808	3801
q2	q3	10822	1327	851	851
q4	4680	469	357	357
q5	7595	2284	2081	2081
q6	240	177	137	137
q7	929	754	651	651
q8	9422	1780	1582	1582
q9	6585	4875	4841	4841
q10	6446	2111	1809	1809
q11	438	264	242	242
q12	699	429	291	291
q13	18190	3354	2742	2742
q14	268	254	234	234
q15	q16	814	781	702	702
q17	950	969	987	969
q18	6741	5630	5556	5556
q19	1181	1349	1052	1052
q20	529	410	258	258
q21	5902	2593	2437	2437
q22	432	368	304	304
Total cold run time: 100487 ms
Total hot run time: 30897 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4170	4076	4055	4055
q2	q3	4480	4864	4305	4305
q4	2090	2193	1375	1375
q5	4317	4301	4236	4236
q6	232	186	142	142
q7	2099	1928	1768	1768
q8	2488	2175	2103	2103
q9	8040	7818	7854	7818
q10	4638	4567	4118	4118
q11	600	417	385	385
q12	930	785	559	559
q13	3296	3552	2995	2995
q14	291	300	271	271
q15	q16	717	763	628	628
q17	1336	1298	1299	1298
q18	7931	7265	7050	7050
q19	1160	1067	1110	1067
q20	2216	2191	1930	1930
q21	5220	4542	4384	4384
q22	508	455	405	405
Total cold run time: 56759 ms
Total hot run time: 50892 ms

@eldenmoon
Member Author

run buildall

Contributor

@github-actions github-actions Bot left a comment

Automated review completed for PR 63322 at 95ccafea2253112ebae2ad285d71df8be7689d9b.

I reviewed the current diff and the existing inline review threads first. I did not find any additional distinct blocking issue to raise as a new inline comment. Previously raised topics around nullable Variant load handling, raw-number parser overhead, escaped object keys, nullable decimal casts, doc-mode decimal handling, exponent-form decimals, untyped Variant number behavior, and parser header self-containment were treated as already-known review context and were not duplicated.

Critical checkpoint conclusions:

  • Goal and tests: The PR aims to preserve exact JSON numeric tokens for Variant predefined decimal typed paths, including stream-load paths, arrays, glob typed paths, nullable casts, and scientific notation. The added BE unit tests and regression suite cover the main intended non-doc-mode paths.
  • Scope and focus: The change is focused on Variant JSON parsing/materialization and decimal casting. The larger parser refactor is now gated by decimal preserve configuration rather than applied to all parses.
  • Concurrency: No new shared mutable runtime state was added except reuse of existing glob regex cache locking. The parser changes use per-parser owned state and do not introduce new concurrent access patterns.
  • Lifecycle: The ondemand parser path copies parsed values into owned Node storage before returning parse results, so no lifetime issue was identified in the reviewed current code.
  • Configuration: No new runtime configuration item was added.
  • Compatibility: No persisted storage format or FE-BE protocol layout change was identified. Doc-mode predefined decimal is covered as unsupported by the added negative regression case.
  • Parallel code paths: Stream-load and SQL/materialization paths were both considered; the current tests cover object-form and string-encoded stream load rows.
  • Conditional checks: The decimal raw-number preservation is now path-configured and limited to typed decimal paths or glob matches; I did not identify a new incorrect condition.
  • Test coverage: Coverage is reasonable for the targeted behavior. I did not run tests in this review runner.
  • Test results: The new regression case uses assertions rather than a handwritten .out; the modified .out reflects an existing suite output change.
  • Observability: No new observability requirement was identified for this parser/cast correctness fix.
  • Transactions and persistence: The reviewed changes do not modify transaction, edit-log, publish, or rowset visibility logic.
  • Data writes and modifications: The write-path changes preserve Variant materialization for typed decimal paths; no new atomicity issue was identified beyond previously reviewed concerns.
  • FE/BE variables: No new FE-to-BE variable passing was added.
  • Performance: The expensive ondemand materialization path is gated by typed decimal preservation. I did not find an additional performance issue distinct from the existing review context.
  • Other issues: No additional distinct issue found.

User focus points: No additional user-provided review focus was specified.

@hello-stephen
Contributor

TPC-H: Total hot run time: 31380 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6d77c2e151db0215a94add25f0ed9e0300fa3649, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17667	4058	4099	4058
q2	q3	10791	1420	804	804
q4	4679	466	352	352
q5	7557	2267	2148	2148
q6	378	175	139	139
q7	985	764	679	679
q8	9374	1690	1549	1549
q9	6793	4930	4934	4930
q10	6463	2127	1808	1808
q11	435	269	244	244
q12	690	433	290	290
q13	18213	3302	2805	2805
q14	275	255	243	243
q15	q16	828	768	717	717
q17	1023	974	934	934
q18	6914	5870	5656	5656
q19	1300	1228	1066	1066
q20	524	422	265	265
q21	5795	2634	2385	2385
q22	435	354	308	308
Total cold run time: 101119 ms
Total hot run time: 31380 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4281	4214	4305	4214
q2	q3	4532	4940	4301	4301
q4	2117	2176	1401	1401
q5	4387	4299	4632	4299
q6	247	200	143	143
q7	2012	1806	1696	1696
q8	2494	2122	2150	2122
q9	7786	7855	7792	7792
q10	4539	4510	4122	4122
q11	636	422	386	386
q12	707	749	512	512
q13	3309	3654	2958	2958
q14	300	297	291	291
q15	q16	727	723	661	661
q17	1425	1351	1368	1351
q18	7999	7528	6945	6945
q19	1151	1099	1108	1099
q20	2212	2215	1936	1936
q21	5346	4697	4491	4491
q22	522	472	406	406
Total cold run time: 56729 ms
Total hot run time: 51126 ms

@hello-stephen
Contributor

TPC-H: Total hot run time: 30935 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 95ccafea2253112ebae2ad285d71df8be7689d9b, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17920	3941	3909	3909
q2	q3	10782	1407	802	802
q4	4687	471	356	356
q5	7587	2266	2101	2101
q6	232	174	138	138
q7	936	788	632	632
q8	9386	1753	1569	1569
q9	5075	4879	4869	4869
q10	6401	2134	1817	1817
q11	431	270	240	240
q12	624	440	302	302
q13	18084	3369	2686	2686
q14	260	254	235	235
q15	q16	825	773	710	710
q17	991	865	909	865
q18	6860	5716	5726	5716
q19	1361	1212	1097	1097
q20	533	410	266	266
q21	6103	2624	2324	2324
q22	418	360	301	301
Total cold run time: 99496 ms
Total hot run time: 30935 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4158	4115	4082	4082
q2	q3	4467	4892	4283	4283
q4	2117	2189	1385	1385
q5	4372	4261	4273	4261
q6	225	177	130	130
q7	1733	2140	1821	1821
q8	2590	2148	2122	2122
q9	7911	8092	7663	7663
q10	4548	4501	4042	4042
q11	552	408	375	375
q12	728	720	525	525
q13	3298	3687	2951	2951
q14	315	310	274	274
q15	q16	718	733	636	636
q17	1330	1296	1290	1290
q18	8015	7320	7179	7179
q19	1163	1131	1159	1131
q20	2200	2215	1932	1932
q21	5288	4573	4439	4439
q22	516	464	415	415
Total cold run time: 56244 ms
Total hot run time: 50936 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170186 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6d77c2e151db0215a94add25f0ed9e0300fa3649, data reload: false

query5	4315	663	516	516
query6	331	221	197	197
query7	4278	553	296	296
query8	323	230	225	225
query9	8832	4064	4025	4025
query10	456	344	288	288
query11	5863	2411	2205	2205
query12	175	127	121	121
query13	1330	640	432	432
query14	6024	5415	5108	5108
query14_1	4414	4396	4377	4377
query15	210	205	176	176
query16	1014	448	417	417
query17	1025	706	590	590
query18	2484	495	348	348
query19	236	205	161	161
query20	135	132	129	129
query21	220	146	117	117
query22	13773	13494	13421	13421
query23	17140	16530	16119	16119
query23_1	16389	16401	16291	16291
query24	7614	1746	1280	1280
query24_1	1326	1296	1264	1264
query25	559	473	428	428
query26	1368	319	175	175
query27	2641	560	343	343
query28	4458	1921	1923	1921
query29	1006	621	490	490
query30	328	241	198	198
query31	1102	1055	959	959
query32	87	76	74	74
query33	540	348	294	294
query34	1158	1222	642	642
query35	765	776	691	691
query36	1359	1333	1152	1152
query37	157	106	92	92
query38	3250	3150	3074	3074
query39	950	941	894	894
query39_1	889	882	903	882
query40	252	158	130	130
query41	72	69	70	69
query42	117	113	111	111
query43	340	335	304	304
query44	
query45	217	204	208	204
query46	1102	1185	737	737
query47	2337	2330	2156	2156
query48	411	428	301	301
query49	657	524	402	402
query50	982	363	251	251
query51	4446	4314	4212	4212
query52	107	106	96	96
query53	261	285	209	209
query54	331	286	268	268
query55	94	92	88	88
query56	324	325	325	325
query57	1419	1383	1267	1267
query58	311	285	274	274
query59	1607	1695	1486	1486
query60	343	333	322	322
query61	183	184	182	182
query62	670	632	554	554
query63	243	205	214	205
query64	2500	882	693	693
query65	
query66	1738	493	362	362
query67	30381	30091	30122	30091
query68	
query69	485	352	307	307
query70	1016	986	1040	986
query71	320	280	281	280
query72	3047	2657	2420	2420
query73	860	811	436	436
query74	5086	4925	4737	4737
query75	2700	2591	2261	2261
query76	2284	1149	764	764
query77	402	419	333	333
query78	12160	12161	11651	11651
query79	1479	1022	699	699
query80	1242	557	496	496
query81	510	279	244	244
query82	1355	158	122	122
query83	359	277	261	261
query84	302	138	110	110
query85	937	539	455	455
query86	452	351	320	320
query87	3438	3379	3237	3237
query88	3593	2649	2632	2632
query89	458	382	336	336
query90	1796	181	185	181
query91	180	177	142	142
query92	82	88	77	77
query93	1508	1417	879	879
query94	681	329	308	308
query95	689	395	434	395
query96	985	793	353	353
query97	2674	2671	2575	2575
query98	239	230	228	228
query99	1134	1089	986	986
Total cold run time: 255506 ms
Total hot run time: 170186 ms

@eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 95ccafe to 1e5d7d3 on May 17, 2026 15:55
@eldenmoon
Member Author

run buildall

@eldenmoon
Member Author

/review

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168730 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 95ccafea2253112ebae2ad285d71df8be7689d9b, data reload: false

query5	4335	655	509	509
query6	325	212	198	198
query7	4229	522	299	299
query8	324	233	258	233
query9	8820	3970	3966	3966
query10	444	334	293	293
query11	5803	2417	2187	2187
query12	176	130	123	123
query13	1265	598	428	428
query14	5878	5330	4983	4983
query14_1	4290	4300	4314	4300
query15	208	203	178	178
query16	1056	425	411	411
query17	975	737	599	599
query18	2481	483	362	362
query19	220	230	161	161
query20	136	133	134	133
query21	212	136	120	120
query22	13578	13442	13366	13366
query23	17163	16228	15996	15996
query23_1	16072	16100	16157	16100
query24	7496	1709	1315	1315
query24_1	1305	1304	1314	1304
query25	548	477	416	416
query26	1306	313	168	168
query27	2729	542	326	326
query28	4481	1921	1921	1921
query29	975	608	489	489
query30	300	238	198	198
query31	1110	1057	934	934
query32	88	76	71	71
query33	554	343	288	288
query34	1171	1129	626	626
query35	743	788	671	671
query36	1336	1345	1162	1162
query37	152	104	90	90
query38	3178	3125	3036	3036
query39	933	926	892	892
query39_1	873	889	875	875
query40	237	148	128	128
query41	72	69	69	69
query42	114	113	112	112
query43	319	323	286	286
query44	
query45	211	204	204	204
query46	1100	1193	732	732
query47	2348	2351	2192	2192
query48	400	426	294	294
query49	669	521	405	405
query50	997	358	253	253
query51	4316	4223	4217	4217
query52	106	109	96	96
query53	262	284	211	211
query54	336	302	289	289
query55	96	92	88	88
query56	308	333	315	315
query57	1417	1416	1326	1326
query58	302	288	282	282
query59	1535	1655	1449	1449
query60	330	325	327	325
query61	187	179	185	179
query62	671	635	572	572
query63	248	202	212	202
query64	2476	866	692	692
query65	
query66	1763	502	366	366
query67	29409	30019	29838	29838
query68	
query69	457	349	310	310
query70	1058	964	976	964
query71	304	280	307	280
query72	3006	2657	2419	2419
query73	818	749	418	418
query74	5063	4880	4731	4731
query75	2672	2601	2259	2259
query76	2284	1153	762	762
query77	395	408	340	340
query78	12213	12153	11689	11689
query79	1500	1014	753	753
query80	645	541	456	456
query81	447	282	244	244
query82	1353	158	119	119
query83	350	285	253	253
query84	307	140	106	106
query85	886	542	465	465
query86	388	336	307	307
query87	3402	3321	3222	3222
query88	3542	2682	2641	2641
query89	452	389	351	351
query90	1958	174	181	174
query91	182	172	146	146
query92	77	81	78	78
query93	1549	1476	831	831
query94	542	330	297	297
query95	717	474	347	347
query96	1075	767	361	361
query97	2679	2670	2569	2569
query98	232	226	227	226
query99	1132	1084	998	998
Total cold run time: 251895 ms
Total hot run time: 168730 ms

Contributor

@github-actions github-actions Bot left a comment

Review summary: I did not find additional distinct blocking issues in the current head after considering the existing inline review context and avoiding duplicate comments.

Critical checkpoint conclusions:

  • Goal and tests: The PR targets preserving high-precision JSON numeric tokens for Variant decimal typed paths and delaying raw Variant JSON parsing until tablet-schema typed-path configuration is available. The added BE unit tests and regression suite cover typed decimals, scientific notation, CSV/JSON stream load, nullable Variant rows, sparse typed paths, row-store reads, and doc-mode rejection.
  • Scope and focus: The change is mostly localized to Variant parsing/materialization, decimal string casts, load readers, and segment-writer materialization order.
  • Concurrency: No new shared mutable runtime state is introduced on the write path. The existing glob regex cache remains mutex-protected; parser instances are per parser object/use.
  • Lifecycle/static initialization: No new cross-translation-unit static initialization dependency was found. The new parser tree storage is owned by the parser instance and consumed before the next parse.
  • Configuration: No new user-facing config item is added. Existing Variant/doc-mode behavior is explicitly rejected for decimal typed paths.
  • Compatibility/storage format: No storage format change was identified; the change affects how incoming raw JSON is materialized before existing storage conversion.
  • Parallel paths: JSON stream load, CSV/text load via shared reader paths, segment writer, vertical segment writer, row store, and partial-update paths were checked at a code-review level.
  • Tests/results: Added tests follow deterministic assertions and drop tables before use. I did not run the full regression or BE unit suites in this runner.
  • Observability: No new long-running distributed operation was added; existing errors and Status propagation appear sufficient for this scoped change.
  • Transaction/persistence/data correctness: No EditLog or transaction metadata change is involved. The important data-correctness path is Variant materialization before row-store/column conversion, which is now ordered before serialization in the modified writer paths.
  • Performance/memory: The current head only switches to raw-number-preserving ondemand parsing when decimal typed paths or a matcher are configured, which avoids the unconditional hot-path cost noted in earlier review context.
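The performance point above, choosing the raw-number-preserving path only when it is needed, can be sketched roughly as follows. This is an illustrative outline and not the PR's code: `NumberMode`, `VariantParseConfig`, and `choose_number_mode` are hypothetical names introduced here, and the only assumption taken from the review is that raw text is kept when decimal typed paths or a matcher are configured.

```cpp
#include <string>
#include <vector>

// Hypothetical types, for illustration only; not the PR's actual API.
enum class NumberMode { kDouble, kRawText };

struct VariantParseConfig {
    std::vector<std::string> decimal_typed_paths;  // predefined decimal typed paths, if any
    bool has_glob_matcher = false;                 // true when a glob typed-path matcher is configured
};

// Keep the raw JSON number text only when a decimal typed path or a matcher
// could need the exact digits; otherwise stay on the regular numeric path.
NumberMode choose_number_mode(const VariantParseConfig& cfg) {
    if (!cfg.decimal_typed_paths.empty() || cfg.has_glob_matcher) {
        return NumberMode::kRawText;
    }
    return NumberMode::kDouble;
}
```

Under this sketch, a Variant column with no decimal typed path and no matcher never pays for raw-token handling, which is the hot-path concern the earlier review context raised.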

Focus-point response: No additional user-provided review focus was supplied.

@hello-stephen
Contributor

TPC-H: Total hot run time: 31342 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1e5d7d35ae5f95302d9c310ad534a8d0cafd09d5, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17595	3915	4019	3915
q2
q3	10759	1411	807	807
q4	4680	472	343	343
q5	7544	2344	2126	2126
q6	241	196	147	147
q7	950	798	642	642
q8	9669	1672	1574	1574
q9	5275	5011	5098	5011
q10	6390	2077	1795	1795
q11	480	274	249	249
q12	643	434	295	295
q13	18107	3412	2722	2722
q14	270	254	238	238
q15
q16	822	765	701	701
q17	953	960	854	854
q18	6955	5649	5555	5555
q19	1299	1260	1030	1030
q20	541	568	316	316
q21	6016	2830	2694	2694
q22	480	384	328	328
Total cold run time: 99669 ms
Total hot run time: 31342 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4954	4621	4529	4529
q2
q3	4863	5262	4579	4579
q4	2117	2232	1398	1398
q5	4737	4810	4690	4690
q6	236	175	131	131
q7	1915	1820	1518	1518
q8	2353	2139	2183	2139
q9	7873	7617	7318	7318
q10	4489	4400	3983	3983
q11	534	414	359	359
q12	711	730	505	505
q13	3030	3436	2827	2827
q14	277	278	253	253
q15
q16	675	694	622	622
q17	1267	1237	1262	1237
q18	7346	6938	6742	6742
q19	1135	1097	1127	1097
q20	2203	2218	1949	1949
q21	5392	4690	4661	4661
q22	522	460	416	416
Total cold run time: 56629 ms
Total hot run time: 50953 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 72.04% (304/422) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.51% (20659/38607)
Line Coverage 37.17% (195304/525439)
Region Coverage 33.55% (152875/455703)
Branch Coverage 34.60% (66711/192829)

@eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 1e5d7d3 to 45d9060 on May 17, 2026 16:16
@eldenmoon
Member Author

/review

@eldenmoon
Member Author

run buildall

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169675 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1e5d7d35ae5f95302d9c310ad534a8d0cafd09d5, data reload: false

query5	4320	667	531	531
query6	340	227	203	203
query7	4223	567	306	306
query8	328	234	227	227
query9	8845	4055	4018	4018
query10	452	341	303	303
query11	5754	2423	2211	2211
query12	178	128	128	128
query13	1266	607	428	428
query14	6031	5395	5028	5028
query14_1	4330	4380	4353	4353
query15	213	200	182	182
query16	1019	439	442	439
query17	942	721	575	575
query18	2448	485	358	358
query19	211	203	161	161
query20	134	142	126	126
query21	216	137	120	120
query22	13560	13639	13387	13387
query23	17242	16446	15978	15978
query23_1	16203	16173	16160	16160
query24	7407	1733	1292	1292
query24_1	1302	1320	1306	1306
query25	566	484	434	434
query26	1316	335	192	192
query27	2684	573	337	337
query28	4569	1953	1907	1907
query29	993	630	526	526
query30	309	243	207	207
query31	1114	1067	943	943
query32	105	79	79	79
query33	548	363	335	335
query34	1180	1148	648	648
query35	768	773	670	670
query36	1303	1349	1092	1092
query37	151	107	92	92
query38	3242	3138	3078	3078
query39	936	916	901	901
query39_1	897	877	879	877
query40	229	165	135	135
query41	83	74	63	63
query42	111	112	109	109
query43	331	329	303	303
query44	
query45	208	196	192	192
query46	1048	1168	734	734
query47	2330	2263	2125	2125
query48	405	432	285	285
query49	628	496	377	377
query50	963	345	274	274
query51	4396	4239	4231	4231
query52	116	113	96	96
query53	256	285	207	207
query54	324	263	263	263
query55	95	91	85	85
query56	301	306	310	306
query57	1389	1374	1281	1281
query58	335	271	276	271
query59	1556	1598	1442	1442
query60	312	327	319	319
query61	195	162	153	153
query62	664	639	545	545
query63	239	204	210	204
query64	2445	854	652	652
query65	
query66	1745	463	357	357
query67	30068	30008	29412	29412
query68	
query69	469	333	300	300
query70	952	1013	1013	1013
query71	311	267	271	267
query72	3168	2765	2418	2418
query73	852	737	423	423
query74	5079	4897	4758	4758
query75	2698	2607	2256	2256
query76	2259	1164	794	794
query77	418	421	326	326
query78	12247	11912	11595	11595
query79	1261	1043	739	739
query80	596	550	461	461
query81	456	279	245	245
query82	242	161	120	120
query83	274	281	248	248
query84	288	134	115	115
query85	890	527	441	441
query86	368	346	357	346
query87	3368	3386	3243	3243
query88	3532	2680	2692	2680
query89	430	388	337	337
query90	2156	187	183	183
query91	178	169	138	138
query92	78	76	73	73
query93	1568	1439	887	887
query94	548	352	281	281
query95	689	486	342	342
query96	1028	797	341	341
query97	2679	2727	2580	2580
query98	236	227	223	223
query99	1093	1104	990	990
Total cold run time: 251776 ms
Total hot run time: 169675 ms

Contributor

@github-actions github-actions Bot left a comment

I found one additional correctness issue beyond the existing review threads.

Critical checkpoint conclusions:

  • Goal/test: the PR aims to preserve high-precision JSON numbers for Variant typed decimal paths across load and row-store paths. The added tests cover Decimal128-scale decimals, scientific notation, nullable casts, JSON/CSV/Hive text, and row-store, but miss Decimal256 integer tokens beyond Int128.
  • Scope/clarity: the changes are mostly focused on schema-driven Variant materialization and row-store ordering.
  • Concurrency/lifecycle: no new concurrent state or non-obvious lifecycle issue found in the changed paths.
  • Config/compatibility: no new dynamic config or storage format incompatibility found; doc-mode decimal typed paths are rejected rather than silently losing precision.
  • Parallel paths: JSON, CSV, Hive text, segment writer, and vertical segment writer paths were checked; the Decimal256 gap applies to all paths that enable raw-number preservation.
  • Error handling/data correctness: the new ondemand raw-number path can reject valid Decimal256 typed-path input before decimal casting, causing incorrect load behavior.
  • Test coverage: missing a typed decimalv3(76,0)/Decimal256 integer larger than Int128.
  • Observability/transactions/persistence/FE-BE variables: not materially affected by this PR.
  • User focus: no additional user-provided review focus was specified.
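To illustrate the Decimal256 gap described above, here is a minimal sketch, under stated assumptions, of why an integer token for a decimalv3(76,0) path cannot be forced through a signed 128-bit conversion first. `fits_in_int128` is a hypothetical helper written for this example; it is not a function from the PR or from simdjson, and it assumes a plain JSON integer token (optional `-`, then digits with no leading zeros).

```cpp
#include <string>

// Hypothetical helper, for illustration only: does a plain JSON integer token
// fit in a signed 128-bit integer? Works on text, so no wide arithmetic is needed.
bool fits_in_int128(const std::string& token) {
    static const std::string kMaxPositive = "170141183460469231731687303715884105727";  // 2^127 - 1
    static const std::string kMaxNegative = "170141183460469231731687303715884105728";  // |-(2^127)|
    const bool negative = !token.empty() && token[0] == '-';
    const std::string digits = token.substr(negative ? 1 : 0);
    const std::string& limit = negative ? kMaxNegative : kMaxPositive;
    if (digits.size() != limit.size()) {
        return digits.size() < limit.size();  // fewer digits always fits, more never does
    }
    return digits <= limit;  // equal length: lexicographic order matches numeric order
}
```

A 76-digit integer is valid for a decimalv3(76,0) path but fails this check, so a parser that requires an Int128 conversion before the decimal cast would reject input that the target type can hold. Keeping the token as text avoids that.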

Comment thread be/src/util/json/simd_json_parser.h
### What problem does this PR solve?

Issue Number: CIR-20240

Related PR: None

Problem Summary: Variant JSON ingestion parsed high precision decimal-like numbers as double before predefined decimal typed paths were cast, which rounded values and made decimal predicates miss exact JSON input. Preserve raw JSON number tokens only for Variant predefined decimal typed paths during subcolumn materialization, including glob typed-path patterns and precision-sensitive exponent notation. JSON, CSV, and text load paths now keep Variant raw JSON text until the segment writer can apply tablet-schema typed path config, while row-store columns are serialized after schema-aware Variant materialization. Untyped and doc-value parsing keep numeric behavior unchanged, and nullable string inputs advance offsets correctly in strict decimal batch parsing.
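The rounding described in the problem summary can be reproduced outside Doris with a small sketch. This is not code from the PR; `collide_as_double` is a hypothetical helper that only shows how two different JSON number tokens become the same value once they pass through `double`.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, for illustration only: returns true when two distinct
// JSON number tokens become indistinguishable after being parsed as double.
bool collide_as_double(const std::string& a, const std::string& b) {
    const double da = std::strtod(a.c_str(), nullptr);
    const double db = std::strtod(b.c_str(), nullptr);
    return a != b && da == db;
}
```

For example, `12345678901234567890.123456789` and `12345678901234567890.123456788` collide, because a double carries only about 15 to 17 significant decimal digits. A decimal predicate on the exact input can therefore miss after such a round trip, whereas casting from the preserved raw text sees every digit.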

### Release note

Fix precision loss when loading high precision JSON numbers into Variant predefined decimal typed paths.

### Check List (For Author)

- Test: Regression test / Unit Test / Manual test
    - Unit Test: ./run-be-ut.sh --run --filter='JsonParserTest.*:ColumnVariantTest.parse_high_precision_decimal_for_typed_path:ColumnVariantTest.cast_nullable_string_to_decimal_skips_null_payload'
    - Regression test: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY NO_PROXY=127.0.0.1,localhost,10.26.20.3 no_proxy=127.0.0.1,localhost,10.26.20.3 ./run-regression-test.sh --conf tmp/regression-conf.auto.groovy --run -d variant_p0/predefine -s test_variant_high_precision_decimal
    - Build: BUILD_TYPE=ASAN USE_MEM_TRACKER=ON ./build.sh --be --fe
    - Format Check: PATH=/tmp/doris-codex-clang:$PATH build-support/clang-format.sh; PATH=/tmp/doris-codex-clang:$PATH build-support/check-format.sh; git diff --check; git diff --cached --check
    - Static Check: build-support/run-clang-tidy.sh could not run because clang-tidy was not found in the local environment
- Behavior changed: Yes. Variant predefined decimal typed paths preserve exact high precision JSON number text for decimal casts during subcolumn materialization and JSON/CSV/text load. Row-store serialization for direct/schema-change writes now sees materialized Variant values. Untyped and doc-value parsing keep regular numeric behavior. Decimal typed paths in doc mode remain unsupported by FE and are covered by an expected-error regression.
- Does this need documentation: No

@eldenmoon force-pushed the codex/cir-20240-variant-precision-pr branch from 45d9060 to b2ec813 on May 17, 2026 16:38
@eldenmoon
Member Author

/review

@eldenmoon
Member Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 72.04% (304/422) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.51% (20659/38607)
Line Coverage 37.17% (195318/525439)
Region Coverage 33.56% (152947/455703)
Branch Coverage 34.60% (66716/192829)

@hello-stephen
Contributor

TPC-H: Total hot run time: 30776 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 45d9060c080f192ecb190462a4b01b6485daad2a, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17637	3846	3878	3846
q2
q3	10745	1410	812	812
q4	4688	475	352	352
q5	7634	2237	2071	2071
q6	240	177	138	138
q7	940	776	639	639
q8	9392	1666	1578	1578
q9	6581	4902	4886	4886
q10	6432	2123	1771	1771
q11	433	271	248	248
q12	693	422	291	291
q13	18247	3329	2784	2784
q14	264	254	232	232
q15
q16	823	771	703	703
q17	972	940	935	935
q18	6815	5835	5529	5529
q19	1238	1165	1089	1089
q20	505	392	269	269
q21	5620	2596	2303	2303
q22	426	359	300	300
Total cold run time: 100325 ms
Total hot run time: 30776 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4247	4107	4074	4074
q2
q3	4542	4868	4334	4334
q4	2113	2171	1372	1372
q5	4366	4252	4227	4227
q6	222	174	133	133
q7	2061	1857	1706	1706
q8	2536	2084	2035	2035
q9	7781	7743	7732	7732
q10	4530	4478	4088	4088
q11	617	427	373	373
q12	711	736	532	532
q13	3480	3639	3034	3034
q14	280	311	291	291
q15
q16	747	739	675	675
q17	1373	1335	1302	1302
q18	7950	7343	7098	7098
q19	1099	1077	1125	1077
q20	2229	2222	1941	1941
q21	5280	4647	4455	4455
q22	549	454	406	406
Total cold run time: 56713 ms
Total hot run time: 50885 ms

Contributor

@github-actions github-actions Bot left a comment

Automated review summary for PR 63322 at b2ec813.

I reviewed the full PR diff and the related Variant JSON parse/materialization flow, including BE parser changes, Variant serde raw-load behavior, decimal casts, CSV/hive text/JSON load paths, segment writer row-store ordering, and the new BE/regression tests. I also checked the existing inline review threads first and did not find any additional distinct issue to raise without duplicating already-known comments.

Critical checkpoint conclusions:

  • Goal/test coverage: The PR targets preserving high-precision JSON numeric tokens for predefined Variant decimal paths and row-store/load paths. It adds BE unit tests and a regression suite covering insert, stream load, CSV, hive text, row store, Decimal256, arrays, glob paths, nullable Variant, and prepared point-query behavior.
  • Scope/focus: The change is mostly focused on Variant decimal preservation, but several existing review threads already call out hot-path parsing/allocation and semantic edge cases that should be resolved before merge.
  • Concurrency/lifecycle: No new thread or lock lifecycle changes were identified in the modified paths. The existing global glob regex cache continues to use mutex protection.
  • Configuration/compatibility: No new config item or storage format field was added. The change affects BE ingestion/materialization behavior for Variant typed paths, so mixed behavior should be validated by the existing compatibility expectations around Variant parsing.
  • FE/BE path coverage: I checked JSON, CSV, hive text, segment writer, vertical segment writer, row-store serialization ordering, and typed-path materialization. No additional unreviewed FE-to-BE variable propagation issue was found.
  • Error handling/status: The newly touched Status-returning paths appear to propagate errors with RETURN_IF_ERROR/RETURN_IF_CATCH_EXCEPTION. No additional ignored Status issue was found.
  • Memory/performance: The existing review threads already cover the main memory/performance concern around raw-number preservation and recursive tree materialization; I did not add a duplicate comment.
  • Data correctness: The existing review threads already cover the major precision, nullable, doc-mode, escaped-key, and Decimal256 correctness risks. I did not find another distinct correctness issue in the remaining diff.
  • Observability: No new observability requirement stood out for this parsing/materialization change beyond existing errors/tests.

User focus points: No additional user-provided review focus was specified.

Overall: I am not submitting duplicate inline comments. Please address the already-open review threads; after those are resolved, the remaining reviewed changes did not reveal a separate blocker from this pass.

@eldenmoon dismissed github-actions[bot]’s stale review on May 17, 2026 16:49

Addressed by b2ec813: big_integer raw tokens are preserved without requiring Int128 conversion, and Decimal256 regression coverage was added.

@hello-stephen
Contributor

TPC-H: Total hot run time: 31453 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b2ec8138ab58c2542048d496e500a78733c50214, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17610	3863	3847	3847
q2
q3	10801	1346	812	812
q4	4689	479	342	342
q5	7605	2333	2078	2078
q6	238	180	139	139
q7	956	785	630	630
q8	9361	1803	1662	1662
q9	5209	4925	4915	4915
q10	6385	2072	1814	1814
q11	435	266	245	245
q12	624	424	299	299
q13	18134	3414	2779	2779
q14	266	253	238	238
q15
q16	820	772	712	712
q17	936	866	950	866
q18	6898	5765	5551	5551
q19	1313	1383	1094	1094
q20	528	406	367	367
q21	6266	2915	2740	2740
q22	463	376	323	323
Total cold run time: 99537 ms
Total hot run time: 31453 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4777	4539	4596	4539
q2
q3	4838	5259	4639	4639
q4	2122	2225	1407	1407
q5	4758	4830	4716	4716
q6	234	176	129	129
q7	1869	1805	1527	1527
q8	2374	2042	2030	2030
q9	7614	7484	7206	7206
q10	4489	4410	3998	3998
q11	523	383	347	347
q12	700	714	509	509
q13	3051	3376	2797	2797
q14	284	273	249	249
q15
q16	678	699	605	605
q17	1260	1256	1240	1240
q18	7507	6778	6871	6778
q19	1118	1090	1073	1073
q20	2200	2219	1944	1944
q21	5318	4682	4501	4501
q22	522	469	411	411
Total cold run time: 56236 ms
Total hot run time: 50645 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169135 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 45d9060c080f192ecb190462a4b01b6485daad2a, data reload: false

query5	4345	633	524	524
query6	344	215	204	204
query7	4308	568	299	299
query8	325	230	220	220
query9	8828	4060	4025	4025
query10	478	351	311	311
query11	5832	2375	2237	2237
query12	187	132	129	129
query13	1320	625	419	419
query14	5986	5345	5039	5039
query14_1	4378	4355	4361	4355
query15	215	203	180	180
query16	995	439	456	439
query17	1164	751	598	598
query18	2727	483	356	356
query19	232	218	167	167
query20	152	136	131	131
query21	221	142	118	118
query22	13605	13473	13341	13341
query23	17244	16193	15908	15908
query23_1	16300	16343	16467	16343
query24	7897	1790	1338	1338
query24_1	1327	1357	1333	1333
query25	609	513	471	471
query26	1315	345	179	179
query27	2673	552	344	344
query28	4437	1990	1926	1926
query29	1028	644	516	516
query30	309	247	202	202
query31	1117	1068	940	940
query32	89	77	81	77
query33	536	356	294	294
query34	1166	1136	650	650
query35	761	785	711	711
query36	1317	1341	1248	1248
query37	154	105	91	91
query38	3186	3161	3061	3061
query39	921	909	902	902
query39_1	871	872	865	865
query40	231	148	124	124
query41	65	63	61	61
query42	112	108	105	105
query43	321	318	286	286
query44	
query45	206	201	191	191
query46	1045	1209	718	718
query47	2321	2315	2180	2180
query48	401	405	274	274
query49	653	488	378	378
query50	940	357	257	257
query51	4296	4249	4194	4194
query52	105	105	92	92
query53	248	286	207	207
query54	307	264	247	247
query55	93	88	87	87
query56	318	312	322	312
query57	1403	1376	1293	1293
query58	291	270	255	255
query59	1554	1604	1406	1406
query60	313	322	300	300
query61	160	159	153	153
query62	658	631	556	556
query63	251	205	215	205
query64	2390	803	621	621
query65	
query66	1673	473	344	344
query67	29963	29917	29797	29797
query68	
query69	467	374	303	303
query70	983	998	958	958
query71	305	272	269	269
query72	3181	2637	2438	2438
query73	848	747	454	454
query74	5038	4890	4748	4748
query75	2668	2585	2250	2250
query76	2278	1131	757	757
query77	410	408	333	333
query78	12132	12020	11601	11601
query79	1486	1015	697	697
query80	1340	553	460	460
query81	524	286	238	238
query82	1026	153	119	119
query83	355	269	249	249
query84	263	142	114	114
query85	916	534	460	460
query86	450	334	313	313
query87	3469	3387	3199	3199
query88	3488	2677	2656	2656
query89	450	387	334	334
query90	1920	170	175	170
query91	178	177	145	145
query92	78	79	73	73
query93	1558	1397	848	848
query94	735	327	305	305
query95	695	390	446	390
query96	984	776	323	323
query97	2676	2667	2553	2553
query98	236	230	223	223
query99	1121	1080	981	981
Total cold run time: 254662 ms
Total hot run time: 169135 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169664 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b2ec8138ab58c2542048d496e500a78733c50214, data reload: false

query5	4313	653	512	512
query6	335	218	193	193
query7	4309	568	305	305
query8	324	233	226	226
query9	8814	3969	4001	3969
query10	443	345	293	293
query11	5785	2438	2223	2223
query12	179	129	123	123
query13	1271	636	446	446
query14	5982	5352	5043	5043
query14_1	4398	4297	4331	4297
query15	213	207	181	181
query16	987	433	426	426
query17	938	707	585	585
query18	2446	507	347	347
query19	208	201	152	152
query20	136	132	130	130
query21	212	136	119	119
query22	13723	13770	13339	13339
query23	17210	16383	16012	16012
query23_1	16176	16112	16232	16112
query24	7414	1755	1308	1308
query24_1	1292	1299	1311	1299
query25	579	483	413	413
query26	1313	311	170	170
query27	2708	563	337	337
query28	4525	1946	1939	1939
query29	1000	615	494	494
query30	305	239	199	199
query31	1124	1060	939	939
query32	86	73	69	69
query33	543	349	296	296
query34	1156	1115	638	638
query35	775	786	678	678
query36	1327	1308	1162	1162
query37	150	103	91	91
query38	3207	3132	3069	3069
query39	933	926	888	888
query39_1	898	886	867	867
query40	224	146	123	123
query41	66	62	66	62
query42	112	110	111	110
query43	319	326	286	286
query44	
query45	208	205	192	192
query46	1090	1177	720	720
query47	2333	2378	2235	2235
query48	413	422	284	284
query49	644	482	386	386
query50	911	343	255	255
query51	4317	4295	4276	4276
query52	104	103	94	94
query53	253	279	209	209
query54	306	266	264	264
query55	93	90	91	90
query56	302	299	299	299
query57	1455	1402	1324	1324
query58	295	268	270	268
query59	1584	1605	1440	1440
query60	326	324	331	324
query61	180	181	183	181
query62	712	619	564	564
query63	244	199	204	199
query64	2441	799	671	671
query65	
query66	1711	483	354	354
query67	30082	29352	29297	29297
query68	
query69	450	340	296	296
query70	1064	955	979	955
query71	310	270	259	259
query72	3027	2703	2505	2505
query73	847	760	439	439
query74	5104	4967	4778	4778
query75	2695	2621	2297	2297
query76	2279	1175	791	791
query77	394	419	341	341
query78	12190	12097	11548	11548
query79	1223	1013	748	748
query80	616	589	495	495
query81	457	278	246	246
query82	239	166	123	123
query83	281	282	254	254
query84	292	143	113	113
query85	948	629	532	532
query86	359	365	324	324
query87	3379	3363	3218	3218
query88	3507	2685	2667	2667
query89	413	393	339	339
query90	2202	178	179	178
query91	180	168	141	141
query92	77	78	73	73
query93	1439	1449	845	845
query94	541	348	321	321
query95	656	471	353	353
query96	1069	809	329	329
query97	2713	2688	2609	2609
query98	241	236	230	230
query99	1129	1087	943	943
Total cold run time: 251826 ms
Total hot run time: 169664 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 71.58% (330/461) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.51% (20661/38609)
Line Coverage 37.17% (195329/525461)
Region Coverage 33.56% (152929/455729)
Branch Coverage 34.60% (66729/192847)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 78.53% (373/475) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.62% (27837/37810)
Line Coverage 57.60% (301865/524098)
Region Coverage 54.90% (252633/460156)
Branch Coverage 56.38% (109135/193587)
