[fix](be) Preserve Variant decimal precision#63322
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Pull request overview
This PR fixes precision loss for Variant typed decimal extraction by preserving high-precision JSON numeric tokens as exact text during parsing, and ensuring typed-path casts keep Variant’s nullable/array nullability semantics. It also adds targeted unit/regression coverage for high-precision decimals in both scalar and array paths.
Changes:
- Preserve raw JSON numeric tokens (via simdjson on-demand parsing) and materialize precision-sensitive numbers as strings for later exact decimal casts.
- Preserve Variant nullability (including nested array nullability) when converting typed paths to storage types.
- Add unit + regression tests covering high-precision decimals in Variant scalar and array typed paths.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/variant_p0/predefine/test_variant_high_precision_decimal.groovy | New regression test validating high-precision decimal preservation for Variant typed paths (scalar + array). |
| be/test/core/jsonb/json_parser_test.cpp | Unit tests asserting high-precision numeric tokens are preserved as string fields (including inside arrays). |
| be/test/core/column/column_variant_test.cpp | Unit test validating typed-path conversion + serialization preserves decimal text/precision for scalars and arrays. |
| be/src/util/json/simd_json_parser.h | Refactors SimdJSONParser to build an internal node tree from simdjson on-demand and exposes raw number tokens. |
| be/src/util/json/json_parser.h | Adds heuristic to preserve precision-sensitive doubles as strings based on raw token digits. |
| be/src/exprs/function/cast/cast_to_decimal.h | Enhances string→decimal cast to accept nullable string columns and propagate nulls. |
| be/src/core/column/column_variant.cpp | Preserves Variant nullability when casting typed paths to declared storage types (incl. arrays). |
Comments suppressed due to low confidence (1)
be/src/util/json/json_parser.h:30
- json_parser.h introduces should_preserve_json_number_as_string(std::string_view ...) but the header does not include <string_view>. Relying on transitive includes is brittle; add #include <string_view> to ensure this header compiles standalone.
#include <parallel_hashmap/phmap.h>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
I found one blocking correctness regression. The PR addresses typed Variant decimal precision for the tested path, but the parser change currently applies to all Variant subcolumn extraction and changes some JSON numbers into strings when the path is not declared as a typed decimal path.
Critical checkpoints:
- Goal/test: The PR fixes the covered typed decimal path case and adds BE/regression coverage for that case, but misses untyped Variant numeric semantics.
- Scope/focus: The implementation is broader than the stated goal because generic JSON parsing now changes inferred Field type based on token precision.
- Concurrency/lifecycle: No new shared concurrency or non-obvious lifecycle issue identified in the reviewed paths.
- Compatibility/behavior: Behavior changes for existing Variant data: high-precision JSON numeric tokens on untyped paths can become strings. This is user-visible and not covered by compatibility/tests.
- Parallel paths: Doc-value/jsonb writing still treats doubles as doubles; the subcolumn path is the one affected by the string conversion.
- Tests: Added tests cover typed decimal conversion, but no negative/regression coverage proves ordinary untyped numbers remain numeric.
- Performance/memory/observability: No additional blocking issue found beyond the correctness problem above.
User focus: No additional user-provided review focus was specified.
ecba866 to
9ccf680
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one additional correctness issue beyond the existing review threads.
Critical checkpoint conclusions:
- Goal/test coverage: The PR aims to preserve high-precision JSON numbers for Variant decimal typed paths and adds BE/regression coverage, but scientific-notation decimal tokens remain uncovered and can still lose precision.
- Scope/focus: The change is mostly focused on Variant JSON parsing and typed-path decimal conversion, though the simdjson wrapper rewrite is broader than the local fix.
- Concurrency/lifecycle: No new shared mutable runtime state or lock-order concern found in the changed path. Parser-owned node lifetimes are local to each parse call.
- Configuration/compatibility: No new config or storage-format change found. Existing typed path schema metadata is used to drive parsing.
- Parallel paths: Typed decimal preservation is wired into normal Variant ingest, but the heuristic misses exponent notation on the same path.
- Error handling/memory safety: No ignored Status issue found in the reviewed changes. The parser rewrite adds extra allocations, but I did not find a separate blocking memory-safety issue.
- Data correctness: Blocking issue below: some valid decimal typed-path inputs still round through double before decimal conversion.
- Observability: No additional observability requirement identified for this local parsing/casting fix.
User focus: No additional user-provided review focus was specified.
9ccf680 to
832ca14
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one additional correctness issue. The existing review threads already cover the raw-number heuristic/scientific-notation precision cases and the missing self-contained <cassert> include, so I did not duplicate those.
Critical checkpoint conclusions:
- Goal/test: the PR aims to preserve high-precision decimal JSON tokens for Variant predefined decimal paths; it adds non-doc-mode unit/regression coverage, but doc-mode predefined typed paths are not covered and still lose precision.
- Scope: the change is focused, but the fix is incomplete across the parallel doc-mode write path.
- Concurrency/lifecycle: no new shared mutable concurrency or special lifecycle issue found in the reviewed change.
- Config/compatibility: no new external config or storage-format compatibility change found.
- Parallel paths: doc-mode Variant materialization is a parallel path and needs the same decimal-preservation setup.
- Tests: add a doc-mode predefined decimal regression case, preferably under
variant_p0/doc_mode/predefine, to prove the fix. - Observability/transactions/data writes: no additional logging or transaction/persistence concern found beyond the data-correctness issue below.
- Performance: the parser now materializes/copies the JSON tree to retain raw tokens; this is worth watching on insert hot paths, but I did not mark it blocking without benchmark evidence.
User focus: no additional user-provided review focus was supplied.
832ca14 to
43b1a14
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
I completed a full pass over the PR diff and the existing inline review context. I am requesting changes based on the already-open inline threads rather than adding duplicate comments.
Summary opinion: the PR is targeted at preserving high-precision JSON decimal tokens for predefined Variant decimal paths, and the current non-doc-mode path is closer to the goal. However, the implementation is still incomplete for doc-mode predefined Variant columns, where the parser is configured as OnlyDocValueColumn and the decimal raw-number preserve paths are still not installed. be/src/util/json/simd_json_parser.h also still uses assert() without including <cassert>, so the header remains non-self-contained. I did not find additional distinct issues beyond the existing threads.
Critical checkpoint conclusions:
- Goal and tests: the non-doc-mode typed decimal case has regression/unit coverage, but doc-mode typed decimal coverage is still missing and the doc-mode path does not accomplish the precision-preservation goal.
- Scope/focus: the functional change is mostly focused, though the simdjson wrapper refactor is larger than the minimal parser hook; no separate blocking issue found beyond the existing header/self-containment concern.
- Concurrency: no new shared mutable cross-thread state or lock interaction was introduced in the reviewed paths.
- Lifecycle/static initialization: no new static initialization dependency was found. Parser element lifetimes are local to parsing/traversal and copied into the new tree before the padded input is destroyed.
- Configuration: no new configuration items were added.
- Compatibility: no storage-format or FE-BE protocol change was found. The remaining compatibility risk is behavioral inconsistency for doc-mode predefined decimal Variant paths.
- Parallel paths: doc-mode is the parallel insert/materialization path that still needs the same decimal preserve-path setup.
- Conditional checks: the typed-path preservation condition is now path-gated; no additional distinct condition issue found.
- Test coverage/results: tests cover non-doc mode, arrays, glob matching, and scientific notation, but not doc-mode predefined decimals.
- Observability: no new observability requirement identified for this parser/cast fix.
- Transactions/persistence/data writes: no new transaction or persistence metadata path was introduced.
- Performance: the JSON parser wrapper now materializes a copied node tree, but I did not identify a concrete correctness blocker from this beyond the existing issues.
- User focus: no additional user-provided review focus was specified.
43b1a14 to
7885a6c
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one additional correctness issue in the nullable string-to-decimal cast path. Existing inline threads already cover the remaining blockers around doc-mode typed decimal preservation and parser/header concerns, so I did not duplicate those comments.
Critical checkpoint conclusions:
- Goal/test: The PR aims to preserve exact JSON numeric text for predefined decimal Variant typed paths and adds unit/regression coverage, but doc-mode typed paths remain covered by an existing unresolved thread and nullable decimal conversion needs more coverage.
- Scope: The change is mostly focused, but the parser replacement adds broad allocation/copying behavior; no separate blocker found beyond the submitted issue and existing comments.
- Concurrency/lifecycle: No new shared concurrent state or non-obvious lifecycle risk identified; parser result references are copied into the owned tree before traversal.
- Config/compatibility: No new config or storage format incompatibility identified.
- Parallel paths: Non-doc-mode path is configured; doc-mode path remains an existing unresolved review concern.
- Error handling/data correctness: The new strict nullable cast path can misparse rows because skipped null rows do not advance string offsets in the decimal serde.
- Tests: Added tests cover high-precision predefined decimals in non-doc mode, arrays, globs, and exponent cases; missing coverage remains for doc mode (already raised) and nullable/missing typed decimal rows.
- Observability/performance: No new observability requirement identified; parser memory overhead is higher but not raised as a blocking issue.
User focus: No additional user-provided review focus was specified.
TPC-H: Total hot run time: 31288 ms |
TPC-H: Total hot run time: 31294 ms |
TPC-H: Total hot run time: 31205 ms |
7885a6c to
85fcf56
Compare
|
run buildall |
|
/review |
TPC-H: Total hot run time: 30897 ms |
|
run buildall |
There was a problem hiding this comment.
Automated review completed for PR 63322 at 95ccafea2253112ebae2ad285d71df8be7689d9b.
I reviewed the current diff and the existing inline review threads first. I did not find any additional distinct blocking issue to raise as a new inline comment. Previously raised topics around nullable Variant load handling, raw-number parser overhead, escaped object keys, nullable decimal casts, doc-mode decimal handling, exponent-form decimals, untyped Variant number behavior, and parser header self-containment were treated as already-known review context and were not duplicated.
Critical checkpoint conclusions:
- Goal and tests: The PR aims to preserve exact JSON numeric tokens for Variant predefined decimal typed paths, including stream-load paths, arrays, glob typed paths, nullable casts, and scientific notation. The added BE unit tests and regression suite cover the main intended non-doc-mode paths.
- Scope and focus: The change is focused on Variant JSON parsing/materialization and decimal casting. The larger parser refactor is now gated by decimal preserve configuration rather than applied to all parses.
- Concurrency: No new shared mutable runtime state was added except reuse of existing glob regex cache locking. The parser changes use per-parser owned state and do not introduce new concurrent access patterns.
- Lifecycle: The ondemand parser path copies parsed values into owned
Nodestorage before returning parse results, so no lifetime issue was identified in the reviewed current code. - Configuration: No new runtime configuration item was added.
- Compatibility: No persisted storage format or FE-BE protocol layout change was identified. Doc-mode predefined decimal is covered as unsupported by the added negative regression case.
- Parallel code paths: Stream-load and SQL/materialization paths were both considered; the current tests cover object-form and string-encoded stream load rows.
- Conditional checks: The decimal raw-number preservation is now path-configured and limited to typed decimal paths or glob matches; I did not identify a new incorrect condition.
- Test coverage: Coverage is reasonable for the targeted behavior. I did not run tests in this review runner.
- Test results: The new regression case uses assertions rather than a handwritten
.out; the modified.outreflects an existing suite output change. - Observability: No new observability requirement was identified for this parser/cast correctness fix.
- Transactions and persistence: The reviewed changes do not modify transaction, edit-log, publish, or rowset visibility logic.
- Data writes and modifications: The write-path changes preserve Variant materialization for typed decimal paths; no new atomicity issue was identified beyond previously reviewed concerns.
- FE/BE variables: No new FE-to-BE variable passing was added.
- Performance: The expensive ondemand materialization path is gated by typed decimal preservation. I did not find an additional performance issue distinct from the existing review context.
- Other issues: No additional distinct issue found.
User focus points: No additional user-provided review focus was specified.
TPC-H: Total hot run time: 31380 ms |
TPC-H: Total hot run time: 30935 ms |
TPC-DS: Total hot run time: 170186 ms |
95ccafe to
1e5d7d3
Compare
|
run buildall |
|
/review |
TPC-DS: Total hot run time: 168730 ms |
There was a problem hiding this comment.
Review summary: I did not find additional distinct blocking issues in the current head after considering the existing inline review context and avoiding duplicate comments.
Critical checkpoint conclusions:
- Goal and tests: The PR targets preserving high-precision JSON numeric tokens for Variant decimal typed paths and delaying raw Variant JSON parsing until tablet-schema typed-path configuration is available. The added BE unit tests and regression suite cover typed decimals, scientific notation, CSV/JSON stream load, nullable Variant rows, sparse typed paths, row-store reads, and doc-mode rejection.
- Scope and focus: The change is mostly localized to Variant parsing/materialization, decimal string casts, load readers, and segment-writer materialization order.
- Concurrency: No new shared mutable runtime state is introduced on the write path. The existing glob regex cache remains mutex-protected; parser instances are per parser object/use.
- Lifecycle/static initialization: No new cross-translation-unit static initialization dependency was found. The new parser tree storage is owned by the parser instance and consumed before the next parse.
- Configuration: No new user-facing config item is added. Existing Variant/doc-mode behavior is explicitly rejected for decimal typed paths.
- Compatibility/storage format: No storage format change was identified; the change affects how incoming raw JSON is materialized before existing storage conversion.
- Parallel paths: JSON stream load, CSV/text load via shared reader paths, segment writer, vertical segment writer, row store, and partial-update paths were checked at a code-review level.
- Tests/results: Added tests follow deterministic assertions and drop tables before use. I did not run the full regression or BE unit suites in this runner.
- Observability: No new long-running distributed operation was added; existing errors and Status propagation appear sufficient for this scoped change.
- Transaction/persistence/data correctness: No EditLog or transaction metadata change is involved. The important data-correctness path is Variant materialization before row-store/column conversion, which is now ordered before serialization in the modified writer paths.
- Performance/memory: The current head only switches to raw-number-preserving ondemand parsing when decimal typed paths or a matcher are configured, which avoids the unconditional hot-path cost noted in earlier review context.
Focus-point response: No additional user-provided review focus was supplied.
TPC-H: Total hot run time: 31342 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
1e5d7d3 to
45d9060
Compare
|
/review |
|
run buildall |
TPC-DS: Total hot run time: 169675 ms |
There was a problem hiding this comment.
I found one additional correctness issue beyond the existing review threads.
Critical checkpoint conclusions:
- Goal/test: the PR aims to preserve high-precision JSON numbers for Variant typed decimal paths across load and row-store paths. The added tests cover Decimal128-scale decimals, scientific notation, nullable casts, JSON/CSV/Hive text, and row-store, but miss Decimal256 integer tokens beyond Int128.
- Scope/clarity: the changes are mostly focused on schema-driven Variant materialization and row-store ordering.
- Concurrency/lifecycle: no new concurrent state or non-obvious lifecycle issue found in the changed paths.
- Config/compatibility: no new dynamic config or storage format incompatibility found; doc-mode decimal typed paths are rejected rather than silently losing precision.
- Parallel paths: JSON, CSV, Hive text, segment writer, and vertical segment writer paths were checked; the Decimal256 gap applies to all paths that enable raw-number preservation.
- Error handling/data correctness: the new ondemand raw-number path can reject valid Decimal256 typed-path input before decimal casting, causing incorrect load behavior.
- Test coverage: missing a typed
decimalv3(76,0)/Decimal256 integer larger than Int128. - Observability/transactions/persistence/FE-BE variables: not materially affected by this PR.
- User focus: no additional user-provided review focus was specified.
### What problem does this PR solve?
Issue Number: CIR-20240
Related PR: None
Problem Summary: Variant JSON ingestion parsed high precision decimal-like numbers as double before predefined decimal typed paths were cast, which rounded values and made decimal predicates miss exact JSON input. Preserve raw JSON number tokens only for Variant predefined decimal typed paths during subcolumn materialization, including glob typed-path patterns and precision-sensitive exponent notation. JSON, CSV, and text load paths now keep Variant raw JSON text until the segment writer can apply tablet-schema typed path config, while row-store columns are serialized after schema-aware Variant materialization. Untyped and doc-value parsing keep numeric behavior unchanged, and nullable string inputs advance offsets correctly in strict decimal batch parsing.
### Release note
Fix precision loss when loading high precision JSON numbers into Variant predefined decimal typed paths.
### Check List (For Author)
- Test: Regression test / Unit Test / Manual test
- Unit Test: ./run-be-ut.sh --run --filter='JsonParserTest.*:ColumnVariantTest.parse_high_precision_decimal_for_typed_path:ColumnVariantTest.cast_nullable_string_to_decimal_skips_null_payload'
- Regression test: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY NO_PROXY=127.0.0.1,localhost,10.26.20.3 no_proxy=127.0.0.1,localhost,10.26.20.3 ./run-regression-test.sh --conf tmp/regression-conf.auto.groovy --run -d variant_p0/predefine -s test_variant_high_precision_decimal
- Build: BUILD_TYPE=ASAN USE_MEM_TRACKER=ON ./build.sh --be --fe
- Format Check: PATH=/tmp/doris-codex-clang:$PATH build-support/clang-format.sh; PATH=/tmp/doris-codex-clang:$PATH build-support/check-format.sh; git diff --check; git diff --cached --check
- Static Check: build-support/run-clang-tidy.sh could not run because clang-tidy was not found in the local environment
- Behavior changed: Yes. Variant predefined decimal typed paths preserve exact high precision JSON number text for decimal casts during subcolumn materialization and JSON/CSV/text load. Row-store serialization for direct/schema-change writes now sees materialized Variant values. Untyped and doc-value parsing keep regular numeric behavior. Decimal typed paths in doc mode remain unsupported by FE and are covered by an expected-error regression.
- Does this need documentation: No
45d9060 to
b2ec813
Compare
|
/review |
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 30776 ms |
There was a problem hiding this comment.
Automated review summary for PR 63322 at b2ec813.
I reviewed the full PR diff and the related Variant JSON parse/materialization flow, including BE parser changes, Variant serde raw-load behavior, decimal casts, CSV/hive text/JSON load paths, segment writer row-store ordering, and the new BE/regression tests. I also checked the existing inline review threads first and did not find any additional distinct issue to raise without duplicating already-known comments.
Critical checkpoint conclusions:
- Goal/test coverage: The PR targets preserving high-precision JSON numeric tokens for predefined Variant decimal paths and row-store/load paths. It adds BE unit tests and a regression suite covering insert, stream load, CSV, hive text, row store, Decimal256, arrays, glob paths, nullable Variant, and prepared point-query behavior.
- Scope/focus: The change is mostly focused on Variant decimal preservation, but several existing review threads already call out hot-path parsing/allocation and semantic edge cases that should be resolved before merge.
- Concurrency/lifecycle: No new thread or lock lifecycle changes were identified in the modified paths. The existing global glob regex cache continues to use mutex protection.
- Configuration/compatibility: No new config item or storage format field was added. The change affects BE ingestion/materialization behavior for Variant typed paths, so mixed behavior should be validated by the existing compatibility expectations around Variant parsing.
- FE/BE path coverage: I checked JSON, CSV, hive text, segment writer, vertical segment writer, row-store serialization ordering, and typed-path materialization. No additional unreviewed FE-to-BE variable propagation issue was found.
- Error handling/status: The newly touched Status-returning paths appear to propagate errors with RETURN_IF_ERROR/RETURN_IF_CATCH_EXCEPTION. No additional ignored Status issue was found.
- Memory/performance: The existing review threads already cover the main memory/performance concern around raw-number preservation and recursive tree materialization; I did not add a duplicate comment.
- Data correctness: The existing review threads already cover the major precision, nullable, doc-mode, escaped-key, and Decimal256 correctness risks. I did not find another distinct correctness issue in the remaining diff.
- Observability: No new observability requirement stood out for this parsing/materialization change beyond existing errors/tests.
User focus points: No additional user-provided review focus was specified.
Overall: I am not submitting duplicate inline comments. Please address the already-open review threads; after those are resolved, the remaining reviewed changes did not reveal a separate blocker from this pass.
Addressed by b2ec813: big_integer raw tokens are preserved without requiring Int128 conversion, and Decimal256 regression coverage was added.
TPC-H: Total hot run time: 31453 ms |
TPC-DS: Total hot run time: 169135 ms |
TPC-DS: Total hot run time: 169664 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: CIR-20240
Related PR: None
Problem Summary: Variant JSON parsing materialized high precision decimal-like numbers as double before typed decimal paths were cast, which lost precision. Preserve raw numeric tokens for decimal-sensitive Variant path extraction and cast typed decimal paths from exact text while keeping Variant nullable array structure.
Release note
Fix Variant typed decimal precision loss when parsing high precision JSON numbers.
Check List (For Author)