
feat: add native support for get_json_object expression #3747

Draft

andygrove wants to merge 6 commits into apache:main from andygrove:get-json-object

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #3162.

Rationale for this change

get_json_object is a widely-used Spark function for extracting values from JSON strings using JSONPath expressions. Without native support, queries using this function fall back to Spark's JVM execution. This PR adds an initial native implementation to allow Comet to accelerate these queries.

This is a starting point. The expression is marked Incompatible and is disabled by default. Users must set spark.comet.expression.GetJsonObject.allowIncompatible=true to enable it.

What changes are included in this PR?

Rust implementation (native/spark-expr/src/string_funcs/get_json_object.rs):

  • Custom JSONPath parser supporting $ (root), .field, ['field'] (bracket notation), [n] (array index), and [*] (array wildcard)
  • Path evaluation with separate fast-path for non-wildcard paths (zero Vec allocations) and wildcard paths
  • Uses serde_json with preserve_order feature for Spark-compatible key ordering
  • 19 unit tests
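
The supported path grammar can be sketched as a small hand-rolled parser. The names below (`PathSegment`, `parse_path`) are illustrative, not the identifiers used in this PR:

```rust
// Sketch of a parser for the JSONPath subset listed above.
// `PathSegment` and `parse_path` are hypothetical names.
#[derive(Debug, PartialEq)]
enum PathSegment {
    Field(String), // .name or ['name']
    Index(usize),  // [n]
    Wildcard,      // [*]
}

fn parse_path(path: &str) -> Option<Vec<PathSegment>> {
    // every path must start at the root: $
    let mut rest = path.strip_prefix('$')?;
    let mut segments = Vec::new();
    while !rest.is_empty() {
        if let Some(r) = rest.strip_prefix('.') {
            // dot notation: field name runs until the next '.' or '['
            let end = r.find(|c| c == '.' || c == '[').unwrap_or(r.len());
            if end == 0 {
                return None; // empty field name, e.g. "$.."
            }
            segments.push(PathSegment::Field(r[..end].to_string()));
            rest = &r[end..];
        } else if let Some(r) = rest.strip_prefix('[') {
            let close = r.find(']')?;
            let inner = &r[..close];
            if inner == "*" {
                segments.push(PathSegment::Wildcard);
            } else if let Some(name) =
                inner.strip_prefix('\'').and_then(|s| s.strip_suffix('\''))
            {
                // bracket notation: ['field']
                segments.push(PathSegment::Field(name.to_string()));
            } else {
                // array index: [n]
                segments.push(PathSegment::Index(inner.parse().ok()?));
            }
            rest = &r[close + 1..];
        } else {
            return None; // unexpected character
        }
    }
    Some(segments)
}

fn main() {
    let segs = parse_path("$.items[0][*]['name']").unwrap();
    assert_eq!(segs.len(), 4);
    println!("{segs:?}");
}
```

Invalid paths return `None` so the expression can fall back to producing null, matching Spark's behavior for malformed paths.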

Scala serde (spark/src/main/scala/org/apache/comet/serde/strings.scala):

  • CometGetJsonObject with getSupportLevel returning Incompatible (Spark's Jackson parser allows single-quoted JSON and unescaped control characters that serde_json does not)

Registration and wiring:

  • Added to stringExpressions map in QueryPlanSerde.scala
  • Registered in comet_scalar_funcs.rs via scalarFunctionExprToProtoWithReturnType

SQL tests (get_json_object.sql): 30 test queries covering field extraction, nested objects, arrays, wildcards, nulls, invalid JSON, bracket notation, edge cases.

Docs: Updated expressions.md and spark_expressions_support.md.

Current performance

Benchmarked with 1M rows of JSON (~200 bytes each) on Apple M3 Ultra:

Case                           Spark (ms)   Comet (ms)   Relative
Simple field ($.name)          705          785          0.9X
Numeric field ($.age)          725          789          0.9X
Nested field ($.address.city)  773          805          1.0X
Array element ($.items[0])     734          795          0.9X
Nested object ($.address)      869          926          0.9X

Comet is currently ~10% slower than Spark. The primary reason is that serde_json parses the full JSON document into a DOM tree on every row, while Spark's Jackson-based implementation uses a streaming parser that can skip irrelevant fields without allocating.
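
For reference, the zero-allocation fast path for non-wildcard paths amounts to walking the parsed segments by reference. A toy sketch, using a simplified stand-in for serde_json's value type (names are hypothetical, not the PR's actual code):

```rust
// Toy model of the non-wildcard fast path: walk the parsed segments by
// reference and return Option<&Value>, allocating no intermediate Vec.
// `Value` is a simplified stand-in for serde_json::Value.
#[derive(Debug, PartialEq)]
enum Value {
    Num(f64),
    Str(String),
    Arr(Vec<Value>),
    Obj(Vec<(String, Value)>), // order-preserving, like preserve_order
}

enum Segment<'a> {
    Field(&'a str),
    Index(usize),
}

fn evaluate_no_wildcard<'v>(root: &'v Value, path: &[Segment]) -> Option<&'v Value> {
    let mut cur = root;
    for seg in path {
        cur = match (seg, cur) {
            (Segment::Field(name), Value::Obj(entries)) => entries
                .iter()
                .find(|(k, _)| k.as_str() == *name)
                .map(|(_, v)| v)?,
            (Segment::Index(i), Value::Arr(items)) => items.get(*i)?,
            // type mismatch (e.g. indexing an object) yields null, as in Spark
            _ => return None,
        };
    }
    Some(cur)
}

fn main() {
    let doc = Value::Obj(vec![
        ("name".to_string(), Value::Str("comet".to_string())),
        ("items".to_string(), Value::Arr(vec![Value::Num(1.0), Value::Num(2.0)])),
    ]);
    let path = [Segment::Field("items"), Segment::Index(1)];
    assert_eq!(evaluate_no_wildcard(&doc, &path), Some(&Value::Num(2.0)));
}
```

Even with this fast path, the cost of building the DOM in the first place dominates, which is why the streaming parser below is the main follow-up opportunity.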

Known limitations and future work

This is an initial implementation. Known gaps that could be addressed in follow-up PRs:

  1. Streaming JSON parser: Replace serde_json::from_str (full DOM parse) with a streaming approach (e.g., jiter or custom serde_json::Deserializer with IgnoredAny) to skip irrelevant JSON content without allocating. This would likely close the performance gap with Spark.
  2. $.* on arrays: Spark distinguishes $.* (object wildcard, using Wildcard token) from $[*] (array wildcard, using Subscript::Wildcard). Our parser treats both as the same Wildcard segment. Currently $.* on arrays returns values in Comet but null in Spark.
  3. Double wildcard flattening: Spark's $[*][*] triggers FlattenStyle which flattens nested arrays. Our implementation doesn't handle this special case.
  4. Single wildcard match after index: For patterns like $.arr[0][*].field, Spark's WriteStyle state machine may produce different wrapping behavior than our count-based approach.
  5. preserve_order is workspace-wide: Cargo unifies features, so enabling preserve_order on serde_json in spark-expr also enables it for all other crates in the workspace. Could be addressed by isolating the JSON parsing behind a feature flag.
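
To illustrate item 1, a streaming scanner can locate one field and skip everything else without building a DOM. This is a heavily simplified sketch (top-level fields only, assumes well-formed input), not the proposed implementation:

```rust
// Minimal illustration of the streaming idea: scan the input once,
// skipping values for keys we don't need instead of building a DOM.
// Assumes well-formed JSON; a real implementation needs full error
// handling and nested path support.

/// Return the index just past the JSON value starting at `i`.
fn skip_value(bytes: &[u8], mut i: usize) -> usize {
    match bytes[i] {
        b'"' => {
            // string: advance past the closing quote, honoring escapes
            i += 1;
            while bytes[i] != b'"' {
                if bytes[i] == b'\\' {
                    i += 1;
                }
                i += 1;
            }
            i + 1
        }
        b'{' | b'[' => {
            // container: count nesting depth, skipping over strings
            let mut depth = 0;
            loop {
                match bytes[i] {
                    b'{' | b'[' => depth += 1,
                    b'}' | b']' => {
                        depth -= 1;
                        if depth == 0 {
                            return i + 1;
                        }
                    }
                    b'"' => i = skip_value(bytes, i) - 1,
                    _ => {}
                }
                i += 1;
            }
        }
        _ => {
            // number / true / false / null: run to the next delimiter
            while i < bytes.len() && !matches!(bytes[i], b',' | b'}' | b']') {
                i += 1;
            }
            i
        }
    }
}

/// Extract a top-level field's value without parsing sibling values.
fn get_top_level(json: &str, key: &str) -> Option<String> {
    let bytes = json.as_bytes();
    let mut i = json.find('{')? + 1;
    loop {
        // skip whitespace and commas between entries
        while i < bytes.len() && matches!(bytes[i], b' ' | b',' | b'\n' | b'\t') {
            i += 1;
        }
        if i >= bytes.len() || bytes[i] == b'}' {
            return None; // key not found
        }
        let key_end = skip_value(bytes, i); // key is a JSON string
        let k = &json[i + 1..key_end - 1];
        i = key_end;
        while bytes[i] != b':' {
            i += 1;
        }
        i += 1;
        while bytes[i] == b' ' {
            i += 1;
        }
        let val_end = skip_value(bytes, i);
        if k == key {
            // strip quotes from string values, as get_json_object does
            return Some(json[i..val_end].trim_matches('"').to_string());
        }
        i = val_end; // not our key: the value was skipped, never parsed
    }
}

fn main() {
    let row = r#"{"name": "comet", "age": 3}"#;
    assert_eq!(get_top_level(row, "name"), Some("comet".to_string()));
    assert_eq!(get_top_level(row, "missing"), None);
}
```

Crates like jiter package this skip-ahead pattern behind a safe API, which is why they are the more likely route than a hand-rolled scanner.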

How are these changes tested?

  • 19 Rust unit tests covering path parsing and evaluation edge cases
  • 30 SQL-file-based tests (CometSqlFileTestSuite) that run each query through both Spark and Comet and compare results, with dictionary encoding on/off
  • Microbenchmark (CometGetJsonObjectBenchmark) comparing Spark vs Comet performance across 5 query patterns

Implement the Spark GetJsonObject expression natively using serde_json
for JSON parsing and a custom JSONPath evaluator supporting field access,
array indexing, bracket notation, and wildcards. Closes apache#3162.

Mark as Incompatible since Spark's Jackson parser allows single-quoted
JSON and unescaped control characters which serde_json does not support.
Add allowIncompatible config to SQL test file.

- Enable serde_json preserve_order feature to maintain JSON key ordering
- Fix wildcard to only work on arrays (not objects), matching Spark
- Fix single wildcard match to preserve JSON string quoting
- Add user-facing docs in expressions.md
- Add more SQL tests: object wildcard, single match, missing fields,
  invalid paths, field names with special chars, key ordering
- Add Rust unit tests for new edge cases

Benchmarks simple field, numeric field, nested field, array element,
and nested object extraction with 1M rows of JSON data.

- Move StringBuilder import to top-level imports
- Fix doc comment to not mention object wildcard (unsupported)
- Pre-compute has_wildcard in ParsedPath struct (avoids per-row scan)
- Split evaluation into evaluate_no_wildcard (returns Option, zero Vec
  allocations) and evaluate_with_wildcard (returns Vec)
- Simplify single wildcard match: serialize directly instead of
  array-wrap-then-strip-brackets hack
- Add comment in Cargo.toml explaining preserve_order requirement
@andygrove andygrove marked this pull request as draft March 20, 2026 17:26
