
fix: Add schema validation for native_datafusion Parquet scan#3759

Open
vaibhawvipul wants to merge 7 commits into apache:main from vaibhawvipul:issue-3720

Conversation

@vaibhawvipul (Contributor) commented Mar 22, 2026

When spark.comet.scan.impl=native_datafusion, DataFusion's Parquet reader silently coerces incompatible types instead of erroring like Spark does.

Which issue does this PR close?

Closes #3720.

Rationale for this change

DataFusion is more permissive than Spark when reading Parquet files with mismatched schemas. For example, reading an INT32 column as bigint, or TimestampLTZ as TimestampNTZ, silently succeeds in DataFusion but should throw SchemaColumnConvertNotSupportedException per Spark's behavior. This breaks correctness guarantees that Spark users rely on.

What changes are included in this PR?

Adds schema compatibility validation in schema_adapter.rs:

  • validate_spark_schema_compatibility() checks each logical field against its physical counterpart when a file is opened
  • is_spark_compatible_read() defines the allowlist of valid Parquet-to-Spark type conversions (matching TypeUtil's logic)
  • Incompatible reads now produce errors in "Column: [name], Expected: <type>, Found: <type>" format
  • Correctly allows INT96→LTZ (DataFusion coerces INT96 to NTZ) and Timestamp→Int64 (nanosAsLong)
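The allowlist-plus-error-format approach described above can be sketched as follows. This is a hypothetical, simplified stand-in: the real validate_spark_schema_compatibility() and is_spark_compatible_read() in schema_adapter.rs operate on Arrow DataType values, and the ColType enum and the subset of rules here are illustrative only.

```rust
// Hypothetical stand-in for Arrow's DataType, covering a few cases from the PR.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int32,
    Int64,
    Binary,
    Utf8,
    TimestampNtz,
    TimestampLtz,
}

// Allowlist of physical (file) -> logical (requested) reads that Spark accepts.
fn is_spark_compatible_read(physical: ColType, logical: ColType) -> bool {
    use ColType::*;
    match (physical, logical) {
        // Reading a column as its own type is always fine.
        (p, l) if p == l => true,
        // A few conversions Spark's TypeUtil permits (illustrative subset):
        (Binary, Utf8) => true,               // binary bytes read as a string
        (TimestampNtz, TimestampLtz) => true, // NTZ file data read as LTZ
        (TimestampNtz, Int64) => true,        // nanosAsLong-style raw read
        // Everything else (e.g. Int32 -> Int64, LTZ -> NTZ) is rejected.
        _ => false,
    }
}

// Produces the "Column: [name], Expected: <type>, Found: <type>" error format.
fn validate_column(name: &str, physical: ColType, logical: ColType) -> Result<(), String> {
    if is_spark_compatible_read(physical, logical) {
        Ok(())
    } else {
        Err(format!(
            "Column: [{}], Expected: {:?}, Found: {:?}",
            name, logical, physical
        ))
    }
}

fn main() {
    use ColType::*;
    assert!(validate_column("c1", Binary, Utf8).is_ok());
    // SPARK-35640: INT32 read as bigint must fail.
    assert!(validate_column("c2", Int32, Int64).is_err());
    // SPARK-36182: TimestampLTZ read as TimestampNTZ must fail.
    assert!(validate_column("c3", TimestampLtz, TimestampNtz).is_err());
}
```

In the PR itself this check runs once per column when a file is opened, so a mismatched file fails fast with the column name in the message rather than silently coercing values at read time.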

How are these changes tested?

  • parquet_int_as_long_should_fail - SPARK-35640: INT32 read as bigint is rejected
  • parquet_timestamp_ltz_as_ntz_should_fail - SPARK-36182: TimestampLTZ read as TimestampNTZ is rejected
  • parquet_roundtrip_unsigned_int - UInt32→Int32 (existing test, still passes)
  • test_is_spark_compatible_read - unit test covering compatible cases (Binary→Utf8, UInt32→Int64, NTZ→LTZ, Timestamp→Int64) and incompatible cases (Utf8→Timestamp, Int32→Int64, LTZ→NTZ, Utf8→Int32, Float→Double, Decimal precision/scale mismatches)

@andygrove (Member) commented Mar 22, 2026

Thanks for working on this @vaibhawvipul. This looks like a good start. Note that the behavior does vary between Spark versions. Spark 4 is much more permissive, for example.

Could you add end-to-end integration tests, ideally using the new SQL-file-based testing approach, or Scala tests that compare Comet and Spark behavior?

@andygrove (Member) commented
@vaibhawvipul you need to run "make format" to fix lint issues

@vaibhawvipul (Contributor, Author) commented

> @vaibhawvipul you need to run "make format" to fix lint issues

Thank you. Fixed.

@comphead (Contributor) commented

I'm unsure how we should proceed, considering the widening data type coercion support in Spark 4.0. Would it be better to just document that Comet allows coercion in such cases on Spark 3.x? 🤔



Development

Successfully merging this pull request may close these issues.

native_datafusion: no error thrown for schema mismatch when reading Parquet with incompatible types
