fix: Add schema validation for native_datafusion Parquet scan (#3759)
Open
vaibhawvipul wants to merge 7 commits into apache:main
Conversation
When spark.comet.scan.impl=native_datafusion, DataFusion's Parquet reader silently coerces incompatible types instead of erroring like Spark does.
Member
Thanks for working on this @vaibhawvipul. This looks like a good start. Note that the behavior does vary between Spark versions; Spark 4 is much more permissive, for example. Could you add end-to-end integration tests, ideally using the new SQL-file-based testing approach or with Scala tests that compare Comet and Spark behavior?
Member
@vaibhawvipul, you need to run `make format` to fix lint issues.
Contributor
Author
Thank you. Fixed.
Contributor
I'm tentative about how we should proceed, considering the widening of data type coercion support in Spark 4.0. Would it be better just to document that Comet allows coercion in such cases on Spark 3.x? 🤔
Which issue does this PR close?
Closes #3720.
Rationale for this change
DataFusion is more permissive than Spark when reading Parquet files with mismatched schemas. For example, reading an INT32 column as bigint, or TimestampLTZ as TimestampNTZ, silently succeeds in DataFusion but should throw SchemaColumnConvertNotSupportedException per Spark's behavior. This breaks correctness guarantees that Spark users rely on.
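To illustrate the intended behavior, here is a minimal sketch of rejecting a mismatched column up front rather than coercing it. The function name and `(name, type)` field representation are hypothetical stand-ins for the actual Arrow schema handling in `schema_adapter.rs`; only the error message format follows the PR's description.

```rust
/// Illustrative (column name, type name) pairs standing in for Arrow fields.
type Field = (&'static str, &'static str);

/// Hypothetical per-file check: each logical (Spark) field must match its
/// physical (Parquet) counterpart; otherwise the read fails immediately
/// instead of silently coercing.
fn validate_schema(physical: &[Field], logical: &[Field]) -> Result<(), String> {
    for (name, expected) in logical {
        if let Some((_, found)) = physical.iter().find(|(n, _)| n == name) {
            if expected != found {
                return Err(format!(
                    "Column: [{name}], Expected: {expected}, Found: {found}"
                ));
            }
        }
    }
    Ok(())
}

fn main() {
    // INT32 file column read as bigint: should be rejected, not coerced.
    let physical = [("id", "INT32")];
    let logical = [("id", "BIGINT")];
    let err = validate_schema(&physical, &logical).unwrap_err();
    assert_eq!(err, "Column: [id], Expected: BIGINT, Found: INT32");
}
```

In the real implementation, strict equality would be relaxed by an allowlist of conversions that Spark itself permits.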
What changes are included in this PR?
Adds schema compatibility validation in `schema_adapter.rs`:

- `validate_spark_schema_compatibility()` checks each logical field against its physical counterpart when a file is opened
- `is_spark_compatible_read()` defines the allowlist of valid Parquet-to-Spark type conversions (matching TypeUtil's logic)
- Mismatches are reported in the `"Column: [name], Expected: <type>, Found: <type>"` format

How are these changes tested?
- `parquet_int_as_long_should_fail` - SPARK-35640: INT32 read as bigint is rejected
- `parquet_timestamp_ltz_as_ntz_should_fail` - SPARK-36182: TimestampLTZ read as TimestampNTZ is rejected
- `parquet_roundtrip_unsigned_int` - UInt32→Int32 (existing test, still passes)
- `test_is_spark_compatible_read` - unit test covering compatible cases (Binary→Utf8, UInt32→Int64, NTZ→LTZ, Timestamp→Int64) and incompatible cases (Utf8→Timestamp, Int32→Int64, LTZ→NTZ, Utf8→Int32, Float→Double, Decimal precision/scale mismatches)
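The allowlist behind those tests might look roughly like this. This is a simplified sketch with a local stand-in enum for the types involved; the actual code matches on Arrow `DataType` values and mirrors Spark's TypeUtil rules, and the exact set of permitted conversions here is only a subset drawn from the test names above.

```rust
/// Simplified stand-in for the Arrow/Spark types involved (illustrative only).
#[derive(PartialEq, Debug)]
enum SparkType {
    Int32,
    Int64,
    UInt32,
    Binary,
    Utf8,
    TimestampNtz,
    TimestampLtz,
}

/// Sketch of the allowlist: identical types are fine, plus a small set of
/// explicitly permitted conversions; everything else is rejected.
fn is_spark_compatible_read(physical: &SparkType, logical: &SparkType) -> bool {
    use SparkType::*;
    physical == logical
        || matches!(
            (physical, logical),
            (Binary, Utf8)                     // binary may be read as UTF-8 string
                | (UInt32, Int64)              // unsigned widened to next signed width
                | (TimestampNtz, TimestampLtz) // NTZ -> LTZ allowed, not the reverse
        )
}

fn main() {
    use SparkType::*;
    assert!(is_spark_compatible_read(&UInt32, &Int64));
    assert!(!is_spark_compatible_read(&Int32, &Int64)); // SPARK-35640 case
    assert!(!is_spark_compatible_read(&TimestampLtz, &TimestampNtz)); // SPARK-36182 case
}
```

Keeping the rule set as an explicit allowlist (rather than a denylist) is the conservative choice here: any conversion not known to match Spark's behavior fails loudly instead of silently coercing.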