-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29544: Fix Vectorized Parquet reading Struct columns with all fields null #6408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -64,6 +64,8 @@ public void readBatch( | |
| int total, | ||
| ColumnVector column, | ||
| TypeInfo columnType) throws IOException { | ||
| this.currentDefLevels = new int[total]; | ||
| this.defLevelIndex = 0; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you elaborate on this change: I mean, I can see that passing and handling definition level to the struct reader solves the current issue, but this looks like fixing another bug, is it the case?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because I removed the old, flawed logic that incorrectly ANDed the child isNull flags (which was the root cause of the bug for structs with all null fields), the struct reader needs a reliable way to know when the struct itself is actually null. Fetching the Definition Levels from the primitive reader isn't a separate fix; it is the replacement mechanism for the deleted logic. It is the only correct way in Parquet to distinguish between an explicitly NULL struct and a valid struct containing null fields. If you run the test & remove these two lines, the 2nd insert NULL won't evaluate correctly If we don't initialize that array, the primitive reader skips recording the D-levels, and getDefinitionLevels() returns null. The struct reader then bypasses the D-level evaluation block entirely. Because the old flawed AND logic is gone, and the new D-level logic is bypassed, the struct vector defaults to isNull = false. This causes a genuinely NULL struct (Row 2) to incorrectly evaluate as an existing struct with null fields: {"x":null,"y":null}." |
||
| int rowId = 0; | ||
| while (total > 0) { | ||
| // Compute the number of values we want to read in this page. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -317,14 +317,15 @@ protected static void writeData(ParquetWriter<Group> writer, boolean isDictionar | |
| g.addGroup("nsf").append("c", intVal).append("d", intVal); | ||
| g.append("e", doubleVal); | ||
|
|
||
| Group some_null_g = group.addGroup("struct_field_some_null"); | ||
| if (i % 2 != 0) { | ||
| some_null_g.append("f", intVal); | ||
| } | ||
| if (i % 3 != 0) { | ||
| some_null_g.append("g", doubleVal); | ||
| if (i % 2 != 0 || i % 3 != 0) { | ||
| Group some_null_g = group.addGroup("struct_field_some_null"); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. as you're already touching this part, please fix
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ouch. I missed it, let me fix it in next iter to |
||
| if (i % 2 != 0) { | ||
| some_null_g.append("f", intVal); | ||
| } | ||
| if (i % 3 != 0) { | ||
| some_null_g.append("g", doubleVal); | ||
| } | ||
| } | ||
|
|
||
| Group mapGroup = group.addGroup("map_field"); | ||
| if (i % 13 != 1) { | ||
| mapGroup.addGroup("map").append("key", binary).append("value", "abc"); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| -- SORT_QUERY_RESULTS | ||
| SET hive.vectorized.execution.enabled=true; | ||
| set hive.vectorized.execution.reduce.enabled=true; | ||
| SET hive.fetch.task.conversion=none; | ||
|
|
||
| CREATE TABLE test_parquet_struct_nulls ( | ||
| id INT, | ||
| st_prim STRUCT<x:INT, y:INT> | ||
| ) STORED AS PARQUET; | ||
|
|
||
| INSERT INTO test_parquet_struct_nulls VALUES | ||
| (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))), | ||
| (2, if(1=0, named_struct('x', 1, 'y', 1), null)), | ||
| (3, named_struct('x', 3, 'y', CAST(NULL AS INT))), | ||
| (4, named_struct('x', 4, 'y', 4)); | ||
|
|
||
| -- Test A: Full table scan to check JSON representation | ||
| SELECT * FROM test_parquet_struct_nulls; | ||
|
|
||
| -- Test B: Verify IS NULL evaluates correctly | ||
| SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL; | ||
|
|
||
| -- Test C: Verify IS NOT NULL evaluates correctly | ||
| SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL; | ||
|
|
||
| -- Test D: Verify field-level null evaluation inside a valid struct | ||
| SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND st_prim.x IS NULL; | ||
|
|
||
| -- Validate without vectorization | ||
| SET hive.vectorized.execution.enabled=true; | ||
| SELECT * FROM test_parquet_struct_nulls; |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| PREHOOK: query: CREATE TABLE test_parquet_struct_nulls ( | ||
| id INT, | ||
| st_prim STRUCT<x:INT, y:INT> | ||
| ) STORED AS PARQUET | ||
| PREHOOK: type: CREATETABLE | ||
| PREHOOK: Output: database:default | ||
| PREHOOK: Output: default@test_parquet_struct_nulls | ||
| POSTHOOK: query: CREATE TABLE test_parquet_struct_nulls ( | ||
| id INT, | ||
| st_prim STRUCT<x:INT, y:INT> | ||
| ) STORED AS PARQUET | ||
| POSTHOOK: type: CREATETABLE | ||
| POSTHOOK: Output: database:default | ||
| POSTHOOK: Output: default@test_parquet_struct_nulls | ||
| PREHOOK: query: INSERT INTO test_parquet_struct_nulls VALUES | ||
| (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))), | ||
| (2, if(1=0, named_struct('x', 1, 'y', 1), null)), | ||
| (3, named_struct('x', 3, 'y', CAST(NULL AS INT))), | ||
| (4, named_struct('x', 4, 'y', 4)) | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: _dummy_database@_dummy_table | ||
| PREHOOK: Output: default@test_parquet_struct_nulls | ||
| POSTHOOK: query: INSERT INTO test_parquet_struct_nulls VALUES | ||
| (1, named_struct('x', CAST(NULL AS INT), 'y', CAST(NULL AS INT))), | ||
| (2, if(1=0, named_struct('x', 1, 'y', 1), null)), | ||
| (3, named_struct('x', 3, 'y', CAST(NULL AS INT))), | ||
| (4, named_struct('x', 4, 'y', 4)) | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: _dummy_database@_dummy_table | ||
| POSTHOOK: Output: default@test_parquet_struct_nulls | ||
| POSTHOOK: Lineage: test_parquet_struct_nulls.id SCRIPT [] | ||
| POSTHOOK: Lineage: test_parquet_struct_nulls.st_prim SCRIPT [] | ||
| PREHOOK: query: SELECT * FROM test_parquet_struct_nulls | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT * FROM test_parquet_struct_nulls | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 1 {"x":null,"y":null} | ||
| 2 NULL | ||
| 3 {"x":3,"y":null} | ||
| 4 {"x":4,"y":4} | ||
| PREHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NULL | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 2 | ||
| PREHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 1 | ||
| 3 | ||
| 4 | ||
| PREHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND st_prim.x IS NULL | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT id FROM test_parquet_struct_nulls WHERE st_prim IS NOT NULL AND st_prim.x IS NULL | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 1 | ||
| PREHOOK: query: SELECT * FROM test_parquet_struct_nulls | ||
| PREHOOK: type: QUERY | ||
| PREHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| POSTHOOK: query: SELECT * FROM test_parquet_struct_nulls | ||
| POSTHOOK: type: QUERY | ||
| POSTHOOK: Input: default@test_parquet_struct_nulls | ||
| #### A masked pattern was here #### | ||
| 1 {"x":null,"y":null} | ||
| 2 NULL | ||
| 3 {"x":3,"y":null} | ||
| 4 {"x":4,"y":4} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the same problem applies to another complex types? better to have more test cases and maybe even rename the qfile to
parquet_complex_..., maybeparquet_complex_types_null_vectorization.qto clearly distinguish from the already existingparquet_complex_types_vectorization.qe.g.:
LIST:
or MAP ("same" as struct but not fixed schema)
or even nested struct to validate the logic on deeper recursive callpaths:
and nested data can contain NULL on different levels where you patch is actually hit I assume, e.g.:
I would appreciate a lot if this patch could validate all of these
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Struct value turning to
Null, if sub fields areNULLwas due to specific code inVectorizedStructReaderhive/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedStructColumnReader.java
Lines 50 to 55 in 13f3208
It used to check, if all fields have
null, it used to setnull, that I fixed.I checked the LIST & MAP. They have another bug :-) In LIST & MAP,
If you insert
NULLin them it defaults to0eg.
It outputs
Disabling vectorization gives correct
This seems some different bug, like every
NULLis treated as 0.Will it be ok, if we chase this in a different ticket. I believe it is some where it is returning default value of int instead
NULL, some check is wrong which I have to debug.Regarding nested, vectorization is disabled so, that doesn't kick in:
https://issues.apache.org/jira/browse/HIVE-19016
For map we already have a test: https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_map_null_vectorization.q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the Map there is test but the output is wrong, it is testing for
nullbut the output is having 0,https://github.com/apache/hive/blame/13f3208c01fec2d20108302efc3fd033d1d76a19/ql/src/test/results/clientpositive/llap/parquet_map_null_vectorization.q.out#L153-L158
It should have been
The inserts were
Let me know if you are ok chasing this separately