-
Notifications
You must be signed in to change notification settings - Fork 462
PARQUET-2249: Add nan_count to handle NaNs in statistics #196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -211,7 +211,7 @@ struct Statistics { | |||||
| */ | ||||||
| 1: optional binary max; | ||||||
| 2: optional binary min; | ||||||
| /** count of null value in the column */ | ||||||
| /** count of null values in the column */ | ||||||
| 3: optional i64 null_count; | ||||||
| /** count of distinct values occurring */ | ||||||
| 4: optional i64 distinct_count; | ||||||
|
|
@@ -233,6 +233,11 @@ struct Statistics { | |||||
| 7: optional bool is_max_value_exact; | ||||||
| /** If true, min_value is the actual minimum value for a column */ | ||||||
| 8: optional bool is_min_value_exact; | ||||||
| /** | ||||||
| * count of NaN values in the column; only present if physical type is FLOAT | ||||||
| * or DOUBLE | ||||||
| */ | ||||||
| 9: optional i64 nan_count; | ||||||
| } | ||||||
|
|
||||||
| /** Empty structs to use as logical type annotations */ | ||||||
|
|
@@ -909,16 +914,25 @@ union ColumnOrder { | |||||
| * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison | ||||||
| * | ||||||
| * (*) Because the sorting order is not specified properly for floating | ||||||
| * point values (relations vs. total ordering) the following | ||||||
| * compatibility rules should be applied when reading statistics: | ||||||
| * point values (relations vs. total ordering), the following compatibility | ||||||
| * rules should be applied when reading statistics: | ||||||
| * - If the min is a NaN, it should be ignored. | ||||||
| * - If the max is a NaN, it should be ignored. | ||||||
| * - If the nan_count field is set, a reader can compute | ||||||
| * nan_count + null_count == num_values to deduce whether all non-NULL | ||||||
| * values are NaN. | ||||||
| * - When looking for NaN values, min and max should be ignored. | ||||||
| * If the nan_count field is set, it can be used to check whether | ||||||
| * NaNs are present. | ||||||
| * - If the min is +0, the row group may contain -0 values as well. | ||||||
| * - If the max is -0, the row group may contain +0 values as well. | ||||||
| * - When looking for NaN values, min and max should be ignored. | ||||||
| * | ||||||
| * When writing statistics the following rules should be followed: | ||||||
| * - NaNs should not be written to min or max statistics fields. | ||||||
| * - It is suggested to always set the nan_count fields for FLOAT and | ||||||
| DOUBLE columns. | ||||||
| * - NaNs should not be written to min or max statistics fields except | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would expect to explicitly state that
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll update this with my next revision once we have decided on this issue. |
||||||
| * in the column index, where a value has to be written incase of | ||||||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll update this with my next revision once we have decided on this issue. |
||||||
| * only-NaN pages. | ||||||
| * - If the computed max value is zero (whether negative or positive), | ||||||
| * `+0.0` should be written into the max statistics field. | ||||||
| * - If the computed min value is zero (whether negative or positive), | ||||||
|
|
@@ -975,6 +989,11 @@ struct ColumnIndex { | |||||
| * Such more compact values must still be valid values within the column's | ||||||
| * logical type. Readers must make sure that list entries are populated before | ||||||
| * using them by inspecting null_pages. | ||||||
| * For columns of type FLOAT and DOUBLE, NaN values are not to be included | ||||||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
| * in these bounds unless all non-null values in a page are NaN, in which | ||||||
| * case min and max should be set to NaN. Readers should always ignore NaN | ||||||
| * values in the bounds; they should check nan_pages to detect the "all | ||||||
| * non-null values are NaN" case. | ||||||
| */ | ||||||
| 2: required list<binary> min_values | ||||||
| 3: required list<binary> max_values | ||||||
|
|
@@ -989,6 +1008,23 @@ struct ColumnIndex { | |||||
|
|
||||||
| /** A list containing the number of null values for each page **/ | ||||||
| 5: optional list<i64> null_counts | ||||||
|
|
||||||
| /** | ||||||
| * A list of Boolean values to determine pages that contain only NaNs. Only | ||||||
| * present for columns of type FLOAT and DOUBLE. If true, all non-null | ||||||
| * values in a page are NaN. Writers are suggested to set the corresponding | ||||||
| * entries in min_values and max_values to NaN, so that all lists have the same | ||||||
| * length and contain valid values. If false, then either all values in the | ||||||
| * page are null or there is at least one non-null non-NaN value in the page. | ||||||
| * As readers are supposed to ignore all NaN values in bounds, legacy readers | ||||||
| * who do not consider nan_pages yet are still able to use the column index | ||||||
| * but are not able to skip only-NaN pages. | ||||||
| */ | ||||||
| 6: optional list<bool> nan_pages | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this necessary? We already know:
It seems this might be enough to infer whether a page is all-NaN (except perhaps if there are repetition levels?).
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That said, if we do need an additional list (because of repeated columns?), it might be more worthwhile to add an
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Personally I think
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, number of rows in the offset index isn't enough due to repeated values. Apart from this, the suggestions seem to turn a bit in circles now. Note that all suggestions in this thread were already mentioned in my earlier post where I depicted our possible options for the column index. @pitrou what you mentioned was my Option 2. I personally would prefer this as it feels like a useful thing to have anyway. Having said that, others pointed rightfully out that it would cost a few bytes even for non float columns. The value might be valuable for other tasks as well. For example, it could be used to quickly check how many nested values are in a page. By having these values one could sum up the nested values per column chunk by adding up all the value counts. This is currently a value that cannot be optained at all through statistics; instead one has to decode pages and count. For example, the SQL query @wgtmac your proposal was my Option 1 and actually my initial proposal (see previous commit). Note that you earlier actually were against writing NaNs and rather preferred the nan_pages approach:
Is your argument that if we now need to write the NaNs anyway, that we should in this case just use them instead of adding nan_pages? I do agree that this would save the extra field and I personally see nothing wrong in doing this. Readers need to be able to detect NaN values anyway (to ignore them), so readers should be able to use the same logic to determin min=max=NaN <=> all values are NaN. As mentioned in my previous post where I compared the three approaches, I am happy to implement any of them and I think all of them will fulfill the requirements. In my personal opinion, I like the current approach with nan_pages actually the least, as it seems redundant if we have to write NaN values anyway and I see no problem in using NaN values for the "all values NaN check". I also like the option of adding a value_counts field to the column index of all columns. It feels like a useful and missing field (that is not subsumed by offset index row counts for nested columns) and I would love to add it as well and I feel the few extra bytes will be so negligible in contrast to the actual data that no-one will ever care. Also it would enable us to do the check for all values NaN the same way in page statistics and in the column index. So we're back at the three options I proposed:
@wgtmac @mapleFU @gszadovszky @pitrou could we arrive at a consensus here? I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't mentioned your favorite, yet (you just pointed out that we have to write some valid value).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pitrou writing anything but NaN into min/max was one of my suggestions to circumvent the problem that parquet-mr doesn't seem to check for NaN values in min/max while reading and therefore would probably yield wrong results once we start writing NaNs into these values. This would only work if we go back to maintaining either
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hi gabor, which implemention has do like that? I check C++ implemention but it doesn't do this. Maybe we can do a check here? Since I guess
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mapleFU, I did not think about any specific implementation. (TBH, I only have experince with parquet-mr.) This is mentioned in the PR description. Maybe, we do not have any implementations as such. @JFinis, I agree we should not care about the potential systems already writing NaN values into column indexes. Also agree that writing NaN values to min/max is risky for existing systems. So we need to write non-NaN valid values to min/max for all-NaN pages. (And of course mark them with either The more we narrow the range the higher the chance the page will be dropped during filtering which is good because we should not search for NaN values based on the spec anyway. What do you think about
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Trying to catching up the discussion. I like the idea to write either [-inf, +inf] or [-0.0, +0.0] for NaN-only pages. As NaN value does not have a well-defined order across systems, simply leveraging page min/max values to filter NaN does not make any sense. Therefore I think this design can break such misuses. |
||||||
|
|
||||||
| /** A list containing the number of NaN values for each page. Only present | ||||||
| * for columns of type FLOAT and DOUBLE. **/ | ||||||
| 7: optional list<i64> nan_counts | ||||||
| } | ||||||
|
|
||||||
| struct AesGcmV1 { | ||||||
|
|
||||||
Uh oh!
There was an error while loading. Please reload this page.