Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion LogicalTypes.md
Original file line number Diff line number Diff line change
Expand Up @@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti

The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.

The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
The type-defined sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros),
as for `FLOAT` and `DOUBLE`. It is recommended that writers use IEEE754TotalOrder when writing columns
of this type for a well-defined handling of NaNs and signed zeros. See the `ColumnOrder` union in the
[Thrift definition](src/main/thrift/parquet.thrift) for details.

## Temporal Types

Expand Down
4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
Parquet stores min/max statistics at several levels (such as Column Chunk,
Column Index, and Data Page). These statistics are according to a sort order,
which is defined for each column in the file footer. Parquet supports common
sort orders for logical and primitve types. The details are documented in the
sort orders for logical and primitve types and also special orders for types
where the common sort order is not unambiguously defined (e.g., NaN ordering
for floating point types). The details are documented in the
Comment on lines +161 to +163
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a spelling mistake (primitve). Also here is a proposal to make this more concise

Suggested change
sort orders for logical and primitve types and also special orders for types
where the common sort order is not unambiguously defined (e.g., NaN ordering
for floating point types). The details are documented in the
sort orders for logical and primitive types and also special orders for
types with potentially ambiguous semantics (e.g., NaN ordering
for floating point types). The details are documented in the

[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.

## Nested Encoding
Expand Down
129 changes: 122 additions & 7 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -309,6 +309,13 @@ struct Statistics {
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;
/**
* count of NaN values in the column; only present if physical type is FLOAT
* or DOUBLE, or logical type is FLOAT16.
* Readers MUST distinguish between nan_count not being present and nan_count == 0.
* If nan_count is not present, readers MUST NOT assume nan_count == 0.
*/
9: optional i64 nan_count;
Comment on lines +312 to +318
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find this easier to understand if it were more concise and didn't have a double negative. Here is a suggestion:

Suggested change
/**
* count of NaN values in the column; only present if physical type is FLOAT
* or DOUBLE, or logical type is FLOAT16.
* Readers MUST distinguish between nan_count not being present and nan_count == 0.
* If nan_count is not present, readers MUST NOT assume nan_count == 0.
*/
9: optional i64 nan_count;
/**
* count of NaN values in the column; only present if physical type is FLOAT
* or DOUBLE, or logical type is FLOAT16.
* If this field is not present, readers must assume NaNs may be present
* (MUST NOT assume nan_count == 0).
*/
9: optional i64 nan_count;

}

/** Empty structs to use as logical type annotations */
Expand Down Expand Up @@ -670,7 +677,7 @@ enum BoundaryOrder {
/** Data page header */
struct DataPageHeader {
/**
* Number of values, including NULLs, in this data page.
* Number of values, including nulls, in this data page.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove these unnecessary changes? It pollutes git blame and is no more clear.

*
* If a OffsetIndex is present, a page must begin at a row
* boundary (repetition_level = 0). Otherwise, pages may begin
Expand Down Expand Up @@ -717,9 +724,9 @@ struct DictionaryPageHeader {
* The remaining section containing the data is compressed if is_compressed is true
**/
struct DataPageHeaderV2 {
/** Number of values, including NULLs, in this data page. **/
/** Number of values, including nulls, in this data page. **/
1: required i32 num_values
/** Number of NULL values, in this data page.
/** Number of null values, in this data page.
Number of non-null = num_values - num_nulls which is also the number of values in the data section **/
2: required i32 num_nulls
/**
Expand Down Expand Up @@ -1030,6 +1037,9 @@ struct RowGroup {
/** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {}

/** Empty struct to signal IEEE 754 total order for floating point types */
struct IEEE754TotalOrder {}

/**
* Union to specify the order used for the min_value and max_value fields for a
* column. This union takes the role of an enhanced enum that allows rich
Expand All @@ -1038,6 +1048,7 @@ struct TypeDefinedOrder {}
* Possible values are:
* * TypeDefinedOrder - the column uses the order defined by its logical or
* physical type (if there is no logical type).
* * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
*
* If the reader does not support the value of this union, min and max stats
* for this column should be ignored.
Expand Down Expand Up @@ -1082,23 +1093,105 @@ union ColumnOrder {
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
* point values (relations vs. total ordering) the following
* (*) Because the precise sorting order is ambiguous for floating
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be clear that this is referring to the TypeDefinedOrder for floating points.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* (*) Because the precise sorting order is ambiguous for floating
* (*) Because the TypeDefinedOrder is ambiguous for floating

Perhaps explicitly use TypeDefinedOrder or TYPE_ORDER here?

* point types due to underspecified handling of NaN and -0/+0,
* it is recommended that writers use IEEE_754_TOTAL_ORDER
* for these types.
*
* If TYPE_ORDER is used for floating point types, then the following
* compatibility rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
* - If the nan_count field is set, a reader can compute
* nan_count + null_count == num_values to deduce whether all non-null
* values are NaN.
* - When looking for NaN values, min and max should be ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving these to line 1113 as it has duplicate content.

* If the nan_count field is set, it can be used to check whether
* NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
*
* When writing statistics the following rules should be followed:
* - NaNs should not be written to min or max statistics fields.
* - It is suggested to always set the nan_count field for floating
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The heading says "the following rules should be followed" but the first one says "it is suggested" which seems to imply it is optional. I recommend removing the "suggested"

Copy link

@orlp orlp Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even 'should' is still optional per RFC2119. If this isn't optional, 'must' or 'shall' is the correct term. But this is a problem in more places in this document, so I take it you're not following that RFC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the requirement should be to set nan_count, so I agree that this should not be a suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole document uses "should", where "shall" or "must" would actually be more appropriate. Therefore, I was going with the general speak of this document for consistency. I totally agree though; it always bothered me a lot that the spec always uses "should", making it sounds as if all of this is just a mere suggestion and implementors can do what they want.

That being said, if we want to change this, I would rather do it in one PR that updates the whole document to stay consistent.

* point types, especially also if it is zero.
* - NaNs should not be written to min or max statistics fields except
* in the column index, where min_values and max_values are not optional
* so a NaN value must be written if all non-null values in a page
* are NaN.
Comment on lines +1118 to +1121
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* - NaNs should not be written to min or max statistics fields except
* in the column index, where min_values and max_values are not optional
* so a NaN value must be written if all non-null values in a page
* are NaN.
* - NaNs should not be written to min or max statistics fields except
* in the column index when a page contains only NaN values. In this
* case, since min_values and max_values are required, a NaN value
* must be written.

* - If the computed max value is zero (whether negative or positive),
* `+0.0` should be written into the max statistics field.
* - If the computed min value is zero (whether negative or positive),
* `-0.0` should be written into the min statistics field.
*/
1: TypeDefinedOrder TYPE_ORDER;

/*
* The floating point type is ordered according to the totalOrder predicate,
* as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
* physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
*
* Intuitively, this orders floats mathematically, but defines -0 to be less
* than +0, -NaN to be less than anything else, and +NaN to be greater than
* anything else. It also defines an order between different bit representations
* of the same value.
*
* The formal definition is as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to copy this here? Having a version here that duplicates the IEEE-754 version makes me worry that they will get out of sync or have errors due to copying. We should make it clear that the IEEE-754 version is correct if the two ever differ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand what lines your comment is referring to. What is the part you see duplicated?

* a) If x<y, totalOrder(x, y) is true.
* b) If x>y, totalOrder(x, y) is false.
* c) If x=y:
* 1) totalOrder(−0, +0) is true.
* 2) totalOrder(+0, −0) is false.
* 3) If x and y represent the same floating-point datum:
* i) If x and y have negative sign, totalOrder(x, y) is true if and
* only if the exponent of x ≥ the exponent of y
* ii) otherwise totalOrder(x, y) is true if and only if the exponent
* of x ≤ the exponent of y.
* d) If x and y are unordered numerically because x or y is NaN:
* 1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
* negative sign bit and y is a non-NaN floating-point number.
* 2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
* positive sign bit and x is a non-NaN floating-point number.
* 3) If x and y are both NaNs, then totalOrder reflects a total ordering
* based on:
* i) negative sign orders below positive sign
* ii) signaling orders below quiet for +NaN, reverse for −NaN
* iii) lesser payload, when regarded as an integer, orders below
* greater payload for +NaN, reverse for −NaN.
*
* Note that this ordering can be implemented efficiently in software by bit-wise
* operations on the integer representation of the floating point values.
* E.g., this is a possible implementation for DOUBLE in Rust:
*
* pub fn totalOrder(x: f64, y: f64) -> bool {
* let mut x_int = x.to_bits() as i64;
* let mut y_int = y.to_bits() as i64;
* x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
* y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
* return x_int <= y_int;
* }
*
* When writing statistics for columns with this order, the following rules
* must be followed:
* - Writing the nan_count field is mandatory when using this ordering,
* especialy also if it is zero.
Comment on lines +1176 to +1177
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I if the field is mandatory, the "especially" part makes no sense

Suggested change
* - Writing the nan_count field is mandatory when using this ordering,
* especialy also if it is zero.
* - Writing the nan_count field is mandatory when using this ordering.

* - NaNs should not be written to min or max statistics fields except
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the spec should specify exactly what should be written, not just rule out what shouldn't be. I assume you meant to specify here that the smallest/largest non-NaN value must be written instead?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this rule (assuming I understood you correctly) is a bit overcomplicated? Not sure why we'd make the rules different depending or not whether it is a column index. May I propose the following instead which applies regardless of whether it is a column index?

When writing statistics for columns with this order, the following rules must be followed:

  • Min or max statistics if written must contain the smallest or largest non-NaN value respectively. If there are exclusively NaNs they must contain the smallest and largest NaN values respectively instead, as defined by the IEEE 754 total order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can also keep it the same for the column index and statistics here. My initial idea was to keep the behavior close to what we currently have where possible. But I agree that we can alos instead make it different from the current behavior and therefore more consistent between column index and statistics.

* in the column index, where min_values and max_values are not optional
* so a NaN value must be written if all non-null values in a page
* are NaN. In this case, the min_values[i] and max_values[i] fields
* should be set to the smallest and largest NaN values contained
* in the page, as defined by the IEEE 754 total order.
*
* When reading statistics for columns with this order, the following rules
* should be followed:
* - Readers should consult the nan_count field to determine whether NaNs
* are present.
* - A reader can compute nan_count + null_count == num_values to deduce
* whether all non-null values are NaN. In the page index, which does not
* have a num_values field, the presence of a NaN value in min_values
* or max_values indicates that all non-null values are NaN.
*/
2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
}

struct PageLocation {
Expand Down Expand Up @@ -1170,6 +1263,19 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated before
* using them by inspecting null_pages.
* For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
* NaN values are not to be included in these bounds. If all non-null values
* of a page are NaN, then a writer must do the following:
* - If the order of this column is TypeDefinedOrder, then no column index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* - If the order of this column is TypeDefinedOrder, then no column index
* - If the order of this column is TYPE_ORDER, then no column index

Just to be consistent

* must be written for this column chunk. While this is unfortunate for
* performance, it is necessary to avoid conflict with legacy files that
* still included NaN in min_values and max_values even if the page had
* non-NaN values. To mitigate this, IEEE754_TOTAL_ORDER is recommended.
* - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
* * If IEEE754_TOTAL_ORDER is used for the column and all non-null values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the heading - bullet is different:

Suggested change
* * If IEEE754_TOTAL_ORDER is used for the column and all non-null values
* - If IEEE754_TOTAL_ORDER is used for the column and all non-null values

* of a page are NaN, then min_values[i] and max_values[i] must be set to
* the smallest and largest NaN value contained in the page, as defined
* by the IEEE 754 total order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this require actual bound for NaN rather than using a standard NaN value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intentaion was here that if we already use an order that orders NaNs in a well defined manner, we can also use it. I agree we could also settle for "any NaN values is okay in this case", but that is somewhat going against the spirit of IEEE total order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it is all these strange special cases that are why I still prefer the total order PR - nan_counts sound simple but in practice there be dragons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tustvold, I don't think the added complexity of this proposal is worth it given the meager added benefit. This proposal still means that query engines that do use IEEE total ordering will be unable to filter due to the lack of knowledge of the sign of the NaNs seen. This is the flip side of total ordering not quite working for engines that lump all NaNs together. So the only real benefit to this proposal is to not "poison" the statistics.

Comment on lines +1274 to +1278
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
* * If IEEE754_TOTAL_ORDER is used for the column and all non-null values
* of a page are NaN, then min_values[i] and max_values[i] must be set to
* the smallest and largest NaN value contained in the page, as defined
* by the IEEE 754 total order.
* - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
* and max_values[i] must be set to the smallest and largest NaN value
* contained in the page, as defined by the IEEE 754 total order.

*/
2: required list<binary> min_values
3: required list<binary> max_values
Expand All @@ -1193,7 +1299,6 @@ struct ColumnIndex {
* null counts are 0.
*/
5: optional list<i64> null_counts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary whitespace change.

/**
* Contains repetition level histograms for each page
* concatenated together. The repetition_level_histogram field on
Expand All @@ -1211,6 +1316,16 @@ struct ColumnIndex {
* Same as repetition_level_histograms except for definitions levels.
**/
7: optional list<i64> definition_level_histograms;

/**
* A list containing the number of NaN values for each page. Only present
* for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16.
* If this field is not present, readers MUST assume that there might or
* might not be NaN values in any page, as NaNs should not be included
* in min_values or max_values.
Comment on lines +1324 to +1325
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* might not be NaN values in any page, as NaNs should not be included
* in min_values or max_values.
* might not be NaN values in any page, as NaNs are not included
* in min_values or max_values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks imprecise to me. NaNs are included in min_values/max_values when IEEE754TotalOrder is used. So perhaps we should simply say that If this field is not present, readers MUST assume that there might be NaN values in any page?

*/
8: optional list<i64> nan_counts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that there was concern with introducing nan_counts to column index which is not applicable to all types. If we don't receive any objection during this round of discussion, can we conclude that we have consensus on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the trade-off for nan_counts in general: You add fields to thrift that are only relevant to floating point types. Let's see if there is any objection on this.


}

struct AesGcmV1 {
Expand Down