Skip to content

Conversation

@alkis
Copy link
Contributor

@alkis alkis commented May 29, 2024

  1. Add min8/max8 fields for encoding fixed length binary encoding for min/max for physical types less than or equal 8 bytes.
  2. Deprecate ColumnMetaData.encoding_stats and replace with a bool ColumnMetaData.is_fully_dict_encoded'

ref Parquet Metadata evolution

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

* Only one pair of max_value/min_value, max1/min1, max2/min2, max4/min4,
* max8/min8 can be set. The pair is determined by the physical type of the
* column. Floating point values are bitcasted to integers. Variable length
* values are set in min_value/max_value.
Copy link
Contributor

@emkornfield emkornfield May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update the docs for readers for backwards compatibility should check min_value/max_value if the non-variable width field is not not set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten this to be clearer.

7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;
9: optional i64 max8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you intentionally elide min1/max1? (they are still mentioned above).

Copy link
Contributor Author

@alkis alkis May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I removed them because they provide little benefit and do not justify the added complexity. This is because in thrift these are ulebs so it makes no difference in the wire. For flatbuffers this would make a difference though.


/** Set of all encodings used for pages in this column chunk.
/**
* DEPRECATED: use is_fully_dict_encoded instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest making this a separate PR, I think we'd prefer to keep the changes as small and focused as possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I will create another PR for the Statistics change alone if we are OK merging that now.

…tats

1. Add `min8`/`max8` fields for encoding fixed length binary encoding for min/max for physical types less than or equal 8 bytes.
2. Deprecate `ColumnMetaData.encoding_stats` and replace with a bool `ColumnMetaData.is_fully_dict_encoded`
@alkis alkis force-pushed the t1-metadata-improvements branch from b2caa21 to ec13f34 Compare May 30, 2024 09:27
@alkis alkis changed the title [T1] Allow fixed length encoding for min/max [T1] Allow fixed length encoding for min/max and deprecate encoding_stats May 30, 2024
* the columns ColumnOrder
* max_value/min_value: PLAIN encoded values, sans length prefix if varlen
* max8/min8: up to 8-bytes:
* FLOAT, DOUBLE: bitcasted to INT32 and INT64, respectively
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to be more specific here about values less then 8 bytes are translated into 8 bytes. In practice it doesn't make a difference for readers but it would be good to limit ambiguity. I assume we do a normal cast from 1/4 integer byte values to 8 bytes values rather then just embedding them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants