-
Notifications
You must be signed in to change notification settings - Fork 461
GH-541: Document status of file_path #542
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
alkis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a row to https://parquet.apache.org/docs/file-format/implementationstatus/ for this?
src/main/thrift/parquet.thrift
Outdated
| * addition process. CONTRIBUTING.md in the parquet-format repository | ||
| * provides details on the process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I wouldn't mention CONTRIBUTING.md here - seems something that can be found easily without explicit mention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR @emkornfield
src/main/thrift/parquet.thrift
Outdated
| * addition process. CONTRIBUTING.md in the parquet-format repository | ||
| * provides details on the process. | ||
| * | ||
| * One known use-case for this field is to batch parquet footers together |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this somewhat confusing as above it says there are no implementations that make use of this implementation but then this paragraph explains a usecase.
Maybe we could say something like
As of December 2025, there are no known open source Parquet implementations that support
reading external data via this field. Some query engines (which ones??) make use of this field
to batch can batch parquet footers together into a single file that serve as an index, but this is
not common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rephrased above, starting with the known use-case.
I think we need to update implementation status phrasing to indicate the "_metadata" use case is what is meant. |
|
apache/parquet-site#145 to update implementation status |
Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>
src/main/thrift/parquet.thrift
Outdated
| * Writers should not populate this field except for in parquet summary files. Readers | ||
| * should ensure this field is empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These statements are effectively removing it from the spec, which I feel is too strong of a position for this clarification. I think it's fair to say that "readers should validate this field is empty if they do not support reading external column data", but prohibiting it is not a clarification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just removed this text entirely. I think this is mostly covered by the having new use go through a formal proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this was also referring to below as well. But I updated some language above this specific sentence in the paragraph above to also clarify that using this for reading external columns is not considered part of the specification (which is reflected by current implementation status).
remove guidance on populating
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @emkornfield -- it is really nice to see the spec being clarified
| * metadata. This path is relative to the current file. | ||
| * | ||
| * As of December 2025, the only known use-case for this field is writing summary | ||
| * parquet files (i.e. "_metadata" files). These files consolidate footers from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a link that describes what a summary file is and what implementations support it?
This is what came back from a quick google search: https://stackoverflow.com/questions/53150801/what-is-the-parquet-summary-file
But I didn't see any mention of this in the format repository: https://github.com/search?q=repo%3Aapache%2Fparquet-format%20summary&type=code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this was ever officially part of the parquet specification as far as I can tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworded this section.
src/main/thrift/parquet.thrift
Outdated
| * This is legacy feature as modern table formats (e.g. Iceberg, Hudi and Delta Lake) | ||
| * are more scalable and serve effectively the same purpose. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seem to me that calling this "legacy" may be too opinionated -- maybe we could tone down the language with something like
Note that table formats (e.g. Iceberg, Hudi and Delta Lake) offer a
superset of this functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is at attempt to summarize this thread: https://lists.apache.org/thread/ootf2kmyg3p01b1bvplpvp4ftd1bt72d
It seems like there are potential correctness issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added this to the text.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me -- thank you @emkornfield
Rationale for this change
file_pathdoes not have any proper implementations and we should clarify its current usage, and that futureusage must go through a formal feature addition process.
What changes are included in this PR?
Clarification on the
file_pathfield status.Do these changes have PoC implementations?
No.
Closes #541