Skip to content

Conversation

@emkornfield
Copy link
Contributor

Rationale for this change

file_path does not have any proper implementations and we should clarify its current usage, and that future
usage must go through a formal feature addition process.

What changes are included in this PR?

Clarification on the file_path field status.

Do these changes have PoC implementations?

No.

Closes #541

Copy link
Contributor

@alkis alkis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 970 to 971
* addition process. CONTRIBUTING.md in the parquet-format repository
* provides details on the process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wouldn't mention CONTRIBUTING.md here - seems something that can be found easily without explicit mention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @emkornfield

* addition process. CONTRIBUTING.md in the parquet-format repository
* provides details on the process.
*
* One known use-case for this field is to batch parquet footers together
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this somewhat confusing as above it says there are no implementations that make use of this implementation but then this paragraph explains a usecase.

Maybe we could say something like

As of December 2025, there are no known open source Parquet implementations that support
reading external data via this field. Some query engines (which ones??) make use of this field
to batch can batch parquet footers together into a single file that serve as an index, but this is
not common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased above, starting with the known use-case.

@emkornfield
Copy link
Contributor Author

Should we add a row to https://parquet.apache.org/docs/file-format/implementationstatus/ for this?

I think we need to update implementation status phrasing to indicate the "_metadata" use case is what is meant.

@emkornfield
Copy link
Contributor Author

apache/parquet-site#145 to update implementation status

Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>
Comment on lines 973 to 974
* Writers should not populate this field except for in parquet summary files. Readers
* should ensure this field is empty.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These statements are effectively removing it from the spec, which I feel is too strong of a position for this clarification. I think it's fair to say that "readers should validate this field is empty if they do not support reading external column data", but prohibiting it is not a clarification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just removed this text entirely. I think this is mostly covered by the having new use go through a formal proposal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this was also referring to below as well. But I updated some language above this specific sentence in the paragraph above to also clarify that using this for reading external columns is not considered part of the specification (which is reflected by current implementation status).

remove guidance on populating
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @emkornfield -- it is really nice to see the spec being clarified

* metadata. This path is relative to the current file.
*
* As of December 2025, the only known use-case for this field is writing summary
* parquet files (i.e. "_metadata" files). These files consolidate footers from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a link that describes what a summary file is and what implementations support it?

This is what came back from a quick google search: https://stackoverflow.com/questions/53150801/what-is-the-parquet-summary-file

But I didn't see any mention of this in the format repository: https://github.com/search?q=repo%3Aapache%2Fparquet-format%20summary&type=code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this was ever officially part of the parquet specification as far as I can tell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworded this section.

Comment on lines 966 to 967
* This is legacy feature as modern table formats (e.g. Iceberg, Hudi and Delta Lake)
* are more scalable and serve effectively the same purpose.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem to me that calling this "legacy" may be too opinionated -- maybe we could tone down the language with something like

Note that table formats (e.g. Iceberg, Hudi and Delta Lake) offer a
superset of this functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is at attempt to summarize this thread: https://lists.apache.org/thread/ootf2kmyg3p01b1bvplpvp4ftd1bt72d

It seems like there are potential correctness issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this to the text.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me -- thank you @emkornfield

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clarify file_path status

4 participants