Is your feature request related to a problem? Please describe.
Currently, the native Iceberg scan supports only the _file metadata column. Queries that project any other Iceberg metadata column fall back to Spark, even when that column is file-level and can be materialized as a constant per scanned data file.
For example, _spec_id is a file-level Iceberg metadata column. It represents the partition spec ID of the data file containing the row. Unlike _pos, it does not require row-level materialization, so it can be supported in the same way as _file.
Related code:
- IcebergScanSupport.isSupportedMetadataColumn only allows MetadataColumns.FILE_PATH.
- NativeIcebergTableScanExec.metadataPartitionValues only materializes _file.
Describe the solution you'd like
Add native Iceberg scan support for the _spec_id metadata column.
The implementation can treat _spec_id as a per-file partition value:
- Mark MetadataColumns.SPEC_ID as a supported metadata column in IcebergScanSupport.
- Build a mapping from data file path to FileScanTask.file().specId().
- Extend NativeIcebergTableScanExec.metadataPartitionValues to materialize _spec_id as an integer literal.
- Add integration tests for queries such as:
  select _spec_id from iceberg_table
  select id, _file, _spec_id from iceberg_table
The native scan should continue to fall back for row-level metadata columns such as _pos.
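To make the proposal concrete, here is a minimal, self-contained sketch of the intended behavior. It does not use the real Iceberg or Comet classes: SpecIdSketch, DataFileInfo, canRunNatively, specIdByPath, and metadataValues are hypothetical names chosen for illustration, and DataFileInfo only stands in for the path and spec ID that a FileScanTask's data file would provide. The actual change would live in IcebergScanSupport and NativeIcebergTableScanExec and may look quite different.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the proposed change; no Iceberg or Comet classes are used.
final class SpecIdSketch {
    // Metadata columns whose value is constant for every row of a data file.
    static final Set<String> FILE_LEVEL_COLUMNS = Set.of("_file", "_spec_id");

    // Stand-in for the data file of a scan task: only its path and partition spec ID.
    static final class DataFileInfo {
        final String path;
        final int specId;

        DataFileInfo(String path, int specId) {
            this.path = path;
            this.specId = specId;
        }
    }

    // True only if every projected metadata column is file-level.
    // A row-level column such as _pos makes this false, which means "fall back to Spark".
    static boolean canRunNatively(List<String> projectedMetadataColumns) {
        return FILE_LEVEL_COLUMNS.containsAll(projectedMetadataColumns);
    }

    // Build the mapping from data file path to partition spec ID.
    static Map<String, Integer> specIdByPath(List<DataFileInfo> files) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (DataFileInfo file : files) {
            result.put(file.path, file.specId);
        }
        return result;
    }

    // Produce the constant values for the requested metadata columns of one data file.
    static Map<String, Object> metadataValues(
            String path, Map<String, Integer> specIds, List<String> requested) {
        Map<String, Object> values = new LinkedHashMap<>();
        for (String column : requested) {
            switch (column) {
                case "_file":
                    values.put(column, path);
                    break;
                case "_spec_id":
                    values.put(column, specIds.get(path));
                    break;
                default:
                    throw new IllegalArgumentException(
                        "not a file-level metadata column: " + column);
            }
        }
        return values;
    }
}
```

In this sketch, a query projecting _file and _spec_id passes canRunNatively and gets one constant pair of values per data file, while a query that also projects _pos fails the check, which corresponds to the Spark fallback described above.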
Additional context
Iceberg defines _spec_id as a required integer metadata column. Since the value is constant for all rows in a data file, it can be passed through the existing native scan partition-value mechanism already used for _file.
This is a focused native Iceberg scan coverage improvement and should not require an AIP.