Support Iceberg _spec_id metadata column in native scan #2217

@weimingdiit

Is your feature request related to a problem? Please describe.

Currently, the native Iceberg scan supports only the _file metadata column. Queries that project any other Iceberg metadata column fall back to Spark, even when that column is file-level and can be materialized as a constant per scanned data file.

For example, _spec_id is a file-level Iceberg metadata column. It represents the partition spec ID of the data file containing the row. Unlike _pos, it does not require row-level materialization, so it can be supported in the same way as _file.

Related code:

  • IcebergScanSupport.isSupportedMetadataColumn only allows MetadataColumns.FILE_PATH.
  • NativeIcebergTableScanExec.metadataPartitionValues only materializes _file.

Describe the solution you'd like

Add native Iceberg scan support for the _spec_id metadata column.

The implementation can treat _spec_id as a per-file partition value:

  1. Mark MetadataColumns.SPEC_ID as a supported metadata column in IcebergScanSupport.
  2. Build a mapping from data file path to FileScanTask.file().specId().
  3. Extend NativeIcebergTableScanExec.metadataPartitionValues to materialize _spec_id as an integer literal.
  4. Add integration tests for queries such as:
    • select _spec_id from iceberg_table
    • select id, _file, _spec_id from iceberg_table

The native scan should continue to fall back for row-level metadata columns such as _pos.
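
Steps 1 to 3 above could look roughly like the following sketch. It is illustrative only and is not the project's actual code: `SpecIdSketch`, `TaskInfo`, and the three method signatures are hypothetical stand-ins. `TaskInfo` stands for the file path and `FileScanTask.file().specId()` value read from each scan task, and plain `String`/`Integer` values stand in for the literals the real `metadataPartitionValues` would produce.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; every name in this class is hypothetical.
final class SpecIdSketch {

    // Stand-in for the two values taken from an Iceberg FileScanTask:
    // the data file path and task.file().specId().
    static final class TaskInfo {
        final String filePath;
        final int specId;

        TaskInfo(String filePath, int specId) {
            this.filePath = filePath;
            this.specId = specId;
        }
    }

    // File-level metadata columns whose value is constant for a whole data file.
    static final Set<String> FILE_LEVEL_COLUMNS = Set.of("_file", "_spec_id");

    // Step 1: accept only file-level metadata columns.
    // Row-level columns such as _pos are rejected, so the scan falls back.
    static boolean isSupportedMetadataColumn(String name) {
        return FILE_LEVEL_COLUMNS.contains(name);
    }

    // Step 2: build the mapping from data file path to partition spec ID.
    static Map<String, Integer> specIdByPath(List<TaskInfo> tasks) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (TaskInfo task : tasks) {
            result.put(task.filePath, task.specId);
        }
        return result;
    }

    // Step 3: produce the per-file constants for the projected metadata columns.
    // Columns that are not metadata constants are skipped here.
    static Map<String, Object> metadataPartitionValues(
            String filePath, List<String> projected, Map<String, Integer> specIds) {
        Map<String, Object> values = new LinkedHashMap<>();
        for (String column : projected) {
            if (column.equals("_file")) {
                values.put(column, filePath);
            } else if (column.equals("_spec_id")) {
                Integer specId = specIds.get(filePath);
                if (specId == null) {
                    // _spec_id is a required column, so a missing entry is a bug.
                    throw new IllegalStateException("No spec ID for " + filePath);
                }
                values.put(column, specId);
            }
        }
        return values;
    }
}
```

For a query such as select id, _file, _spec_id from iceberg_table, the sketch yields one _file value and one _spec_id value per data file and leaves id to the regular data read. A projection that includes _pos fails the step 1 check, which matches the intended fallback to Spark.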

Additional context

Iceberg defines _spec_id as a required integer metadata column. Since the value is constant for all rows in a data file, it can be passed through the existing native scan partition-value mechanism already used for _file.

This is a focused native Iceberg scan coverage improvement and should not require an AIP.
