Describe the enhancement requested
Description
Similar to how getCurrentRowIndex() was introduced to expose the current row's file-level index, this adds getCurrentRowGroupIndex() to expose the index of the row group currently being read.
New API
- ParquetFileReader.getCurrentRowGroupIndex(): returns the 0-based index of the last row group read via readNextRowGroup() / readNextFilteredRowGroup(). Returns -1 before any row group has been read.
- ParquetReader.getCurrentRowGroupIndex(): same semantics, for the high-level record reader.
- ParquetRecordReader.getCurrentRowGroupIndex(): same semantics, for the Hadoop MapReduce record reader.
The returned index is the actual file-level row group index, meaning it correctly reflects gaps when empty row groups are skipped (e.g. if row group 1 is empty, the indices reported will be 0, 2, ... not 0, 1, ...).
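The intended semantics can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the parquet-java implementation: the class RowGroupIndexSketch and the helper indicesReported are hypothetical, row groups are reduced to an array of row counts, and readNextRowGroup() returns a boolean here rather than a page store.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a reader; models only the proposed index semantics. */
final class RowGroupIndexSketch {
    private final long[] rowCounts;          // rows per row group, in file order
    private int nextRowGroup = 0;            // next file-level index to examine
    private int currentRowGroupIndex = -1;   // -1 until a row group has been read

    RowGroupIndexSketch(long... rowCounts) {
        this.rowCounts = rowCounts.clone();
    }

    /** Advances to the next non-empty row group; returns false when none remain. */
    boolean readNextRowGroup() {
        while (nextRowGroup < rowCounts.length) {
            int candidate = nextRowGroup++;
            if (rowCounts[candidate] > 0) {
                // Record the file-level index, so skipped empty groups leave gaps.
                currentRowGroupIndex = candidate;
                return true;
            }
        }
        return false;
    }

    /** Index of the last row group read, or -1 if none has been read yet. */
    int getCurrentRowGroupIndex() {
        return currentRowGroupIndex;
    }

    /** Collects the index reported after each successful read. */
    static List<Integer> indicesReported(long... rowCounts) {
        RowGroupIndexSketch reader = new RowGroupIndexSketch(rowCounts);
        List<Integer> seen = new ArrayList<>();
        while (reader.readNextRowGroup()) {
            seen.add(reader.getCurrentRowGroupIndex());
        }
        return seen;
    }

    public static void main(String[] args) {
        // Row group 1 is empty, so it is skipped and never reported.
        System.out.println(indicesReported(100, 0, 50)); // prints [0, 2]
    }
}
```

In this model the index is taken from the position in the file rather than from a count of row groups returned so far, which is what makes the gap at index 1 visible to the caller.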
Motivation
Engines like Apache Spark need to know which row group a record belongs to — for example, to expose row group metadata as a hidden column, or to correlate records with row group-level statistics. Without this API, callers have no way to determine the current row group index during sequential reads.
Component(s)
No response