
Question regarding Parquet Page Index: Why enable it during write if it's not utilized during read? #814

@Patzifist

Describe the usage question you have. Please include as many useful details as possible.

Hi team,

I’m currently configuring the arrow-go Parquet package (v18) for high-performance data ingestion, and I have a question about the usefulness of the Page Index (Column Index / Offset Index).

In my current setup, I see that PageIndexEnabled can be toggled in WriterProperties. However, after digging into the arrow-go reader and scanner implementations, I couldn't find clear evidence that the Page Index is being used to perform page-level skipping during queries.

Questions:

  1. Read-side support: Does the current arrow-go Parquet reader or the higher-level Scanner API actually implement page-level pruning using the Page Index? Or is filtering still limited to Row Group boundaries?
  2. Writing strategy: If the Go reader doesn't support it yet, is there any reason to enable it during the write phase other than compatibility with external engines (like Spark or Trino)?
  3. Overhead: Are there any significant performance penalties when writing files with Page Index enabled in a Go-centric environment, given the extra metadata management?
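
To make question 1 concrete, here is a minimal sketch of what I mean by page-level pruning. The pageStats type and the pagesToRead function are hypothetical names invented for illustration; they are not part of the arrow-go API. The sketch only assumes that each page carries min/max values (as in the Column Index) and a first-row position (as in the Offset Index).

```go
package main

import "fmt"

// pageStats is a hypothetical stand-in for one Column Index entry combined
// with its Offset Index entry. It is NOT an arrow-go type.
type pageStats struct {
	Min, Max    int64 // smallest and largest value stored in the page
	FirstRowIdx int64 // index of the page's first row within the row group
}

// pagesToRead returns the indexes of the pages whose [Min, Max] range
// overlaps the predicate lo <= x <= hi. Every other page cannot contain a
// matching value, so a reader could skip it without decoding it.
func pagesToRead(pages []pageStats, lo, hi int64) []int {
	var keep []int
	for i, p := range pages {
		if p.Max < lo || p.Min > hi {
			continue // no overlap with the predicate: the page can be skipped
		}
		keep = append(keep, i)
	}
	return keep
}

func main() {
	pages := []pageStats{
		{Min: 0, Max: 99, FirstRowIdx: 0},
		{Min: 100, Max: 199, FirstRowIdx: 1000},
		{Min: 200, Max: 299, FirstRowIdx: 2000},
	}
	// Predicate: 150 <= x <= 160. Only the second page overlaps it.
	fmt.Println(pagesToRead(pages, 150, 160)) // prints [1]
}
```

With row-group-only filtering, all three pages above would be decoded whenever the row group's overall min/max range overlaps the predicate. With page-level pruning, only the second page would be.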

I want to avoid including "dead weight" metadata in my files if it doesn't provide any performance benefits within the Go ecosystem.

Looking forward to your clarification.

Component(s)

Parquet
