estimateRgEndOffset slop calculation is insufficient for incompressible data #2619

@thexiay

Problem

The estimateRgEndOffset method in RecordReaderUtils.java uses a stretchFactor to estimate how much compressed data to read ahead for a row group. The current formula:

int stretchFactor = 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;

does not account for the 2-byte RLEv2 DIRECT run header. This means the worst-case uncompressed payload is actually MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2 bytes (512 * 8 + 2 = 4098), not MAX_VALUES_LENGTH * MAX_BYTE_WIDTH (4096).

Impact

When data is incompressible (e.g., random bytes), each compression block expands to HEADER_SIZE + bufferSize bytes. With bufferSize = 1024, the old formula gives stretchFactor = 2 + (4096 - 1) / 1024 = 5, allocating space for 5 compressed blocks. However, a 4098-byte run requires 2 + (4098 - 1) / 1024 = 6 blocks: the base factor of 2 plus 4 more blocks of payload. The old estimate falls short by one block, causing IllegalArgumentException: Buffer size too small when reading a full RLEv2 DIRECT run at the estimated boundary.

Fix

Include the RLEv2 header size in the worst-case calculation:

int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;

This correctly yields stretchFactor = 6 for bufferSize = 1024, ensuring enough space is allocated.
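
To make the arithmetic easy to check, here is a minimal standalone sketch that evaluates both formulas for a few buffer sizes. It is not the actual RecordReaderUtils code: the class name StretchFactorSketch, the helper method names, and the RLE_V2_DIRECT_HEADER constant are hypothetical, and the constant values (512, 8, 2) are copied from the numbers quoted in this issue rather than read from the ORC sources.

```java
// Illustrative sketch only; names and constants are assumptions based on the issue text.
final class StretchFactorSketch {
    // Values quoted in the issue (512 * 8 = 4096), declared locally for illustration.
    static final int MAX_VALUES_LENGTH = 512;
    static final int MAX_BYTE_WIDTH = 8;
    // Hypothetical name for the 2-byte RLEv2 DIRECT run header described above.
    static final int RLE_V2_DIRECT_HEADER = 2;

    // Formula currently in use, which ignores the run header.
    static int oldStretchFactor(int bufferSize) {
        return 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
    }

    // Proposed formula, which includes the run header in the worst-case run size.
    static int newStretchFactor(int bufferSize) {
        int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + RLE_V2_DIRECT_HEADER;
        return 2 + (maxRleDirectRunSize - 1) / bufferSize;
    }

    public static void main(String[] args) {
        // Compare both estimates across a few buffer sizes.
        for (int bufferSize : new int[] {1024, 2048, 4096, 262144}) {
            System.out.printf("bufferSize=%d old=%d new=%d%n",
                    bufferSize, oldStretchFactor(bufferSize), newStretchFactor(bufferSize));
        }
    }
}
```

Under these assumptions the two formulas differ only when bufferSize divides 4096 evenly (for example 1024, 2048, and 4096); for larger buffers such as 262144 both give the base factor of 2.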
