Problem
The estimateRgEndOffset method in RecordReaderUtils.java uses a stretchFactor to estimate how much compressed data to read ahead for a row group. The current formula:
int stretchFactor = 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
does not account for the 2-byte RLEv2 DIRECT run header. This means the worst-case uncompressed payload is actually MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2 bytes (512 * 8 + 2 = 4098), not MAX_VALUES_LENGTH * MAX_BYTE_WIDTH (4096).
Impact
When data is incompressible (e.g., random bytes), each compression block expands to HEADER_SIZE + bufferSize bytes. With bufferSize = 1024, the old formula gives stretchFactor = 2 + 4095 / 1024 = 5, allocating space for 5 compressed blocks. However, 4098 bytes of uncompressed data spans ceil(4098 / 1024) = 5 blocks of payload, and the formula's base term of 2 reserves one block beyond the payload (under integer division, 2 + (n - 1) / bufferSize equals 1 + ceil(n / bufferSize)), so 6 blocks are needed. The old estimate falls short by one block, causing IllegalArgumentException: Buffer size too small when reading a full RLEv2 DIRECT run at the estimated boundary.
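The arithmetic above can be checked with a small standalone sketch. The class name StretchFactorCheck and the constant RLE_V2_DIRECT_HEADER are hypothetical names chosen for illustration. The values 512, 8, and 2 are the ones quoted in this description, and the code does not call into ORC itself. It relies on the integer-division identity 2 + (n - 1) / b == 1 + ceil(n / b), so the block count the formula should cover is one more than the payload block count.

```java
public class StretchFactorCheck {
    // Values quoted in the description; the names mirror the ORC constants.
    static final int MAX_VALUES_LENGTH = 512;
    static final int MAX_BYTE_WIDTH = 8;
    // Hypothetical name for the 2-byte RLEv2 DIRECT run header.
    static final int RLE_V2_DIRECT_HEADER = 2;

    // The formula as it stands before the fix.
    public static int oldStretchFactor(int bufferSize) {
        return 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
    }

    // Number of bufferSize-sized blocks spanned by a worst-case DIRECT run,
    // header included (integer ceiling division).
    public static int payloadBlocks(int bufferSize) {
        int worstCase = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + RLE_V2_DIRECT_HEADER;
        return (worstCase + bufferSize - 1) / bufferSize;
    }

    public static void main(String[] args) {
        int bufferSize = 1024;
        System.out.println("old stretchFactor = " + oldStretchFactor(bufferSize));
        System.out.println("payload blocks    = " + payloadBlocks(bufferSize));
        System.out.println("blocks needed     = " + (1 + payloadBlocks(bufferSize)));
    }
}
```

For bufferSize = 1024 this reports an old stretch factor of 5 against 6 blocks needed, which is the one-block shortfall described above.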
Fix
Include the RLEv2 header size in the worst-case calculation:
int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;
This correctly yields stretchFactor = 6 for bufferSize = 1024, ensuring enough space is allocated.
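To see which buffer sizes the change affects, the two formulas can be compared side by side. This is an illustrative sketch only: StretchFactorSweep is a hypothetical class, the constants are the values quoted in this description, and the buffer sizes in the loop are arbitrary examples rather than a list of sizes ORC is known to use.

```java
public class StretchFactorSweep {
    static final int MAX_VALUES_LENGTH = 512;
    static final int MAX_BYTE_WIDTH = 8;

    // Formula before the fix: ignores the 2-byte RLEv2 DIRECT run header.
    public static int oldStretchFactor(int bufferSize) {
        return 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
    }

    // Formula after the fix: includes the 2-byte header in the worst case.
    public static int fixedStretchFactor(int bufferSize) {
        int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
        return 2 + (maxRleDirectRunSize - 1) / bufferSize;
    }

    public static void main(String[] args) {
        // Example buffer sizes, chosen only for illustration.
        int[] bufferSizes = {1024, 2048, 4096, 8192, 262144};
        for (int bufferSize : bufferSizes) {
            System.out.printf("bufferSize=%d old=%d fixed=%d%n",
                bufferSize, oldStretchFactor(bufferSize), fixedStretchFactor(bufferSize));
        }
    }
}
```

In this sweep the two formulas differ only for 1024, 2048, and 4096, where the extra 2 bytes push the run across a block boundary. For 8192 and 262144 both give 2, so the fix appears to matter mainly for small buffer sizes.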