Skip to content

ORC-2619: Fix estimateRgEndOffset slop calculation for incompressible data#2620

Open
thexiay wants to merge 1 commit intoapache:mainfrom
thexiay:fix/estimate-rg-end-offset-slop
Open

ORC-2619: Fix estimateRgEndOffset slop calculation for incompressible data#2620
thexiay wants to merge 1 commit intoapache:mainfrom
thexiay:fix/estimate-rg-end-offset-slop

Conversation

@thexiay
Copy link
Copy Markdown

@thexiay thexiay commented May 7, 2026

What changes were proposed in this pull request?

Fix the estimateRgEndOffset slop calculation in RecordReaderUtils.java to account for the 2-byte RLEv2 DIRECT run header.

Problem

The old formula:

int stretchFactor = 2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;

only considers the value payload (512 * 8 = 4096 bytes) but ignores the 2-byte RLE header. For bufferSize = 1024, this gives stretchFactor = 5, which is one block short when data is incompressible.

Fix

int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;

This correctly yields stretchFactor = 6, ensuring enough compressed blocks are allocated.

How was this patch tested?

Added testTruncatedRleV2DirectRunAtEstimatedEndFails in TestInStream.java that:

  1. Creates a compressed stream with incompressible (random) data
  2. Truncates it at the old estimated end offset
  3. Verifies that reading a full RLE v2 DIRECT run fails with IllegalArgumentException: Buffer size too small

This proves the old slop estimation was insufficient.

Closes #2619

@thexiay thexiay force-pushed the fix/estimate-rg-end-offset-slop branch from a9ffd5d to c52545c Compare May 8, 2026 03:06
… data

The stretchFactor calculation in estimateRgEndOffset did not account for
the 2-byte RLEv2 DIRECT run header. This caused insufficient buffer
allocation when data is incompressible, leading to 'Buffer size too small'
errors.

Fix: Include RLE_V2_HEADER_SIZE in the worst-case payload calculation.
Add test demonstrating the issue with the old formula.
@thexiay thexiay force-pushed the fix/estimate-rg-end-offset-slop branch from c52545c to 076e787 Compare May 8, 2026 10:01
@cxzl25 cxzl25 requested a review from Copilot May 9, 2026 11:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes RecordReaderUtils.estimateRgEndOffset slop estimation for compressed streams so worst-case RLEv2 DIRECT runs (including the 2-byte run header) are covered, preventing under-reading near row group boundaries on incompressible data.

Changes:

  • Update estimateRgEndOffset worst-case sizing to include the RLEv2 DIRECT run header.
  • Add/adjust tests to exercise the old vs new slop behavior and row-group seeking scenarios.
  • Introduce RLE_V2_HEADER_SIZE constant and update one existing test formula to match the new sizing.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java Adjust slop/stretch-factor math for compressed streams; add RLEv2 header-size constant.
java/core/src/test/org/apache/orc/impl/TestInStream.java Add regression test intended to demonstrate failure with the old slop estimate when truncated.
java/core/src/test/org/apache/orc/impl/TestRecordReaderImpl.java Update test’s local slop computation to include the RLEv2 header size.
java/core/src/test/org/apache/orc/impl/TestOrcLargeStripe.java Add integration-style test covering seeking/filtering across row-group boundaries with ZLIB.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +220 to +222
// RLEv2 DIRECT runs can need a 2-byte header in addition to their value payload.
int maxRleDirectRunSize = MAX_VALUES_LENGTH * MAX_BYTE_WIDTH + 2;
int stretchFactor = 2 + (maxRleDirectRunSize - 1) / bufferSize;
static final int MAX_BYTE_WIDTH =
SerializationUtils.decodeBitWidth(SerializationUtils.FixedBitSizes.SIXTYFOUR.ordinal()) / 8;
// RLEv2 DIRECT run header size in bytes
public static final int RLE_V2_HEADER_SIZE = 2;
Comment on lines +1018 to +1022
final int chunkSize = OutStream.HEADER_SIZE + bufferSize;
final int nextGroupOffset = bufferSize;
final int oldStretchFactor =
2 + (MAX_VALUES_LENGTH * MAX_BYTE_WIDTH - 1) / bufferSize;
final int oldEstimatedEnd = nextGroupOffset + oldStretchFactor * chunkSize;
Comment on lines +1051 to +1052
offset += stream.read(
rleDirectRun, offset, rleDirectRun.length - offset);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

estimateRgEndOffset slop calculation is insufficient for incompressible data

2 participants