
[Bug] OOM when writing table with large records (100MB+) due to unbounded buffer growth in sort, merge and compaction paths #7620

@yugan95

Description


Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

master (also affects 1.x releases)

Compute Engine

Spark 3.4 (engine-independent, core module issue)

Minimal reproduce step

Write a primary-key table with many buckets where some rows contain very large binary/string columns (e.g. 100MB+ per record). The table uses the default LSM-Tree storage with external sort and compaction enabled.

CREATE TABLE T (
    id INT NOT NULL,
    payload BYTES
) TBLPROPERTIES (
    'primary-key' = 'id',
    'bucket' = '256'
);

-- Insert rows where `payload` is ~100MB each, distributed across many buckets
INSERT INTO T VALUES (1, <100MB_bytes>), (2, <100MB_bytes>), ...;

After a few flush/compaction cycles, the TaskManager / Executor fails with OutOfMemoryError.

The key factor is multiple buckets: each bucket's writer independently holds its own sort buffer, merge channels, and compaction readers. When a large record inflates an internal reuse buffer, that bloated buffer is retained per-bucket. With 256 buckets × 100MB+ bloated buffers (over 25 GB in the worst case), memory usage quickly exceeds available heap.

What doesn't meet your expectations?

Heap dump analysis reveals four independent memory leak / overflow issues when handling large records:

1. Sort path — RowHelper internal buffer never shrinks

RowHelper.reuseWriter grows its internal MemorySegment list to accommodate large records (e.g. 100MB+), but BinaryRowWriter.reset() only resets the cursor without releasing the oversized segments. Since InternalRowSerializer.serialize() can exit via EOFException (a normal signal when the sort buffer is full), the bloated buffer is never released.
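The grow-but-never-shrink pattern can be illustrated with a minimal sketch (this is illustrative code, not Paimon's actual RowHelper; the class, field names, and the shrink threshold are assumptions):

```java
import java.util.Arrays;

// Sketch of a reusable writer whose backing buffer grows to fit a large
// record, plus two reset variants: the buggy one (cursor only) and a
// possible fix that releases oversized buffers. Names are hypothetical.
public class ReuseBufferSketch {
    static final int INITIAL = 1024;               // normal segment size
    static final int SHRINK_THRESHOLD = 64 * 1024; // hypothetical cap

    private byte[] buffer = new byte[INITIAL];
    private int cursor = 0;

    void write(byte[] record) {
        while (buffer.length - cursor < record.length) {
            // Doubling growth: a single 100MB record permanently inflates the buffer.
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        System.arraycopy(record, 0, buffer, cursor, record.length);
        cursor += record.length;
    }

    // Buggy behavior: only the cursor is reset; the inflated buffer is retained.
    void resetKeepingCapacity() {
        cursor = 0;
    }

    // Possible fix: drop the oversized backing array once it exceeds a threshold.
    void resetAndMaybeShrink() {
        cursor = 0;
        if (buffer.length > SHRINK_THRESHOLD) {
            buffer = new byte[INITIAL];
        }
    }

    int capacity() {
        return buffer.length;
    }

    public static void main(String[] args) {
        ReuseBufferSketch w = new ReuseBufferSketch();
        w.write(new byte[1 << 20]);                    // one 1MB "large record"
        w.resetKeepingCapacity();
        System.out.println(w.capacity() >= (1 << 20)); // true: bloat retained

        w.resetAndMaybeShrink();
        System.out.println(w.capacity() == INITIAL);   // true: bloat released
    }
}
```

Because the EOFException exit from InternalRowSerializer.serialize() is a normal control-flow signal rather than an error, any cleanup has to happen in reset() itself, not in an exception handler.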

2. Merge path — BinaryRowSerializer.deserialize(reuse) only grows, never shrinks

During external merge sort, each merge channel holds a BinaryRow reuse instance. When a large record is deserialized, the backing MemorySegment grows to fit it. Subsequent small records reuse the oversized buffer. With max-num-file-handles (default 128) merge channels, each retaining a 100MB+ buffer, memory usage explodes.
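The same pattern in the deserialize(reuse) path can be sketched as follows (hypothetical code, not Paimon's BinaryRowSerializer; the 4x shrink heuristic is an assumption about one possible fix):

```java
// Sketch of a deserialize(reuse) path where the reused backing array only
// grows, and a variant that reallocates when the retained capacity is far
// larger than the current record. Method names are illustrative.
public class ReuseDeserializeSketch {
    // Buggy pattern: grow to fit the record, but never give memory back.
    static byte[] deserializeGrowOnly(byte[] reuse, int recordSize) {
        return recordSize > reuse.length ? new byte[recordSize] : reuse;
    }

    // Possible fix: if capacity is 4x+ the record, right-size the buffer.
    static byte[] deserializeWithShrink(byte[] reuse, int recordSize) {
        if (recordSize > reuse.length) {
            return new byte[recordSize];
        }
        if (reuse.length >= 4 * recordSize && reuse.length > 1024) {
            return new byte[Math.max(recordSize, 1024)];
        }
        return reuse;
    }

    public static void main(String[] args) {
        byte[] reuse = new byte[1024];
        reuse = deserializeGrowOnly(reuse, 10 * 1024 * 1024); // large record inflates buffer
        reuse = deserializeGrowOnly(reuse, 64);               // small record: bloat retained
        System.out.println(reuse.length);                     // 10485760

        byte[] fixed = new byte[1024];
        fixed = deserializeWithShrink(fixed, 10 * 1024 * 1024);
        fixed = deserializeWithShrink(fixed, 64);             // shrinks back
        System.out.println(fixed.length);                     // 1024
    }
}
```

The per-channel multiplication is what makes this severe: with the default max-num-file-handles of 128, the grow-only variant can pin 128 oversized buffers at once.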

3. Compaction read path — HeapBytesVector.reserveBytes() integer overflow

reserveBytes() computes newCapacity * 2 using plain multiplication. When newCapacity exceeds ~1.07 billion bytes, this overflows Integer.MAX_VALUE, producing a negative or zero value, which causes Arrays.copyOf() to throw NegativeArraySizeException or silently corrupt data.
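The overflow and an overflow-safe alternative can be shown in isolation (method names here are illustrative, not Paimon's actual API):

```java
// Sketch of the int overflow in capacity doubling, and a safe variant that
// computes in long and clamps to the largest allocatable array size.
public class CapacityOverflowSketch {
    // Buggy pattern: plain int multiplication wraps past Integer.MAX_VALUE.
    static int doubledBuggy(int newCapacity) {
        return newCapacity * 2;
    }

    // Safe pattern: widen to long, then clamp. VMs typically reserve a few
    // header words in arrays, hence the conventional MAX_VALUE - 8 limit.
    static int doubledSafe(int newCapacity) {
        long doubled = (long) newCapacity * 2L;
        return (int) Math.min(doubled, Integer.MAX_VALUE - 8);
    }

    public static void main(String[] args) {
        int big = 1_200_000_000; // above ~1.07 billion, so doubling overflows int
        System.out.println(doubledBuggy(big)); // -1894967296: wrapped past Integer.MAX_VALUE
        System.out.println(doubledSafe(big));  // 2147483639: clamped to Integer.MAX_VALUE - 8
    }
}
```

Feeding the wrapped negative value to Arrays.copyOf() is what produces the NegativeArraySizeException observed during compaction reads.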

4. Parquet write — statistics and page-size-check config not passed through

RowDataParquetBuilder does not pass through several important Parquet configuration properties:

  • parquet.statistics.truncate.length — controls truncation of min/max statistics. Defaults to Integer.MAX_VALUE, causing full 100MB+ values to be stored in column chunk metadata, ballooning the Parquet footer.
  • parquet.columnindex.truncate.length — same issue for column index entries.
  • parquet.page.size.row.check.min — minimum row count before checking page size. The default (100) means Parquet accumulates 100 rows before the first page-size check. For large records, this can cause a single page to balloon to several GB before being flushed.
  • parquet.page.size.row.check.max — maximum row count between page-size checks. The default (10000) is far too high for large-record workloads, delaying page flushes and causing excessive memory usage.

Without these configs being passed through, users have no way to tune Parquet's behavior for large-record scenarios.
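If RowDataParquetBuilder forwarded these keys, a large-record table could be tuned along these lines (the property keys are standard parquet-mr configuration names; the values are illustrative, not recommendations, and this only takes effect once the pass-through is implemented):

```sql
ALTER TABLE T SET TBLPROPERTIES (
    'parquet.statistics.truncate.length' = '1024',
    'parquet.columnindex.truncate.length' = '64',
    'parquet.page.size.row.check.min' = '1',
    'parquet.page.size.row.check.max' = '10'
);
```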

Anything else?

Root cause analysis via heap dump:

| Path | Class | Issue |
| --- | --- | --- |
| Sort | RowHelper | reuseWriter segments grow but never shrink; EOFException exit path skips cleanup |
| Merge | BinaryRowSerializer | deserialize(reuse) only grows the backing MemorySegment, never shrinks |
| Compaction | HeapBytesVector | newCapacity * 2 overflows Integer.MAX_VALUE |
| Parquet write | RowDataParquetBuilder | Statistics/column-index truncate length and page-size-check row counts not configurable |

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Labels

bug — Something isn't working