Search before asking
Paimon version
master (also affects 1.x releases)
Compute Engine
Spark 3.4 (engine-independent, core module issue)
Minimal reproduce step
Write to a primary-key table with many buckets where some rows contain very large binary/string columns (e.g. 100MB+ per record). The table uses the default LSM-tree storage with external sort and compaction enabled.
```sql
CREATE TABLE T (
    id INT NOT NULL,
    payload BYTES
) TBLPROPERTIES (
    'primary-key' = 'id',
    'bucket' = '256'
);

-- Insert rows where `payload` is ~100MB each, distributed across many buckets
INSERT INTO T VALUES (1, <100MB_bytes>), (2, <100MB_bytes>), ...;
```
After a few flush/compaction cycles, the TaskManager / Executor process fails with OutOfMemoryError.
The key factor is multiple buckets: each bucket's writer independently holds its own sort buffer, merge channels, and compaction readers. When a large record inflates an internal reuse buffer, that bloated buffer is retained per-bucket. With 256 buckets × 100MB+ bloated buffers, memory usage quickly exceeds available heap.
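The multiplication above can be sanity-checked with a quick back-of-envelope sketch (the 256-bucket and 100 MB figures come from the repro; the class and method names are purely illustrative, not Paimon code):

```java
// Back-of-envelope estimate of memory retained by per-bucket reuse buffers.
// Numbers mirror the repro above; this is an illustration, not Paimon code.
public class BucketMemoryEstimate {
    static long retainedBytes(int buckets, long bloatedBufferBytes) {
        return buckets * bloatedBufferBytes;
    }

    public static void main(String[] args) {
        long total = retainedBytes(256, 100L * 1024 * 1024); // 256 buckets x 100 MiB
        System.out.println(total / (1024L * 1024 * 1024) + " GiB retained"); // 25 GiB
    }
}
```

25 GiB of retained reuse buffers alone is already far beyond a typical executor heap, before counting merge channels and compaction readers.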
What doesn't meet your expectations?
Heap dump analysis reveals four independent memory leak / overflow issues when handling large records:
1. Sort path — `RowHelper` internal buffer never shrinks

`RowHelper.reuseWriter` grows its internal `MemorySegment` list to accommodate large records (e.g. 100MB+), but `BinaryRowWriter.reset()` only resets the cursor without releasing the oversized segments. Since `InternalRowSerializer.serialize()` can exit via `EOFException` (a normal signal when the sort buffer is full), the bloated buffer is never released.
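The grow-only pattern, and the kind of reset that would avoid it, can be sketched with plain byte arrays in place of Paimon's `MemorySegment` list (all names and sizes here are hypothetical, not the actual `RowHelper`/`BinaryRowWriter` code):

```java
import java.util.Arrays;

// Hypothetical sketch of a reuse writer: grow-only reset (the reported bug)
// versus a reset that releases oversized buffers.
public class ShrinkingWriter {
    static final int DEFAULT_CAPACITY = 64 * 1024; // hypothetical baseline size

    private byte[] buffer = new byte[DEFAULT_CAPACITY];
    private int cursor;

    void write(byte[] record) {
        if (record.length > buffer.length) {
            buffer = Arrays.copyOf(buffer, record.length); // grows to fit a large record
        }
        System.arraycopy(record, 0, buffer, 0, record.length);
        cursor = record.length;
    }

    // Grow-only reset: the bloated buffer stays pinned across records.
    void resetCursorOnly() {
        cursor = 0;
    }

    // Shrinking reset: release oversized capacity back to the baseline.
    void resetAndShrink() {
        cursor = 0;
        if (buffer.length > DEFAULT_CAPACITY) {
            buffer = new byte[DEFAULT_CAPACITY];
        }
    }

    int capacity() {
        return buffer.length;
    }
}
```

With the grow-only reset, a single 100 MB record permanently inflates every bucket's writer it passes through; the shrinking variant pays one extra allocation after an oversized record but returns to the baseline footprint.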
2. Merge path — `BinaryRowSerializer.deserialize(reuse)` only grows, never shrinks

During external merge sort, each merge channel holds a `BinaryRow` reuse instance. When a large record is deserialized, the backing `MemorySegment` grows to fit it. Subsequent small records reuse the oversized buffer. With `max-num-file-handles` (default 128) merge channels, each retaining a 100MB+ buffer, memory usage explodes.
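A reuse-aware buffer-sizing policy that both grows and shrinks could look like the following sketch (hypothetical names, plain `byte[]` instead of `MemorySegment`; the baseline size and the 2x bloat threshold are illustrative choices, not Paimon's):

```java
// Illustrative sketch: pick a buffer for the next record, reusing when it fits,
// growing when too small, and shrinking when the reuse buffer is bloated.
public class ReuseBuffer {
    static final int BASELINE = 64 * 1024; // hypothetical normal buffer size

    static byte[] fit(byte[] reuse, int recordLength) {
        boolean tooSmall = recordLength > reuse.length;
        boolean bloated = reuse.length > Math.max(BASELINE, 2 * recordLength);
        if (tooSmall || bloated) {
            return new byte[Math.max(BASELINE, recordLength)];
        }
        return reuse; // normal case: reuse the existing buffer
    }
}
```

Applied per merge channel, this caps the steady-state footprint near the baseline instead of near the largest record each channel has ever seen.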
3. Compaction read path — `HeapBytesVector.reserveBytes()` integer overflow

`reserveBytes()` computes `newCapacity * 2` using plain `int` multiplication. When `newCapacity` exceeds ~1.07 billion (`Integer.MAX_VALUE / 2`), the doubling overflows and produces a negative value, which causes `Arrays.copyOf()` to throw `NegativeArraySizeException`.
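The overflow and an overflow-safe alternative can be demonstrated in isolation (the class and method names below are illustrative, not the actual `HeapBytesVector` code; the `Integer.MAX_VALUE - 8` clamp is a common JVM max-array-length convention, here an assumption):

```java
// Demonstrates the int-doubling overflow and an overflow-safe variant.
public class CapacityGrowth {
    // Buggy doubling, as described in the report: wraps past Integer.MAX_VALUE.
    static int doubleUnsafe(int capacity) {
        return capacity * 2;
    }

    // Safe doubling: compute in long, clamp to a maximum array length.
    static int doubleSafe(int capacity) {
        long doubled = 2L * capacity;
        long max = Integer.MAX_VALUE - 8; // assumed headroom for JVM array limits
        return (int) Math.min(doubled, max);
    }

    public static void main(String[] args) {
        int big = 1_200_000_000; // above Integer.MAX_VALUE / 2, so doubling wraps
        System.out.println(doubleUnsafe(big)); // negative: int overflow
        System.out.println(doubleSafe(big));   // clamped to the maximum
    }
}
```

Doing the arithmetic in `long` before clamping is the standard fix; `ArrayList` and `StringBuilder` in the JDK use the same pattern for their growth functions.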
4. Parquet write — statistics and page-size-check config not passed through
`RowDataParquetBuilder` does not pass through several important Parquet configuration properties:

- `parquet.statistics.truncate.length` — controls truncation of min/max statistics. Defaults to `Integer.MAX_VALUE`, causing full 100MB+ values to be stored in column-chunk metadata, ballooning the Parquet footer.
- `parquet.columnindex.truncate.length` — the same issue for column-index entries.
- `parquet.page.size.row.check.min` — minimum row count before checking page size. The default (100) means Parquet accumulates 100 rows before the first page-size check. For large records, this can cause a single page to balloon to several GB before being flushed.
- `parquet.page.size.row.check.max` — maximum row count between page-size checks. The default (10000) is far too high for large-record workloads, delaying page flushes and causing excessive memory usage.
Without these configs being passed through, users have no way to tune Parquet's behavior for large-record scenarios.
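If the builder passed these properties through, a large-record workload could be tuned with overrides like the sketch below. The keys are the Parquet property names listed above; the specific values (and the `ParquetTuning` helper itself) are hypothetical suggestions, not recommendations from Paimon:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: Parquet overrides for a large-record workload.
// Keys are real Parquet property names; values are illustrative.
public class ParquetTuning {
    static Map<String, String> largeRecordOverrides() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("parquet.statistics.truncate.length", "64");   // cap min/max stats in metadata
        conf.put("parquet.columnindex.truncate.length", "64");  // cap column-index entries
        conf.put("parquet.page.size.row.check.min", "1");       // check page size from the first row
        conf.put("parquet.page.size.row.check.max", "10");      // re-check at most every 10 rows
        return conf;
    }
}
```

Low check intervals trade a little per-row overhead for bounded page sizes, which is the right trade-off when individual records are tens of megabytes.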
Anything else?
Root cause analysis via heap dump:
| Path | Class | Issue |
| --- | --- | --- |
| Sort | `RowHelper` | `reuseWriter` segments grow but never shrink; `EOFException` exit path skips cleanup |
| Merge | `BinaryRowSerializer` | `deserialize(reuse)` only grows the backing `MemorySegment`, never shrinks |
| Compaction | `HeapBytesVector` | `newCapacity * 2` overflows `Integer.MAX_VALUE` |
| Parquet write | `RowDataParquetBuilder` | Statistics/column-index truncate length and page-size-check row counts not configurable |
Are you willing to submit a PR?