Search before asking
Paimon version
master (also affects 1.x releases)
Compute Engine
Spark 3.4 (engine-independent, core module issue)
Minimal reproduce step
Write to a primary-key table with many buckets where some rows contain very large binary/string columns (e.g. 100MB+ per record). The table uses the default LSM-tree storage with external sort and compaction enabled.
```sql
CREATE TABLE T (
    id INT NOT NULL,
    payload BYTES
) TBLPROPERTIES (
    'primary-key' = 'id',
    'bucket' = '256'
);

-- Insert rows where `payload` is ~100MB each, distributed across many buckets
INSERT INTO T VALUES (1, <100MB_bytes>), (2, <100MB_bytes>), ...;
```
After a few flush/compaction cycles, the TaskManager / Executor process fails with OutOfMemoryError.
The key factor is multiple buckets: each bucket's writer independently holds its own sort buffer, merge channels, and compaction readers. When a large record inflates an internal reuse buffer, that bloated buffer is retained per-bucket. With 256 buckets × 100MB+ bloated buffers, memory usage quickly exceeds available heap.
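The multiplication above can be sanity-checked with a quick back-of-envelope sketch (the 256-bucket and 100 MB figures come from the repro; the class and method names are purely illustrative, not Paimon code):

```java
// Back-of-envelope estimate of memory retained by per-bucket reuse buffers.
// Numbers mirror the repro above; this is an illustration, not Paimon code.
public class BucketMemoryEstimate {
    static long retainedBytes(int buckets, long bloatedBufferBytes) {
        return buckets * bloatedBufferBytes;
    }

    public static void main(String[] args) {
        long total = retainedBytes(256, 100L * 1024 * 1024); // 256 buckets x 100 MiB
        System.out.println(total / (1024L * 1024 * 1024) + " GiB retained"); // 25 GiB
    }
}
```

25 GiB of retained reuse buffers alone is already far beyond a typical executor heap, before counting merge channels and compaction readers.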
What doesn't meet your expectations?
Heap dump analysis reveals four independent memory leak / overflow issues when handling large records:
1. Sort path — `RowHelper` internal buffer never shrinks

`RowHelper.reuseWriter` grows its internal `MemorySegment` list to accommodate large records (e.g. 100MB+), but `BinaryRowWriter.reset()` only resets the cursor without releasing the oversized segments. Since `InternalRowSerializer.serialize()` can exit via `EOFException` (a normal signal when the sort buffer is full), the bloated buffer is never released.
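The grow-only pattern, and the kind of reset that would avoid it, can be sketched with plain byte arrays in place of Paimon's `MemorySegment` list (all names and sizes here are hypothetical, not the actual `RowHelper`/`BinaryRowWriter` code):

```java
import java.util.Arrays;

// Hypothetical sketch of a reuse writer: grow-only reset (the reported bug)
// versus a reset that releases oversized buffers.
public class ShrinkingWriter {
    static final int DEFAULT_CAPACITY = 64 * 1024; // hypothetical baseline size

    private byte[] buffer = new byte[DEFAULT_CAPACITY];
    private int cursor;

    void write(byte[] record) {
        if (record.length > buffer.length) {
            buffer = Arrays.copyOf(buffer, record.length); // grows to fit a large record
        }
        System.arraycopy(record, 0, buffer, 0, record.length);
        cursor = record.length;
    }

    // Grow-only reset: the bloated buffer stays pinned across records.
    void resetCursorOnly() {
        cursor = 0;
    }

    // Shrinking reset: release oversized capacity back to the baseline.
    void resetAndShrink() {
        cursor = 0;
        if (buffer.length > DEFAULT_CAPACITY) {
            buffer = new byte[DEFAULT_CAPACITY];
        }
    }

    int capacity() {
        return buffer.length;
    }
}
```

With the grow-only reset, a single 100 MB record permanently inflates every bucket's writer it passes through; the shrinking variant pays one extra allocation after an oversized record but returns to the baseline footprint.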
2. Merge path — `BinaryRowSerializer.deserialize(reuse)` only grows, never shrinks

During external merge sort, each merge channel holds a `BinaryRow` reuse instance. When a large record is deserialized, the backing `MemorySegment` grows to fit it. Subsequent small records reuse the oversized buffer. With `max-num-file-handles` (default 128) merge channels, each retaining a 100MB+ buffer, memory usage explodes.
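A reuse-aware buffer-sizing policy that both grows and shrinks could look like the following sketch (hypothetical names, plain `byte[]` instead of `MemorySegment`; the baseline size and the 2x bloat threshold are illustrative choices, not Paimon's):

```java
// Illustrative sketch: pick a buffer for the next record, reusing when it fits,
// growing when too small, and shrinking when the reuse buffer is bloated.
public class ReuseBuffer {
    static final int BASELINE = 64 * 1024; // hypothetical normal buffer size

    static byte[] fit(byte[] reuse, int recordLength) {
        boolean tooSmall = recordLength > reuse.length;
        boolean bloated = reuse.length > Math.max(BASELINE, 2 * recordLength);
        if (tooSmall || bloated) {
            return new byte[Math.max(BASELINE, recordLength)];
        }
        return reuse; // normal case: reuse the existing buffer
    }
}
```

Applied per merge channel, this caps the steady-state footprint near the baseline instead of near the largest record each channel has ever seen.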
3. Compaction read path — `HeapBytesVector.reserveBytes()` integer overflow

`reserveBytes()` computes `newCapacity * 2` using plain `int` multiplication. When `newCapacity` exceeds ~1.07 billion (`Integer.MAX_VALUE / 2`), the doubling overflows and produces a negative value, which causes `Arrays.copyOf()` to throw `NegativeArraySizeException`.
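The overflow and an overflow-safe alternative can be demonstrated in isolation (the class and method names below are illustrative, not the actual `HeapBytesVector` code; the `Integer.MAX_VALUE - 8` clamp is a common JVM max-array-length convention, here an assumption):

```java
// Demonstrates the int-doubling overflow and an overflow-safe variant.
public class CapacityGrowth {
    // Buggy doubling, as described in the report: wraps past Integer.MAX_VALUE.
    static int doubleUnsafe(int capacity) {
        return capacity * 2;
    }

    // Safe doubling: compute in long, clamp to a maximum array length.
    static int doubleSafe(int capacity) {
        long doubled = 2L * capacity;
        long max = Integer.MAX_VALUE - 8; // assumed headroom for JVM array limits
        return (int) Math.min(doubled, max);
    }

    public static void main(String[] args) {
        int big = 1_200_000_000; // above Integer.MAX_VALUE / 2, so doubling wraps
        System.out.println(doubleUnsafe(big)); // negative: int overflow
        System.out.println(doubleSafe(big));   // clamped to the maximum
    }
}
```

Doing the arithmetic in `long` before clamping is the standard fix; `ArrayList` and `StringBuilder` in the JDK use the same pattern for their growth functions.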
4. Parquet write — statistics and page-size-check config not passed through
`RowDataParquetBuilder` does not pass through several important Parquet configuration properties:

- `parquet.statistics.truncate.length` — controls truncation of min/max statistics. Defaults to `Integer.MAX_VALUE`, causing full 100MB+ values to be stored in column-chunk metadata, ballooning the Parquet footer.
- `parquet.columnindex.truncate.length` — the same issue for column-index entries.
- `parquet.page.size.row.check.min` — minimum row count before checking page size. The default (100) means Parquet accumulates 100 rows before the first page-size check. For large records, this can cause a single page to balloon to several GB before being flushed.
- `parquet.page.size.row.check.max` — maximum row count between page-size checks. The default (10000) is far too high for large-record workloads, delaying page flushes and causing excessive memory usage.
Without these configs being passed through, users have no way to tune Parquet's behavior for large-record scenarios.
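If the builder passed these properties through, a large-record workload could be tuned with overrides like the sketch below. The keys are the Parquet property names listed above; the specific values (and the `ParquetTuning` helper itself) are hypothetical suggestions, not recommendations from Paimon:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: Parquet overrides for a large-record workload.
// Keys are real Parquet property names; values are illustrative.
public class ParquetTuning {
    static Map<String, String> largeRecordOverrides() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("parquet.statistics.truncate.length", "64");   // cap min/max stats in metadata
        conf.put("parquet.columnindex.truncate.length", "64");  // cap column-index entries
        conf.put("parquet.page.size.row.check.min", "1");       // check page size from the first row
        conf.put("parquet.page.size.row.check.max", "10");      // re-check at most every 10 rows
        return conf;
    }
}
```

Low check intervals trade a little per-row overhead for bounded page sizes, which is the right trade-off when individual records are tens of megabytes.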
Anything else?
Root cause analysis via heap dump:
| Path | Class | Issue |
| --- | --- | --- |
| Sort | `RowHelper` | `reuseWriter` segments grow but never shrink; `EOFException` exit path skips cleanup |
| Merge | `BinaryRowSerializer` | `deserialize(reuse)` only grows the backing `MemorySegment`, never shrinks |
| Compaction | `HeapBytesVector` | `newCapacity * 2` overflows `Integer.MAX_VALUE` |
| Parquet write | `RowDataParquetBuilder` | Statistics/column-index truncate length and page-size-check row counts not configurable |
Are you willing to submit a PR?