Current shuffle format has too much overhead with default batch size #3882

@andygrove

Description

Describe the bug

The current shuffle format writes each batch using the Arrow IPC Stream format, with a single batch per stream instance, which means the schema is re-encoded for every batch. There may also be overhead from creating a new compression codec for each batch.

In one example with the default batch size, Comet shuffle files were 50% larger than Spark shuffle files, and overall query performance was 10% slower than Spark. After doubling the batch size, Comet shuffle files were only 8% larger than Spark's and performance was 15% faster than Spark.

Increasing the batch size consistently improves performance, but potentially at the cost of downstream operators using more memory, although we have not measured this.
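The size numbers above can be understood with a simple model: if the schema (plus stream framing) is re-encoded once per batch, the fixed overhead scales with the number of batches, so doubling the batch size roughly halves it. The sketch below is illustrative only (all byte counts are hypothetical assumptions, not measurements of Comet's actual format):

```python
# Illustrative model (not Comet's actual on-disk format): estimate shuffle
# file size when the schema is re-encoded for every batch, as happens when
# each batch is written as its own Arrow IPC stream.

def shuffle_file_size(total_rows, batch_size, row_bytes, schema_overhead_bytes):
    """Rough file size: row payload plus one schema/framing copy per batch."""
    num_batches = -(-total_rows // batch_size)  # ceiling division
    payload = total_rows * row_bytes
    overhead = num_batches * schema_overhead_bytes
    return payload + overhead

# Hypothetical numbers: 1M rows, 100 bytes/row, 2 KiB schema/framing per batch.
small = shuffle_file_size(1_000_000, 8192, 100, 2048)
large = shuffle_file_size(1_000_000, 16384, 100, 2048)

# Doubling the batch size halves the number of batches, so the schema
# overhead halves while the payload stays the same.
print(small - large)  # bytes saved by doubling the batch size → 124928
```

This also suggests why writing many batches into a single long-lived IPC stream (encoding the schema once, reusing one codec) would shrink the files without touching the batch size.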

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Labels

area:shuffle (Shuffle (JVM and native)), bug (Something isn't working), performance, priority:high (Crashes, panics, segfaults, major functional breakage)
