Current shuffle format has too much overhead with default batch size #3882
Description
Describe the bug
The current shuffle format writes each batch using the Arrow IPC Stream format, with a separate stream instance per batch, which means the schema is re-encoded for every batch. There may also be overhead in creating a new compression codec instance for each batch.
In one example, we have seen that with the default batch size, Comet shuffle files are 50% larger than Spark shuffle files, and overall query performance was 10% slower than Spark. After doubling the batch size, Comet shuffle files were only 8% larger than Spark's, and performance was 15% faster than Spark.
Increasing the batch size consistently improves performance, but at the potential cost of downstream operators using more memory, although we have not measured this.
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response