feat(write): add write pipeline with DataFusion INSERT INTO/OVERWRITE support #234
JingsongLi merged 4 commits into apache:main
Conversation
```rust
let row = BinaryRow::from_serialized_bytes(&msg.partition)?;
let mut spec = HashMap::new();
for (i, key) in partition_keys.iter().enumerate() {
    if let Some(datum) = extract_datum(&row, i, &data_types[i])? {
```
This will drop NULL partition keys from the overwrite predicate. I reproduced a case where overwriting the NULL partition also deletes other partitions.
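One way to address this, as a minimal sketch: keep an explicit `None` entry for NULL partition keys so the overwrite predicate still pins that partition instead of silently losing the constraint. Here `build_overwrite_spec` and the `Option<String>` row shape are illustrative stand-ins for the PR's `BinaryRow`/`extract_datum` types.

```rust
use std::collections::HashMap;

/// Build an overwrite partition spec that keeps NULL values as explicit
/// None entries. Skipping them (as in the snippet above) turns
/// "overwrite the NULL partition" into a weaker predicate that can
/// match, and delete, other partitions.
fn build_overwrite_spec(
    partition_keys: &[&str],
    row: &[Option<String>],
) -> HashMap<String, Option<String>> {
    let mut spec = HashMap::new();
    for (i, key) in partition_keys.iter().enumerate() {
        // Insert the key even when the datum is NULL; the predicate
        // layer can translate None into an `IS NULL` check.
        spec.insert(key.to_string(), row[i].clone());
    }
    spec
}
```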
```rust
let datum = extract_datum_from_arrow(batch, row_idx, field_idx, field.data_type())?;
if let Some(d) = datum {
    datums.push((d, field.data_type().clone()));
}
```
This will drop NULL bucket-key fields before hashing. Java preserves NULL positions here; see FixedBucketRowKeyExtractorTest.testUnCompactDecimalAndTimestampNullValueBucketNumber.
https://github.com/apache/paimon/blob/master/paimon-core/src/test/java/org/apache/paimon/table/sink/FixedBucketRowKeyExtractorTest.java
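A sketch of the behavior the Java test pins down, using a simplified stand-in hash rather than Paimon's actual bucket hash: NULL fields must keep their positions in the hashed sequence instead of being skipped before hashing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Compute a bucket from bucket-key fields, keeping NULLs in place.
/// Dropping None entries before hashing would make e.g. (1, NULL, 2)
/// hash the same as (1, 2), which disagrees with the Java extractor.
/// DefaultHasher is a placeholder for the real bucket hash function.
fn bucket_of(fields: &[Option<i64>], num_buckets: u64) -> u64 {
    let mut h = DefaultHasher::new();
    for f in fields {
        // Option<i64> hashes its None/Some discriminant too, so a NULL
        // field still contributes to the hash at its position.
        f.hash(&mut h);
    }
    h.finish() % num_buckets
}
```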
Good point! We should also fix NULL bucket value handling in TableScan.
+1
leaves12138
left a comment
Solid write pipeline implementation. The architecture mirrors the paimon-python design well and the delta-rs style direct-write pattern is a good fit.
Highlights:
- `TableWrite` + `DataFileWriter`: Clean per-(partition, bucket) writer model. The `divide_by_partition_bucket` routing via `arrow_select::take` is correct for now. Background file close via `JoinSet` in `roll_file()` and parallel `prepare_commit` with `try_join_all` are well thought out.
- `PaimonDataSink`: Proper `DataSink` implementation with `write_all` for INSERT and overwrite support. The dynamic partition predicate extraction from commit messages for OVERWRITE is the right approach.
- `TableCommit` refactoring: Splitting into explicit `commit()` (APPEND) and `overwrite()` (dynamic partition overwrite) methods is cleaner than the implicit `overwrite_partition` constructor arg.
- Integration tests: Comprehensive E2E coverage: unpartitioned, partitioned, fixed bucket, multi-commit, column projection, bucket filtering.
- `CoreOptions` additions: `file.compression`, `target-file-size`, `write.parquet-buffer-size` are the right knobs to expose.
Minor notes (non-blocking):
- `divide_by_partition_bucket` creates one `UInt32Array` of indices per row. For large batches this could be optimized with batch-level partition extraction (e.g., sort-by-partition then slice), but the current approach is correct and simple for a first pass.
- `DataFileMeta` uses `EMPTY_SERIALIZED_ROW` for `min_key`/`max_key` and zero sequence numbers; this is fine for append-only but worth a TODO note if PK/compaction support is planned later.
- The NULL datum handling fix (from the review thread) is correct: dropping NULL from bucket key datums and partition predicate specs was a real bug.
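The routing pattern discussed above can be sketched with plain std types (Arrow arrays and the `arrow_select::take` call omitted; `group_rows_by_key` is an illustrative name): group row indices by (partition, bucket) in one pass over the batch, then each index group would feed a single `take` call to slice out that writer's rows.

```rust
use std::collections::HashMap;

/// Group row indices by their (partition, bucket) key in one pass.
/// In the real pipeline, each Vec<u32> would be turned into a
/// UInt32Array and passed to `arrow_select::take` to materialize the
/// sub-batch for that (partition, bucket) writer.
fn group_rows_by_key(keys: &[(String, u32)]) -> HashMap<(String, u32), Vec<u32>> {
    let mut groups: HashMap<(String, u32), Vec<u32>> = HashMap::new();
    for (row_idx, key) in keys.iter().enumerate() {
        groups.entry(key.clone()).or_default().push(row_idx as u32);
    }
    groups
}
```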
+1, good to merge.
Purpose
Subtask of #232
Add TableWrite for writing Arrow RecordBatches to Paimon append-only tables. Each (partition, bucket) pair gets its own DataFileWriter with direct writes (matching delta-rs DeltaWriter pattern). File rolling uses tokio::spawn for background close, and prepare_commit uses try_join_all for parallel finalization across partition writers.
Key components:
- `TableWrite`: routes batches by partition/bucket, holds `DataFileWriter`s
- `DataFileWriter`: manages parquet file lifecycle with rolling support
- `WriteBuilder`: creates `TableWrite` and `TableCommit` instances
- `PaimonDataSink`: DataFusion `DataSink` integration for INSERT/OVERWRITE
- `FormatFileWriter`: extended with `flush()` and `in_progress_size()`
Configurable options via `CoreOptions`:
- `file.compression` (default: zstd)
- `target-file-size` (default: 256MB)
- `write.parquet-buffer-size` (default: 256MB)
Includes E2E integration tests for unpartitioned, partitioned, fixed-bucket, multi-commit, column projection, and bucket filtering.
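The file-rolling behavior described above can be modeled with a small sketch (struct and field names are illustrative, not the PR's exact API): accumulate the in-progress size as batches are written, and once it reaches the target file size, close the current file and start a new one.

```rust
/// Minimal model of file rolling: accumulate bytes, and when the
/// in-progress size reaches the target, "close" the file (record its
/// size) and start a new one. The real DataFileWriter instead closes
/// files on a background task (tokio::spawn) so writes are not blocked.
struct RollingWriter {
    target_file_size: u64,  // corresponds to the target-file-size option
    in_progress: u64,       // what in_progress_size() would report
    closed_files: Vec<u64>, // sizes of rolled (closed) files
}

impl RollingWriter {
    fn new(target_file_size: u64) -> Self {
        Self { target_file_size, in_progress: 0, closed_files: Vec::new() }
    }

    fn write(&mut self, bytes: u64) {
        self.in_progress += bytes;
        if self.in_progress >= self.target_file_size {
            // Roll: record the finished file and reset the counter.
            self.closed_files.push(self.in_progress);
            self.in_progress = 0;
        }
    }
}
```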
Brief change log
Tests
API and Format
Documentation