Commit 948324c

chore(compaction): add docs for append/pk compaction (#230)
1 parent ae7a66a commit 948324c

5 files changed

Lines changed: 230 additions & 17 deletions

ci/scripts/setup_ccache.sh

Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@ echo "PAIMON_USE_CCACHE=ON" >> $GITHUB_ENV
 echo "CCACHE_COMPILERCHECK=content" >> $GITHUB_ENV
 echo "CCACHE_DIR=${HOME}/.ccache" >> $GITHUB_ENV
-echo "CCACHE_MAXSIZE=1500M" >> $GITHUB_ENV
+echo "CCACHE_MAXSIZE=1G" >> $GITHUB_ENV
 echo "CCACHE_COMPRESS=true" >> $GITHUB_ENV
 echo "CCACHE_COMPRESSLEVEL=6" >> $GITHUB_ENV

docs/source/user_guide.rst

Lines changed: 1 addition & 0 deletions

@@ -29,6 +29,7 @@ User Guide
    user_guide/append_only_table
    user_guide/write
    user_guide/commit
+   user_guide/compaction
    user_guide/read
    user_guide/clean
    user_guide/prefetch
Lines changed: 213 additions & 0 deletions (new file)

@@ -0,0 +1,213 @@

.. Copyright 2026-present Alibaba Inc.

.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at

..     http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.

Compaction
==========

Compaction is the process of merging multiple small data files into fewer, larger
files. It is a resource-intensive procedure that consumes CPU time and disk IO,
so overly frequent compaction can slow down writes. Without compaction, however,
the accumulation of small files degrades query performance. Tuning compaction is
therefore a trade-off between write throughput and read efficiency.

.. note::

   - Only one job may compact a given partition at a time; concurrent
     compaction on the same partition causes conflicts.
   - C++ Paimon does not currently support producing changelog.
   - Compaction is disabled when ``write-only`` is set to ``true``, or when an
     append-only table uses dynamic bucketing (``bucket = -1``).
   - For a complete list of compaction-related configurations, see the
     :ref:`Options API Reference <cpp-api-options>`.
31+
32+
Append-Only Table Compaction
33+
----------------------------
34+
In append-only table, data files are simply appended in sequence order.
35+
Over time, many small files accumulate, which degrades read performance due to the
36+
overhead of opening and scanning numerous files.
37+
38+
Append-only table compaction merges multiple small files into fewer, larger files
39+
to improve read efficiency. The compaction is performed asynchronously and does
40+
not block writes.
41+
42+
.. note::
43+
Append-only table compaction is only available for fixed-bucket mode
44+
(``bucket > 0``). Dynamic bucketing (``bucket = -1``) does not support
45+
compaction. Tables with blob columns also skip compaction.
46+
47+
Auto Compaction
~~~~~~~~~~~~~~~

During each flush, the writer triggers a best-effort auto compaction. The
compaction picker scans the file queue, ordered by sequence number, and selects a
contiguous window of files for merging once the number of candidate files reaches
the ``compaction.min.file-num`` threshold.
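The selection rule above can be sketched as follows. This is an illustrative
Python sketch of the described behavior, not the actual C++ implementation: the
function name, the file representation, and the use of a target size to classify
"small" files are all assumptions.

.. code-block:: python

   def pick_compaction(files, min_file_num, target_size):
       """Pick a contiguous window of small files, ordered by sequence number,
       once at least min_file_num candidates accumulate (illustrative sketch)."""
       # files: (sequence_number, size_bytes) pairs, one per data file
       files = sorted(files)                  # order by sequence number
       window = []
       for seq, size in files:
           if size < target_size:
               window.append((seq, size))
               if len(window) >= min_file_num:
                   return window              # enough candidates: compact them
           else:
               window = []                    # a large file breaks the window
       return None                            # not enough small files yet

With the default ``compaction.min.file-num`` of 5, five consecutive small files
in the queue would be merged into one larger file.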
53+
54+
Full Compaction
55+
~~~~~~~~~~~~~~~
56+
Full compaction rewrites all eligible files in the bucket. During full
57+
compaction:
58+
59+
- Files whose size is already at or above ``compaction.file-size`` (and have no
60+
associated deletion vectors) are skipped to avoid unnecessary rewrites.
61+
- When deletion vectors are enabled, all files are always eligible for
62+
compaction regardless of size, because deletion vectors must be applied.
63+
- When ``compaction.force-rewrite-all-files`` is ``true``, all files are
64+
rewritten unconditionally.
65+
- Without deletion vectors, full compaction only proceeds when the number of
66+
small files exceeds the number of large files and the total file count is at
67+
least 3.
68+
69+
After compaction, if the last output file is still smaller than
70+
``compaction.file-size``, it is placed back into the compaction queue for future
71+
merging.
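The eligibility rules above can be distilled into a short sketch. This is a
simplified illustration, not the real implementation: deletion vectors are
reduced to a single table-level flag, and the default target size is an assumed
value.

.. code-block:: python

   def full_compaction_targets(sizes, dv_enabled, force_rewrite,
                               target_size=128 * 1024 * 1024):
       """Return the file sizes a full compaction would rewrite (sketch)."""
       if force_rewrite or dv_enabled:
           return list(sizes)                 # rewrite everything
       small = [s for s in sizes if s < target_size]
       large = [s for s in sizes if s >= target_size]
       # proceed only when small files outnumber large ones
       # and there are at least 3 files in total
       if len(small) > len(large) and len(sizes) >= 3:
           return small                       # large files are skipped
       return []                              # skip full compaction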
Append-Only Table Compaction Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 30 10 10 10 40

   * - Option
     - Required
     - Default
     - Type
     - Description
   * - ``compaction.min.file-num``
     - No
     - 5
     - Integer
     - The minimum number of files to trigger an auto compaction for
       append-only tables.
Primary Key Table Compaction
----------------------------

Primary key tables use an LSM tree (log-structured merge-tree) for file storage.
As more and more records are written, the number of sorted runs increases.
Because querying an LSM tree requires combining all sorted runs, too many sorted
runs result in poor query performance, or even out-of-memory errors.

To limit the number of sorted runs, several sorted runs are merged into one big
sorted run once in a while. Paimon currently adopts a compaction strategy similar
to RocksDB's `universal compaction
<https://github.com/facebook/rocksdb/wiki/Universal-Compaction>`_.

Primary key table compaction serves to:

- Reduce Level 0 files to avoid poor query performance.
- Produce deletion vectors for MOW mode.

Full Compaction
~~~~~~~~~~~~~~~

Paimon uses Universal Compaction. By default, Full Compaction is performed
automatically when there is too much incremental data, so you usually do not
have to worry about it.

Paimon also provides configurations that allow Full Compaction to run on a
regular schedule:

- ``compaction.optimization-interval``: How often to perform an optimization
  full compaction. This configuration ensures the query timeliness of the
  read-optimized system table.
- ``compaction.total-size-threshold``: Full compaction is constantly triggered
  when the total size is smaller than this threshold.
- ``compaction.incremental-size-threshold``: Full compaction is constantly
  triggered when the incremental size is bigger than this threshold.

Lookup Compaction
~~~~~~~~~~~~~~~~~

When a primary key table is configured with the ``lookup`` changelog producer or
the ``first-row`` merge engine, or has deletion vectors enabled for MOW mode,
Paimon uses a radical compaction strategy that force-compacts level 0 files to
higher levels on every compaction trigger.

Paimon also provides configurations to tune the frequency of this compaction:

- ``lookup-compact``: Compact mode used for lookup compaction. Possible values:

  * ``radical``: uses the ``ForceUpLevel0Compaction`` strategy to radically
    compact new files.
  * ``gentle``: uses the ``UniversalCompaction`` strategy to gently compact new
    files.

- ``lookup-compact.max-interval``: The maximum interval at which a forced L0
  lookup compaction is triggered in ``gentle`` mode. This option is only valid
  when ``lookup-compact`` mode is ``gentle``.

By configuring ``lookup-compact`` as ``gentle``, new files in L0 are not
compacted immediately. This can greatly reduce overall resource usage at the
expense of worse data freshness in certain cases.
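For example, a table that trades some data freshness for lower compaction cost
might carry options like the following. This map is illustrative only: how
options are supplied depends on your Paimon API, and the ``max-interval`` value
is a made-up example.

.. code-block:: python

   # Illustrative option map for a primary-key table in MOW mode
   # with gentle lookup compaction (values are example strings).
   options = {
       "deletion-vectors.enabled": "true",   # MOW mode: lookup compaction applies
       "lookup-compact": "gentle",           # UniversalCompaction instead of radical
       "lookup-compact.max-interval": "10",  # hypothetical forced-L0 interval
   }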
Primary Key Table Compaction Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Number of Sorted Runs to Pause Writing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When the number of sorted runs is small, Paimon writers perform compaction
asynchronously in separate threads, so records can be continuously written into
the table. However, to avoid unbounded growth of sorted runs, writers pause
writing when the number of sorted runs hits the threshold.

.. list-table::
   :header-rows: 1
   :widths: 30 10 10 10 40

   * - Option
     - Required
     - Default
     - Type
     - Description
   * - ``num-sorted-run.stop-trigger``
     - No
     - (none)
     - Integer
     - The number of sorted runs that triggers a write stop. The default
       value is ``num-sorted-run.compaction-trigger + 3``.

Write stalls become less frequent as ``num-sorted-run.stop-trigger`` grows,
improving write performance. However, if this value becomes too large, more
memory and CPU time are needed when querying the table.
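The default wiring of the two triggers can be sketched as follows, assuming the
pause kicks in once the sorted-run count reaches the threshold; the function is
illustrative, not part of the Paimon API.

.. code-block:: python

   def should_pause_writes(num_sorted_runs, compaction_trigger=5,
                           stop_trigger=None):
       """Sketch: decide whether a writer must stall (illustrative)."""
       if stop_trigger is None:
           # default per the table above: compaction-trigger + 3
           stop_trigger = compaction_trigger + 3
       return num_sorted_runs >= stop_trigger

With the default ``num-sorted-run.compaction-trigger`` of 5, writes would stall
once the bucket accumulates 8 sorted runs.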
Number of Sorted Runs to Trigger Compaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Paimon uses an LSM tree, which supports a large number of updates. An LSM tree
organizes files into several sorted runs. When querying records from an LSM
tree, all sorted runs must be combined to produce a complete view of all
records.

Too many sorted runs result in poor query performance. To keep the number of
sorted runs in a reasonable range, Paimon writers automatically perform
compactions. The following table property determines the minimum number of
sorted runs that triggers a compaction.

.. list-table::
   :header-rows: 1
   :widths: 30 10 10 10 40

   * - Option
     - Required
     - Default
     - Type
     - Description
   * - ``num-sorted-run.compaction-trigger``
     - No
     - 5
     - Integer
     - The number of sorted runs that triggers compaction. Includes level 0
       files (one file per sorted run) and high-level runs (one level per
       sorted run).

Compaction becomes less frequent as ``num-sorted-run.compaction-trigger`` grows,
improving write performance. However, if this value becomes too large, more
memory and CPU time are needed when querying the table. This is a trade-off
between write and query performance.

docs/source/user_guide/read.rst

Lines changed: 9 additions & 11 deletions

@@ -12,8 +12,8 @@
 .. See the License for the specific language governing permissions and
 .. limitations under the License.
 
-Read and Data Evolution
-===============================================
+Read
+====
 Paimon by functionality can be divided into two layers:
 
 - Control Plane: Responsible for accessing and managing Meta (snapshot, manifest, etc.), including:
@@ -25,7 +25,7 @@ Paimon by functionality can be divided into two layers:
 - Readers for various file formats
 - Coordinated reading of file collections
 
-The control plane and data plane interact primarily via DataSplit (the query plan). Java currently supports a standard
+The control plane and data plane interact primarily via DataSplit (the query plan). C++ Paimon currently supports a standard
 DataSplit protocol which includes the necessary meta information to access data files. With DataSplit, a high-performance
 data access path can be integrated.
 
@@ -39,10 +39,9 @@ across the two language ecosystems.
 
 
 Schema Evolution
-================
-
-Scope and Compatibility
 -----------------------
+Scope and Compatibility
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 C++ Paimon supports all evolution kinds available in Java Paimon for non-nested types:
 
@@ -62,12 +61,12 @@ C++ Paimon supports all evolution kinds available in Java Paimon for non-nested
 - Other operations are supported (consistent with Java Paimon).
 
 Per-File Schema via Field IDs
------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 In DataSplit, each file may have a completely different data schema. Paimon uses field IDs to uniquely identify fields.
 
 Overflow Behavior Disclaimer
-----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Overflow behavior is undefined for C++ and Java Paimon. Results in overflow scenarios may:
 
@@ -79,8 +78,7 @@ C++ Paimon does not guarantee identical results to Java Paimon in overflow scena
 return values between implementations.
 
 Type Change Support Matrix
---------------------------
-
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 The table below indicates support for changing a column type from ``source`` to ``target``. Refer to the numbered notes below the table
 for caveats.
 
@@ -292,7 +290,7 @@ for caveats.
 - Example input: ``1111111111111111111111111111111111111.15``, Java returns: ``1111111111111111111111111111111111111.2``, C++ returns: ``null``
 
 Implementation Guidance
------------------------
+~~~~~~~~~~~~~~~~~~~~~~~
 
 - Use DataSplit as the sole interface between control and data planes. Treat it as the canonical query plan contract.
 - Resolve field types and IDs per file; prefer inline data file metadata, fallback to table schema files when necessary.

docs/source/user_guide/write.rst

Lines changed: 6 additions & 5 deletions

@@ -12,8 +12,8 @@
 .. See the License for the specific language governing permissions and
 .. limitations under the License.
 
-Write And Prepare Commit
-========================
+Write
+=====
 Batch writing requires the compute engine to pre-bucket data (bucket), using the
 same bucketing strategy as Paimon to ensure correct ``Scan`` behavior, and to
 specify the target ``partition``. Data should be accumulated into ``RecordBatch``
@@ -42,6 +42,7 @@ Bucketing Modes
 
 - PK tables:
 
+  * Support ``bucket = -2`` (postpone bucket mode)
   * Support ``bucket > 0`` (fixed bucket mode)
 
 .. note::
@@ -111,15 +112,15 @@ The ``CommitMessage`` must encode all information required by the coordinator to
 produce a correct ``Snapshot``, which commonly includes (but is not limited to):
 
 - Partition and bucket identifiers associated with written data.
-- New data files, delete files, or changelog artifacts (as applicable to the table type).
+- New data files, delete files (as applicable to the table type).
 - File-level metadata required for manifest and index updates (e.g., row counts, min/max statistics where applicable).
 - Transactional markers and sequence numbers as required by table semantics.
 - Any per-writer state necessary for deduplication or idempotent commits.
 
 .. note::
 
-   Current C++ scope supports Append and PK tables. Changelog and index
-   artifacts are out of scope and should not be emitted in ``CommitMessage`` until
+   Current C++ scope supports Append and PK tables. Changelog is out of
+   scope and should not be emitted in ``CommitMessage`` until
    explicitly supported.
 
 Serialization and Deserialization
