Dov/column paged merge batcher by DAlperin · Pull Request #36627 · MaterializeInc/materialize

DAlperin · 2026-05-19T19:40:22Z

Stacked on #36552 which in turn is stacked on #36391

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Reuse buffers across iterations via iter_custom so allocator cost is paid once at setup. Read one u64 per page after take to force the kernel to actually fault pages in (relevant under memory pressure). 2 MiB single-chunk plus scatter sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
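The page-touch step described above can be sketched as follows. This is a minimal illustration, not the bench's actual code: the 4 KiB page size and the name `touch_pages` are assumptions, and real code would query the page size at runtime.

```rust
/// Assumed page size in bytes (hypothetical constant; real code would query it).
pub const PAGE_BYTES: usize = 4096;
/// Number of u64 words that fit in one assumed page.
pub const WORDS_PER_PAGE: usize = PAGE_BYTES / std::mem::size_of::<u64>();

/// Reads one u64 per page-sized stride of `buf` and folds the values into a
/// checksum, so the kernel has to make each visited page resident.
///
/// The stride starts at the beginning of `buf`; if the buffer is not
/// page-aligned, a trailing partial page may be left untouched.
pub fn touch_pages(buf: &[u64]) -> u64 {
    let mut acc = 0u64;
    for word in buf.iter().step_by(WORDS_PER_PAGE) {
        // `black_box` keeps the optimizer from eliding the load.
        acc = acc.wrapping_add(std::hint::black_box(*word));
    }
    acc
}
```

Returning the checksum gives the caller a value to consume, which is another guard against the loads being optimized away.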

Builds two chains of 2 MiB chunks then performs a merge pass that reads every cache line of both inputs and emits a new chain. Designed to be run under systemd-run with MemoryMax to simulate a working set that exceeds RAM, exposing real swap-eviction or disk-I/O cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
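One plausible shape for such a merge pass is sketched below. It assumes each chain is a list of sorted `u64` chunks and that the output is re-chunked at a fixed capacity; the name `merge_sorted_chains` and the element type are hypothetical and are not taken from the example in the PR.

```rust
/// Merges two chains of sorted chunks into a new chain whose chunks hold at
/// most `cap` elements each. Every input element is read exactly once.
pub fn merge_sorted_chains(
    a: Vec<Vec<u64>>,
    b: Vec<Vec<u64>>,
    cap: usize,
) -> Vec<Vec<u64>> {
    assert!(cap > 0, "chunk capacity must be positive");
    let mut left = a.into_iter().flatten().peekable();
    let mut right = b.into_iter().flatten().peekable();
    let mut out: Vec<Vec<u64>> = Vec::new();
    let mut cur: Vec<u64> = Vec::with_capacity(cap);
    loop {
        // Copy the heads out so no borrow of the iterators is held.
        let next = match (left.peek().copied(), right.peek().copied()) {
            (Some(l), Some(r)) => {
                if l <= r { left.next() } else { right.next() }
            }
            (Some(_), None) => left.next(),
            (None, Some(_)) => right.next(),
            (None, None) => break,
        };
        cur.push(next.expect("a peeked element is always present"));
        if cur.len() == cap {
            // Seal the full chunk and start a fresh one.
            out.push(std::mem::replace(&mut cur, Vec::with_capacity(cap)));
        }
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}
```

Under a memory cap, the cost of interest is not the comparison loop but the page faults taken while `left` and `right` walk chunks that have been evicted.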


Add `prefetch(&Handle)` and `prefetch_at(&Handle, offset, len)` to let callers overlap the next chunk's I/O with current chunk processing. The swap backend issues `MADV_WILLNEED`; the file backend opens the scratch file briefly and issues `posix_fadvise(POSIX_FADV_WILLNEED)`, both of which kick async kernel work and return promptly. The merge example now prefetches one chunk ahead. With a 32 GiB working set and 16 GiB cap on ext4, file-backend merge drops from 47.7 s to 45.2 s. Swap-backend merge is unchanged at ~141 s because under that much pressure the kernel is reclaim-bound, not stall-bound. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per project policy, raw `as` is disallowed in favor of mz_ore::cast::CastFrom, mz_ore::cast::CastLossy, or std::convert::TryFrom. The pager's pointer-arithmetic paths now use stable `*const T::addr()` and `byte_add` to keep provenance, with `cast::<U>()` and `cast_mut()` replacing pointer-type `as` casts. FFI integer arguments now go through `try_from` with explicit panics on overflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
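The cast-free style described above might look like the following. This is a sketch using only standard-library pointer methods (`addr`, `byte_add`, `cast`) and `TryFrom`; the function names are hypothetical, and the `mz_ore::cast` helpers are not shown because their exact signatures are not given here.

```rust
use std::convert::TryFrom;
use std::mem::size_of_val;

/// Returns how far `ptr` sits past the previous `align`-byte boundary.
/// `addr()` yields the address without an `as usize` cast.
pub fn misalignment(ptr: *const u8, align: usize) -> usize {
    assert!(align.is_power_of_two());
    ptr.addr() & (align - 1)
}

/// Reads the u64 stored `byte_offset` bytes into `words`.
pub fn read_u64_at(words: &[u64], byte_offset: usize) -> u64 {
    assert!(byte_offset % 8 == 0, "offset must be u64-aligned");
    let end = byte_offset.checked_add(8).expect("offset overflow");
    assert!(end <= size_of_val(words), "offset out of bounds");
    // SAFETY: the read is in bounds and aligned (checked above), and
    // `byte_add` derives the new pointer from the slice's own pointer.
    unsafe { words.as_ptr().byte_add(byte_offset).read() }
}

/// Views the start of a u64 buffer as a byte pointer via `cast`, not `as`.
pub fn first_byte_ptr(words: &[u64]) -> *const u8 {
    words.as_ptr().cast::<u8>()
}

/// Converts a length to the `i32` an FFI call might expect, panicking
/// explicitly on overflow instead of truncating silently.
pub fn ffi_len(len: usize) -> i32 {
    i32::try_from(len).expect("length exceeds i32::MAX")
}
```

The point of `try_from` over `as` is that an out-of-range value fails loudly at the call site rather than wrapping into a plausible-looking argument.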

Each worker gets a 1/threads share of the total chain, so the working set stays constant across thread counts. With cap = 16 GiB and total chain = 32 GiB, the file backend speeds up at 2 threads (64 -> 46 s, ~1.4x), regresses at 4, and recovers at 8; the swap backend cuts wall time by about 40% at 4 threads (215 -> 127 s) because kernel reclaim overlaps with other workers' compute. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oughput and perf data

Adds a section that captures the swap-vs-file trade-off as actually observed: file saturates the disk (1.47 GiB/s on encrypted NVMe), swap floors at ~0.36 GiB/s regardless of cap or parallelism. `perf stat` plus `/proc/vmstat` deltas show swap loses ~7x sys-time vs file because every 4 KiB readback page-faults synchronously on the user thread (5.2M minor-faults vs 4K, 2.1M pswpin vs 2.2K). Operational guidance: swap when resident, file when spilling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without `required-features` cargo tries to build the example with the default feature set, where `#![cfg(feature = "pager")]` strips the entire file and leaves no `main`. Declare the feature requirement so the example is skipped when the feature is off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Measured ~5% improvement on the file path at depth 16, within run-to-run variance, and zero on the swap path. Not worth the API surface for v1. Kernel readahead handles the file path adequately; swap is reclaim-bound under pressure and prefetch can't help. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…less-of-parallelism claim

The earlier "swap caps at ~0.36 GiB/s regardless of cap or parallelism" headline holds only at low thread counts. On a 64 vCPU box with two striped local NVMes, swap-backend merge scales 13× from 1 → 64 threads and reaches ~75% of file-backend throughput, because enough independent direct-reclaim contexts run in parallel to keep the swap stripe nearly busy.

Reorganize the operational characteristics section into two benches, encrypted NVMe (1.4 GB/s ceiling) and r8gd.16xlarge with striped instance NVMe (~7 GB/s ceiling), and add file-backend (1 TiB / cap 256G) and swap-backend (128 GiB / cap 32G) thread-scaling tables for the second.

Operational guidance now distinguishes low-thread (file wins ~3–5×) from high-thread (within ~25%) regimes and calls out the multi-tenant RSS argument as a separate reason to prefer file regardless of throughput.

Drop the dead `--prefetch-depth 4` reference; that flag was removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `pageout_with(backend, chunks)` alongside `pageout`. Lets callers select the backend per call instead of going through the global atomic, so layered consumers (next commit's column-pager) can route swap and file pageouts independently without racing other writers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
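The difference between the two entry points can be sketched as below. Everything here is a stand-in: the `Handle` fields, the `set_backend` helper, the `&mut [Vec<u64>]` chunk type, and the fact that "paging out" just drains the buffers are assumptions for illustration, not the real `mz_ore::pager` API.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// The two backends named in the PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Swap,
    File,
}

/// Process-wide default backend (0 = Swap, 1 = File).
static GLOBAL_BACKEND: AtomicU8 = AtomicU8::new(0);

/// Hypothetical setter for the global default.
pub fn set_backend(backend: Backend) {
    let raw = match backend {
        Backend::Swap => 0,
        Backend::File => 1,
    };
    GLOBAL_BACKEND.store(raw, Ordering::Relaxed);
}

fn global_backend() -> Backend {
    match GLOBAL_BACKEND.load(Ordering::Relaxed) {
        0 => Backend::Swap,
        _ => Backend::File,
    }
}

/// Stand-in handle that only records where the data went and how much.
pub struct Handle {
    pub backend: Backend,
    pub len_words: usize,
}

/// Per-call selection: the caller names the backend, so a concurrent
/// `set_backend` elsewhere cannot redirect this pageout.
pub fn pageout_with(backend: Backend, chunks: &mut [Vec<u64>]) -> Handle {
    let len_words = chunks.iter().map(Vec::len).sum();
    for chunk in chunks.iter_mut() {
        chunk.clear(); // the real pager would move or write the bytes
    }
    Handle { backend, len_words }
}

/// Global selection: reads the atomic once and delegates.
pub fn pageout(chunks: &mut [Vec<u64>]) -> Handle {
    pageout_with(global_backend(), chunks)
}
```

A layered consumer that wants swap for one column and file for the next can call `pageout_with` twice without ever touching the shared atomic.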

Bridges `mz_ore::pager` to typed `Column<C>` via `ContainerBytes`. Callers drain a column into a `PagedColumn<C>` and rehydrate on demand; backend and compression are decided per call by an injected `PagingPolicy`, not the pager's global atomic.

Three resting variants cover the matrix:

* `Resident(Column<C>)`: policy returned `Skip`.
* `Paged { handle, meta }`: raw u64-aligned bytes via `pager::Handle`.
* `Compressed { inner, meta }`: lz4 frame; bytes live either in memory or in a `pager::Handle` (padded to u64).

Fast paths:

* `Column::Align(Vec<u64>)` uncompressed: moves the body Vec into the handle, no copy on the swap backend.
* Compressed: `FrameEncoder` wraps the target so `into_bytes` streams serialized bytes straight through lz4 with no uncompressed staging.
* Compressed file: the frame trailer self-delimits, so no `compressed_len` field and no unpad on read.

Tests cover skip, swap/file × uncompressed/lz4 round trips, and the align-variant fast path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
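The three resting states and the u64 padding step can be pictured with the sketch below. The payload types are stand-ins (plain `Vec`s instead of `Column<C>` and `pager::Handle`), the `len_words` field replaces the unspecified `meta`, and `pad_to_words` is a hypothetical helper; only the variant names come from the commit message.

```rust
/// Stand-in for `pager::Handle`: u64-aligned storage owned by the pager.
pub struct Handle {
    pub words: Vec<u64>,
}

/// Where a compressed frame's bytes rest.
pub enum FrameBytes {
    Memory(Vec<u8>),
    Paged(Handle),
}

/// Simplified, non-generic version of the three resting variants.
pub enum PagedColumn {
    /// The policy chose to skip paging; the column stays in memory.
    Resident(Vec<u64>),
    /// Raw u64-aligned bytes behind a pager handle.
    Paged { handle: Handle, len_words: usize },
    /// A compressed frame, in memory or behind a handle.
    Compressed { inner: FrameBytes, len_words: usize },
}

/// Packs a byte frame into u64 words, zero-padding the last word.
/// A self-delimiting frame can be decoded from the padded words without
/// recording the unpadded length separately.
pub fn pad_to_words(bytes: &[u8]) -> Vec<u64> {
    bytes
        .chunks(8)
        .map(|chunk| {
            let mut word = [0u8; 8];
            word[..chunk.len()].copy_from_slice(chunk);
            u64::from_le_bytes(word)
        })
        .collect()
}

/// Moves a compressed frame behind a (stand-in) handle.
pub fn frame_to_handle(frame: &[u8]) -> Handle {
    Handle { words: pad_to_words(frame) }
}
```

The padding is why the commit notes that the frame trailer self-delimits: the reader stops at the end of the frame and never has to strip the trailing zero bytes.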

Adds `ResidentTicket`, a drop guard carried inside `PagedColumn::Resident` that fires a new `PageEvent::ResidentReleased { bytes }` when the resident column is consumed via `ColumnPager::take` or dropped without being taken. Lets policies track outstanding resident memory without leaking budget if a caller drops a column unexpectedly.

Introduces `TieredPolicy` in `column_pager::policy`. Each Timely worker draws from a fixed per-worker byte budget; once exhausted it falls back to a process-wide shared pool, and only when both are full does it page out via a configured backend and codec.

Per-worker state lives in a `thread_local!` static so worker threads see independent counters. This limits the design to one `TieredPolicy` per process, which is sufficient for the expected configuration, and the constraint is documented.

Release order returns budget to the shared pool first so other workers unblock sooner. The shared pool is a single `AtomicUsize` consumed via a CAS loop; only the cold fallback path touches it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
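The two-tier budget and its drop guard can be sketched as follows. The budget sizes, the `reserve` function, and the all-or-nothing choice between tiers are assumptions made for illustration; the commit does not say whether a single request can be split across tiers, and the real guard fires a `PageEvent` rather than adjusting counters directly.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical budgets in bytes.
const PER_WORKER_BUDGET: usize = 1024;
const SHARED_BUDGET: usize = 4096;

/// Remaining bytes in the process-wide shared pool.
static SHARED_REMAINING: AtomicUsize = AtomicUsize::new(SHARED_BUDGET);

thread_local! {
    /// Remaining bytes in this worker thread's private budget.
    static LOCAL_REMAINING: Cell<usize> = Cell::new(PER_WORKER_BUDGET);
}

/// Drop guard recording which tier granted the bytes.
pub struct ResidentTicket {
    local: usize,
    shared: usize,
}

/// CAS loop: subtract `bytes` from the shared pool if it still fits.
fn try_take_shared(bytes: usize) -> bool {
    let mut cur = SHARED_REMAINING.load(Ordering::Relaxed);
    loop {
        if cur < bytes {
            return false;
        }
        match SHARED_REMAINING.compare_exchange_weak(
            cur,
            cur - bytes,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return true,
            Err(seen) => cur = seen, // another worker raced us; retry
        }
    }
}

/// Tries the worker budget, then the shared pool. `None` means both tiers
/// are full and the caller should page the column out.
pub fn reserve(bytes: usize) -> Option<ResidentTicket> {
    let took_local = LOCAL_REMAINING.with(|rem| {
        let fits = rem.get() >= bytes;
        if fits {
            rem.set(rem.get() - bytes);
        }
        fits
    });
    if took_local {
        return Some(ResidentTicket { local: bytes, shared: 0 });
    }
    if try_take_shared(bytes) {
        return Some(ResidentTicket { local: 0, shared: bytes });
    }
    None
}

impl Drop for ResidentTicket {
    fn drop(&mut self) {
        // Shared pool first, so other workers unblock sooner.
        SHARED_REMAINING.fetch_add(self.shared, Ordering::AcqRel);
        LOCAL_REMAINING.with(|rem| rem.set(rem.get() + self.local));
    }
}

/// Test/diagnostic accessor for the shared pool.
pub fn shared_remaining() -> usize {
    SHARED_REMAINING.load(Ordering::Relaxed)
}
```

In this sketch a ticket dropped on a different thread would credit the wrong worker's counter, so it assumes tickets are released on the thread that reserved them.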

Measures round-trip (`page` + `take`) and operator-loop (`page` with column reuse) throughput across three axes: column size (4 KiB, 256 KiB, 4 MiB), pager backend (Swap, File), and codec (uncompressed, lz4). 24 cases total, throughput reported in bytes/sec via Criterion's `Throughput`.

Run with:

    cargo bench -p mz-timely-util --bench column_pager

The bench uses an `AlwaysPage` stub policy so every iteration exercises the paging path rather than the resident fast path. Smoke-tested at 4 KiB/swap/raw at ~8.6 GiB/s on a development laptop, which is close to the underlying pager's memcpy ceiling and confirms the column-pager layer adds no measurable overhead at that size.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
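The count of 24 is consistent with two measurement modes crossed with the three axes (2 × 3 × 2 × 2). The enumeration below makes that explicit; the case-name format and the mode labels are assumptions, and the real bench registers its cases with Criterion rather than building strings.

```rust
/// Enumerates the assumed bench matrix: mode × size × backend × codec.
pub fn bench_cases() -> Vec<String> {
    let modes = ["round_trip", "operator_loop"];
    let sizes = ["4 KiB", "256 KiB", "4 MiB"];
    let backends = ["swap", "file"];
    let codecs = ["raw", "lz4"];
    let mut cases = Vec::new();
    for mode in modes {
        for size in sizes {
            for backend in backends {
                for codec in codecs {
                    cases.push(format!("{mode}/{size}/{backend}/{codec}"));
                }
            }
        }
    }
    cases
}
```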

The pager's swap backend keeps the body Vec resident and hints MADV_COLD; the kernel evicts only under memory pressure. The column_pager bench round-trips one column at a time and never builds enough working set to trigger eviction, so swap-backend numbers measure the in-memory fast path (Vec move + bookkeeping), not the cost of a page-in from disk. Relabel the axis as `swap-warm` to make the distinction visible in every measurement name, and add a module-level caveat explaining what the numbers do and don't represent. A follow-up `column_pager_pressure` bench under `systemd-run --user --scope -p MemoryMax=...` will exercise the real eviction path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Declares lz4_flex in `[workspace.dependencies]` so `mz-timely-util`'s `lz4_flex.workspace = true` resolves. Required by the column-paged merge batcher's optional lz4 codec; the dep was referenced before being declared and broke workspace loading.

Adds a Materialize-private merge-batcher that routes per-chunk transient state through `ColumnPager`, bounding the resident-bytes peak under memory pressure. Behind `enable_column_paged_batcher` (default off).

Three building blocks in `mz-timely-util`:

* `ColumnMergeBatcher` + `merge_chains` + `extract_chain` in `columnar/merge_batcher.rs`: chains hold `PagedColumn` entries that resolve to disk on demand. Reuses the existing `Column::merge_from` / `Column::extract` building blocks.
* `BuilderInput for Column<((K, V), T, R)>` so DD `OrdValBuilder` can consume the batcher's output without a container conversion.
* `column_pager` gains a process-global pager singleton (matching the lower-level pager's global-atomic design) and a per-decision skip/page counter for diagnostics.

Compute integration:

* `RowRowColPagedBuilder` alias + `PartialEq<&RowRef> for DatumSeq` / `PushInto<&RowRef> for DatumContainer` so the Row-keyed arrange path type-checks.
* Worker init in `apply_worker_config` reads three new dyncfgs and installs the process-global pager: `enable_column_paged_batcher` (on/off), `column_paged_batcher_backend` (`swap` | `file`), `column_paged_batcher_budget_fraction` (fraction of replica memory, default 5%). Per-worker / shared pool sizes derive from `memory_limiter::get_memory_limit` with sensible floors and caps.
* Two arrange call sites switched to the paged path: `render/context.rs::arrange_collection` (central ArrangeBy) and `render/join/linear_join.rs::JoinStage`. Other arrange sites (logging) left on the legacy `ColInternalMerger` path.

Also extends `Materialized` and `Clusterd` mzcompose services to accept `memory_swap` and `mem_swappiness`, so callers can configure container-level swap behavior independent of the batcher.

Adds three pieces of validation tooling for the column-paged merge batcher: a criterion microbench, an end-to-end timely example, and feature-benchmark scenarios.

Criterion bench (`src/timely-util/benches/columnar_merge_batcher.rs`): compares the legacy `ColumnMerger` against the new path with disabled / swap / lz4 pagers across three input regimes (mixed, collisions, disjoint) and four cache-tier sizes. Prints a throughput summary table when the group finishes. Good for per-chunk-merge perf comparisons; doesn't exercise the dataflow operator graph.

End-to-end example (`src/timely-util/examples/column_paged_spill.rs`): drives `arrange_core` over a cancellation workload (positives + negatives at the same time, so the spine stays empty and all pressure lives in the merge-batcher). Configurable workers / records / budget; back-to-back baseline + spill modes; optional RSS sampler thread via `ps`. Modeled on `differential-dataflow/examples/columnar_spill.rs` but uses our `Col2ValPagedBatcher` + `ColumnPager` + `TieredPolicy` directly instead of DD's `SpillBatcher`/`Threshold`/`FileSpill` plumbing. `cargo run --release --example column_paged_spill` for a smoke test; see `--help` for sweep options.

Feature-benchmark scenarios (`misc/python/.../scenarios/benchmark_main.py`):

* `DifferentialJoinColumnPaged`: same query shape as `DifferentialJoin`, paged batcher enabled. Measures steady-state overhead vs the legacy path.
* `DifferentialJoinHydrationBaseline` / `DifferentialJoinHydrationFile`: sister leaves of a non-runnable `DifferentialJoinHydration` parent. Each measures the time to re-hydrate a linear-join arrangement after `REPLICATION FACTOR 0 -> 1` toggling. Baseline has the paged batcher off; File enables it with the file backend and `budget_fraction = 0.01` so chunks spill rather than competing with the spine for RAM. Compare under `--this-memory` + `--this-memory-swap` to evaluate user-space spill vs OS swap.

Feature-benchmark CLI plumbing (`test/feature-benchmark/mzcompose.py`): adds `--this-memory`, `--this-memory-swap`, `--this-mem-swappiness` (and `--other-*` companions) so memory caps and swap behavior are configurable per side, plus `--skip-other` for iterating on `this` without the comparison round trip. The benchmark-result evaluator tolerates the single-side case by returning `None` ratios instead of indexing past the end of `_points`.

antiguru and others added 30 commits May 18, 2026 11:31

ore: add pager feature flag

4b52052

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: skeleton mz_ore::pager module with Backend enum

13d3239

ore: pager scratch dir lifecycle and stale-subdir reaper

8dea6b4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager Handle type and inner storage scaffolding

01d5359

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager swap backend pageout with MADV_COLD

027334b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager swap backend read_at_many

1c27263

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager swap backend take with zero-copy fast path

c1522ab

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager public dispatch surface (pageout/read_at/take)

b9e1d10

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend pageout with pwritev

70f035c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend read_at_many with coalescing

8c5b92b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager file backend take and drop reclaim

8db7d6c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager cross-backend integration tests

83f8c34

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager Criterion bench harness

84189cd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager clippy + lint cleanups (write_vectored, cast_from, exhaustive matches)

e584512

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: pager copyright headers and test-attribute lint compliance

e469b6b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ore: update Cargo.lock for pager tempfile dev-dep

7569984

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DAlperin added 3 commits May 19, 2026 11:24

cargo: add lz4_flex workspace dep

898640b

Declares lz4_flex in `[workspace.dependencies]` so `mz-timely-util`'s `lz4_flex.workspace = true` resolves. Required by the column-paged merge batcher's optional lz4 codec; the dep was referenced before being declared and broke workspace loading.

DAlperin wants to merge 33 commits into MaterializeInc:main from DAlperin:dov/column-paged-merge-batcher

DAlperin commented May 19, 2026


2 participants
