Dov/column paged merge batcher #36627

Draft
DAlperin wants to merge 33 commits into MaterializeInc:main from DAlperin:dov/column-paged-merge-batcher

@DAlperin
Member

Stacked on #36552, which in turn is stacked on #36391.

antiguru and others added 30 commits May 18, 2026 11:31
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ive matches)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reuse buffers across iterations via iter_custom so allocator cost is
paid once at setup. Read one u64 per page after take to force the
kernel to actually fault pages in (relevant under memory pressure).
2 MiB single-chunk plus scatter sweep.
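
The page-touch step described above can be sketched in plain Rust. This is a minimal illustration, not the bench's code: `touch_pages`, the 4 KiB `PAGE_BYTES` constant, and the wrapping sum are assumptions, and the Criterion `iter_custom` harness around it is left out.

```rust
use std::hint::black_box;

/// Assumed page size. A real bench would query the OS instead.
const PAGE_BYTES: usize = 4096;
const WORDS_PER_PAGE: usize = PAGE_BYTES / std::mem::size_of::<u64>();

/// Reads one `u64` at a page-sized stride across `buf` and returns the
/// wrapping sum of the values read.
///
/// Consecutive reads are exactly `PAGE_BYTES` apart, so each one lands in a
/// different page. The buffer is not required to be page-aligned, so a
/// trailing partial page may go untouched.
pub fn touch_pages(buf: &[u64]) -> u64 {
    let mut acc = 0u64;
    for word in buf.iter().step_by(WORDS_PER_PAGE) {
        // `black_box` keeps the optimizer from eliding the load.
        acc = acc.wrapping_add(black_box(*word));
    }
    acc
}
```

Returning the sum gives the caller a value to consume, which is one common way to keep a benchmark loop from being optimized away.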

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds two chains of 2 MiB chunks then performs a merge pass that
reads every cache line of both inputs and emits a new chain. Designed
to be run under systemd-run with MemoryMax to simulate a working set
that exceeds RAM, exposing real swap-eviction or disk-I/O cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `prefetch(&Handle)` and `prefetch_at(&Handle, offset, len)` to let
callers overlap the next chunk's I/O with current chunk processing.
The swap backend issues `MADV_WILLNEED`; the file backend opens the
scratch file briefly and issues `posix_fadvise(POSIX_FADV_WILLNEED)`,
both of which kick async kernel work and return promptly.

The merge example now prefetches one chunk ahead. With a 32 GiB working
set and 16 GiB cap on ext4, file-backend merge drops from 47.7 s to
45.2 s. Swap-backend merge is unchanged at ~141 s because under that
much pressure the kernel is reclaim-bound, not stall-bound.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per project policy, raw `as` is disallowed in favor of mz_ore::cast::CastFrom,
mz_ore::cast::CastLossy, or std::convert::TryFrom. The pager's pointer-arithmetic
paths now use stable `<*const T>::addr()` and `byte_add` to keep provenance, with
`cast::<U>()` and `cast_mut()` replacing pointer-type `as` casts. FFI integer
arguments now go through `try_from` with explicit panics on overflow.
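
As an illustration of the cast-free style this commit describes, here is a small sketch using only the standard library. The function names are hypothetical and not taken from the pager, and `addr()` assumes a toolchain where the strict-provenance pointer methods are stable.

```rust
/// Byte distance from `base` to `ptr`, computed from address-only views of
/// both pointers instead of `ptr as usize`.
pub fn byte_offset(base: *const u8, ptr: *const u8) -> usize {
    ptr.addr() - base.addr()
}

/// Views a word buffer as raw bytes with `cast::<u8>()` rather than
/// `as *const u8`.
pub fn as_byte_ptr(words: &[u64]) -> *const u8 {
    words.as_ptr().cast::<u8>()
}

/// Reads the `u64` that starts `offset` bytes into `words`.
pub fn read_u64_at(words: &[u64], offset: usize) -> u64 {
    let bytes = std::mem::size_of_val(words);
    assert!(offset % 8 == 0, "offset must be u64-aligned");
    assert!(bytes >= 8 && offset <= bytes - 8, "offset out of bounds");
    let base: *const u64 = words.as_ptr();
    // SAFETY: the assertions above guarantee the read is in bounds and
    // aligned. `byte_add` derives the new pointer from `base`, so it keeps
    // the provenance of the original allocation.
    unsafe { base.byte_add(offset).read() }
}

/// Converts a length into a signed FFI-style integer, panicking with an
/// explicit message if it does not fit.
pub fn ffi_len(len: usize) -> i64 {
    i64::try_from(len).expect("length must fit in i64")
}
```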

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each worker gets a 1/threads share of the total chain so working set
stays constant across thread counts. Cap=16 GiB / total chain=32 GiB:
file backend speeds up at 2 threads (64 -> 46 s, ~1.4x), regresses at
4, recovers at 8; swap backend cuts wall time ~1.7x at 4 threads (215 -> 127 s)
because kernel reclaim overlaps with other workers' compute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oughput and perf data

Adds a section that captures the swap-vs-file trade-off as actually
observed: file saturates the disk (1.47 GiB/s on encrypted NVMe), swap
floors at ~0.36 GiB/s regardless of cap or parallelism. perf stat plus
/proc/vmstat deltas show swap spends ~7x more sys time than file because every
4 KiB readback page-faults synchronously on the user thread (5.2M
minor-faults vs 4K, 2.1M pswpin vs 2.2K). Operational guidance: swap
when resident, file when spilling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without `required-features` cargo tries to build the example with the
default feature set, where `#![cfg(feature = "pager")]` strips the
entire file and leaves no `main`. Declare the feature requirement so
the example is skipped when the feature is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measured ~5% improvement on the file path at depth 16, within run-to-run
variance, and zero on the swap path. Not worth the API surface for v1.
Kernel readahead handles the file path adequately; swap is reclaim-bound
under pressure and prefetch can't help.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…less-of-parallelism claim

The earlier "swap caps at ~0.36 GiB/s regardless of cap or parallelism"
headline holds only at low thread counts. On a 64 vCPU box with two striped
local NVMes, swap-backend merge scales 13× from 1 → 64 threads and reaches
~75% of file-backend throughput, because enough independent direct-reclaim
contexts run in parallel to keep the swap stripe nearly busy.

Reorganize the operational characteristics section into two benches —
encrypted NVMe (1.4 GB/s ceiling) and r8gd.16xlarge with striped instance
NVMe (~7 GB/s ceiling) — and add file-backend (1 TiB / cap 256G) and
swap-backend (128 GiB / cap 32G) thread-scaling tables for the second.
Operational guidance now distinguishes low-thread (file wins ~3–5×) from
high-thread (within ~25%) regimes and calls out the multi-tenant RSS
argument as a separate reason to prefer file regardless of throughput.

Drop the dead --prefetch-depth 4 reference; that flag was removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `pageout_with(backend, chunks)` alongside `pageout`. Lets callers
select the backend per call instead of going through the global atomic,
so layered consumers (next commit's column-pager) can route swap and
file pageouts independently without racing other writers.
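
The difference between the two entry points can be sketched as follows. Everything here is a stand-in written for illustration: the `Backend` enum, the `Handle` fields, `set_backend`, and both signatures are assumptions, and the body only counts words rather than paging anything out.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Swap,
    File,
}

/// Process-wide default backend, stored as a small integer tag.
static GLOBAL_BACKEND: AtomicU8 = AtomicU8::new(0);

pub fn set_backend(backend: Backend) {
    let tag = match backend {
        Backend::Swap => 0,
        Backend::File => 1,
    };
    GLOBAL_BACKEND.store(tag, Ordering::Relaxed);
}

fn global_backend() -> Backend {
    match GLOBAL_BACKEND.load(Ordering::Relaxed) {
        0 => Backend::Swap,
        _ => Backend::File,
    }
}

/// Hypothetical handle that records where the data went.
#[derive(Debug)]
pub struct Handle {
    pub backend: Backend,
    pub len_words: usize,
}

/// Per-call variant: the caller names the backend, so the global is never read.
pub fn pageout_with(backend: Backend, chunks: &mut [Vec<u64>]) -> Handle {
    let len_words: usize = chunks.iter().map(Vec::len).sum();
    for chunk in chunks.iter_mut() {
        chunk.clear(); // stand-in for handing the bytes to the backend
    }
    Handle { backend, len_words }
}

/// Global variant: defers to whatever backend is currently installed.
pub fn pageout(chunks: &mut [Vec<u64>]) -> Handle {
    pageout_with(global_backend(), chunks)
}
```

Because `pageout_with` never consults the global, a layered caller can route to the file backend while another writer changes the process-wide default, and neither observes the other's choice.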

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bridges `mz_ore::pager` to typed `Column<C>` via `ContainerBytes`.
Callers drain a column into a `PagedColumn<C>` and rehydrate on demand;
backend and compression are decided per call by an injected
`PagingPolicy`, not the pager's global atomic.

Three resting variants cover the matrix:

* `Resident(Column<C>)` — policy returned `Skip`.
* `Paged { handle, meta }` — raw u64-aligned bytes via `pager::Handle`.
* `Compressed { inner, meta }` — lz4 frame; bytes live either in memory
  or in a `pager::Handle` (padded to u64).
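
One way to picture the three resting variants is as a Rust enum. This is a sketch under assumptions: `Column`, `Handle`, `Meta`, and `CompressedBytes` are simplified stand-ins, and the field layouts are guesses rather than the PR's definitions.

```rust
/// Stand-in for the typed column container.
pub struct Column<C>(pub Vec<C>);

/// Stand-in for `pager::Handle`.
pub struct Handle {
    pub len_words: usize,
}

/// Stand-in for whatever metadata is needed to rehydrate a column.
pub struct Meta {
    pub len_bytes: usize,
}

/// Where an lz4 frame lives: in memory, or behind a pager handle.
pub enum CompressedBytes {
    InMemory(Vec<u8>),
    Paged(Handle),
}

pub enum PagedColumn<C> {
    /// The policy returned `Skip`, so the column stays as it was.
    Resident(Column<C>),
    /// Raw u64-aligned bytes held by the pager.
    Paged { handle: Handle, meta: Meta },
    /// An lz4 frame, stored in memory or in the pager.
    Compressed { inner: CompressedBytes, meta: Meta },
}

impl<C> PagedColumn<C> {
    /// Short label for the resting state, useful for diagnostics.
    pub fn describe(&self) -> &'static str {
        match self {
            PagedColumn::Resident(_) => "resident",
            PagedColumn::Paged { .. } => "paged",
            PagedColumn::Compressed {
                inner: CompressedBytes::InMemory(_),
                ..
            } => "lz4 in memory",
            PagedColumn::Compressed {
                inner: CompressedBytes::Paged(_),
                ..
            } => "lz4 in pager",
        }
    }
}
```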

Fast paths:

* `Column::Align(Vec<u64>)` uncompressed — moves the body Vec into the
  handle, no copy on the swap backend.
* Compressed — `FrameEncoder` wraps the target so `into_bytes` streams
  serialized bytes straight through lz4 with no uncompressed staging.
* Compressed file — the frame trailer self-delimits, so no
  `compressed_len` field and no unpad on read.

Tests cover skip, swap/file × uncompressed/lz4 round trips, and the
align-variant fast path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `ResidentTicket`, a drop guard carried inside `PagedColumn::Resident`
that fires a new `PageEvent::ResidentReleased { bytes }` when the resident
column is consumed via `ColumnPager::take` or dropped without being
taken. Lets policies track outstanding resident memory without leaking
budget if a caller drops a column unexpectedly.
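
The drop-guard idea can be sketched with standard-library types only. In this illustration a shared counter stands in for the policy, and decrementing it stands in for firing `PageEvent::ResidentReleased { bytes }`. `ResidentBytes`, `Resident`, and the method names are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for a policy's view of outstanding resident bytes.
#[derive(Default)]
pub struct ResidentBytes(AtomicUsize);

impl ResidentBytes {
    pub fn get(&self) -> usize {
        self.0.load(Ordering::Acquire)
    }
}

/// Guard that charges `bytes` on creation and refunds them on drop.
pub struct ResidentTicket {
    bytes: usize,
    tracker: Arc<ResidentBytes>,
}

impl ResidentTicket {
    pub fn new(bytes: usize, tracker: Arc<ResidentBytes>) -> Self {
        tracker.0.fetch_add(bytes, Ordering::AcqRel);
        ResidentTicket { bytes, tracker }
    }
}

impl Drop for ResidentTicket {
    fn drop(&mut self) {
        // Runs on both paths: an explicit take and an unexpected drop.
        self.tracker.0.fetch_sub(self.bytes, Ordering::AcqRel);
    }
}

/// A resident column that carries its ticket with it.
pub struct Resident<C> {
    column: Vec<C>,
    ticket: ResidentTicket,
}

impl<C> Resident<C> {
    pub fn new(column: Vec<C>, tracker: Arc<ResidentBytes>) -> Self {
        let bytes = column.len() * std::mem::size_of::<C>();
        Resident { column, ticket: ResidentTicket::new(bytes, tracker) }
    }

    /// Consumes the wrapper, releasing the ticket and returning the data.
    pub fn take(self) -> Vec<C> {
        let Resident { column, ticket } = self;
        drop(ticket);
        column
    }
}
```

Tying the refund to `Drop` means no caller has to remember to release budget, which is the property the commit message calls out.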

Introduces `TieredPolicy` in `column_pager::policy`. Each Timely worker
draws from a fixed per-worker byte budget; once exhausted it falls back
to a process-wide shared pool, and only when both are full does it page
out via a configured backend and codec. Per-worker state lives in a
`thread_local!` static so worker threads see independent counters. This
limits the design to one `TieredPolicy` per process — sufficient for the
expected configuration, and the constraint is documented.

Release order returns budget to the shared pool first so other workers
unblock sooner. The shared pool is a single `AtomicUsize` consumed via a
CAS loop; only the cold fallback path touches it.
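
The tiering and release order described above can be sketched as follows. This is a simplified model under assumptions, not the PR's `TieredPolicy`: the `Decision` enum, the method names, and the `SHARED_BORROWED` counter used to decide how much to return to the shared pool are all invented for illustration.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

thread_local! {
    /// Bytes this worker has charged against its own budget.
    static LOCAL_USED: Cell<usize> = Cell::new(0);
    /// Bytes this worker currently holds from the shared pool.
    static SHARED_BORROWED: Cell<usize> = Cell::new(0);
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Decision {
    Worker,
    Shared,
    PageOut,
}

pub struct TieredPolicy {
    per_worker_budget: usize,
    shared_free: AtomicUsize,
}

impl TieredPolicy {
    pub fn new(per_worker_budget: usize, shared_budget: usize) -> Self {
        TieredPolicy { per_worker_budget, shared_free: AtomicUsize::new(shared_budget) }
    }

    pub fn shared_free(&self) -> usize {
        self.shared_free.load(Ordering::Acquire)
    }

    /// Tries the per-worker budget, then the shared pool, then gives up.
    pub fn admit(&self, bytes: usize) -> Decision {
        let fits_locally = LOCAL_USED.with(|used| {
            let free = self.per_worker_budget.saturating_sub(used.get());
            if bytes <= free {
                used.set(used.get() + bytes);
                true
            } else {
                false
            }
        });
        if fits_locally {
            return Decision::Worker;
        }
        // Cold path: claim from the shared pool with a CAS loop.
        let mut free = self.shared_free.load(Ordering::Relaxed);
        while free >= bytes {
            match self.shared_free.compare_exchange_weak(
                free,
                free - bytes,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => {
                    SHARED_BORROWED.with(|b| b.set(b.get() + bytes));
                    return Decision::Shared;
                }
                Err(actual) => free = actual,
            }
        }
        Decision::PageOut
    }

    /// Returns budget to the shared pool first, then to the worker.
    pub fn release(&self, bytes: usize) {
        let to_shared = SHARED_BORROWED.with(|b| {
            let n = bytes.min(b.get());
            b.set(b.get() - n);
            n
        });
        if to_shared > 0 {
            self.shared_free.fetch_add(to_shared, Ordering::AcqRel);
        }
        LOCAL_USED.with(|used| used.set(used.get().saturating_sub(bytes - to_shared)));
    }
}
```

Because the per-worker counters live in `thread_local!` statics rather than in the struct, two policy instances in one process would share them, which mirrors the one-policy-per-process constraint the commit message documents.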

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Measures round-trip (`page` + `take`) and operator-loop (`page` with
column reuse) throughput across three axes: column size (4 KiB, 256 KiB,
4 MiB), pager backend (Swap, File), and codec (uncompressed, lz4). 24
cases total, throughput reported in bytes/sec via Criterion's `Throughput`.

Run with:

    cargo bench -p mz-timely-util --bench column_pager

The bench uses an `AlwaysPage` stub policy so every iteration exercises
the paging path rather than the resident fast path. Smoke-tested at
4 KiB/swap/raw at ~8.6 GiB/s on a development laptop, which is close to
the underlying pager's memcpy ceiling and confirms the column-pager
layer adds no measurable overhead at that size.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pager's swap backend keeps the body Vec resident and hints
MADV_COLD; the kernel evicts only under memory pressure. The
column_pager bench round-trips one column at a time and never builds
enough working set to trigger eviction, so swap-backend numbers measure
the in-memory fast path (Vec move + bookkeeping), not the cost of a
page-in from disk.

Relabel the axis as `swap-warm` to make the distinction visible in
every measurement name, and add a module-level caveat explaining what
the numbers do and don't represent. A follow-up `column_pager_pressure`
bench under `systemd-run --user --scope -p MemoryMax=...` will exercise
the real eviction path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DAlperin added 3 commits May 19, 2026 11:24
Declares lz4_flex in `[workspace.dependencies]` so `mz-timely-util`'s `lz4_flex.workspace = true` resolves. Required by the column-paged merge batcher's optional lz4 codec; the dep was referenced before being declared and broke workspace loading.

Adds a Materialize-private merge-batcher that routes per-chunk transient state through `ColumnPager`, bounding the resident-bytes peak under memory pressure. Behind `enable_column_paged_batcher` (default off).

Three building blocks in `mz-timely-util`:

* `ColumnMergeBatcher` + `merge_chains` + `extract_chain` in `columnar/merge_batcher.rs`: chains hold `PagedColumn` entries that resolve to disk on demand. Reuses the existing `Column::merge_from` / `Column::extract` building blocks.
* `BuilderInput for Column<((K, V), T, R)>` so DD `OrdValBuilder` can consume the batcher's output without a container conversion.
* `column_pager` gains a process-global pager singleton (matching the lower-level pager's global-atomic design) and a per-decision skip/page counter for diagnostics.

Compute integration:

* `RowRowColPagedBuilder` alias + `PartialEq<&RowRef> for DatumSeq` / `PushInto<&RowRef> for DatumContainer` so the Row-keyed arrange path type-checks.
* Worker init in `apply_worker_config` reads three new dyncfgs and installs the process-global pager: `enable_column_paged_batcher` (on/off), `column_paged_batcher_backend` (`swap` | `file`), `column_paged_batcher_budget_fraction` (fraction of replica memory, default 5%). Per-worker / shared pool sizes derive from `memory_limiter::get_memory_limit` with sensible floors and caps.
* Two arrange call sites switched to the paged path: `render/context.rs::arrange_collection` (central ArrangeBy) and `render/join/linear_join.rs::JoinStage`. Other arrange sites (logging) left on the legacy `ColInternalMerger` path.

Also extends `Materialized` and `Clusterd` mzcompose services to accept `memory_swap` and `mem_swappiness`, so callers can configure container-level swap behavior independent of the batcher.

Adds three pieces of validation tooling for the column-paged merge batcher: a criterion microbench, an end-to-end timely example, and feature-benchmark scenarios.

Criterion bench (`src/timely-util/benches/columnar_merge_batcher.rs`): compares the legacy `ColumnMerger` against the new path with disabled / swap / lz4 pagers across three input regimes (mixed, collisions, disjoint) and four cache-tier sizes. Prints a throughput summary table when the group finishes. Good for per-chunk-merge perf comparisons; doesn't exercise the dataflow operator graph.

End-to-end example (`src/timely-util/examples/column_paged_spill.rs`): drives `arrange_core` over a cancellation workload (positives + negatives at the same time, so the spine stays empty and all pressure lives in the merge-batcher). Configurable workers / records / budget; back-to-back baseline + spill modes; optional RSS sampler thread via `ps`. Modeled on `differential-dataflow/examples/columnar_spill.rs` but uses our `Col2ValPagedBatcher` + `ColumnPager` + `TieredPolicy` directly instead of DD's `SpillBatcher`/`Threshold`/`FileSpill` plumbing. `cargo run --release --example column_paged_spill` for a smoke test; see `--help` for sweep options.

Feature-benchmark scenarios (`misc/python/.../scenarios/benchmark_main.py`):

* `DifferentialJoinColumnPaged`: same query shape as `DifferentialJoin`, paged batcher enabled. Measures steady-state overhead vs the legacy path.
* `DifferentialJoinHydrationBaseline` / `DifferentialJoinHydrationFile`: sister leaves of a non-runnable `DifferentialJoinHydration` parent. Each measures the time to re-hydrate a linear-join arrangement after `REPLICATION FACTOR 0 -> 1` toggling. Baseline has the paged batcher off; File enables it with the file backend and `budget_fraction = 0.01` so chunks spill rather than competing with the spine for RAM. Compare under `--this-memory` + `--this-memory-swap` to evaluate user-space spill vs OS swap.

Feature-benchmark CLI plumbing (`test/feature-benchmark/mzcompose.py`): adds `--this-memory`, `--this-memory-swap`, `--this-mem-swappiness` (and `--other-*` companions) so memory caps and swap behavior are configurable per side, plus `--skip-other` for iterating on `this` without the comparison round trip. The benchmark-result evaluator tolerates the single-side case by returning `None` ratios instead of indexing past the end of `_points`.