Dov/column paged merge batcher #36627
Draft
DAlperin wants to merge 33 commits into
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ive matches) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reuse buffers across iterations via iter_custom so allocator cost is paid once at setup. Read one u64 per page after take to force the kernel to actually fault pages in (relevant under memory pressure). 2 MiB single-chunk plus scatter sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
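The "read one u64 per page" step can be sketched as follows. This is a minimal illustration, not the bench's actual code: the 4 KiB page size and the names `touch_pages` and `pages_touched` are assumptions made for the example.

```rust
/// Page size assumed for this sketch; a real bench may query it at runtime.
const PAGE_BYTES: usize = 4096;
/// Number of u64 words that fit in one assumed page.
const WORDS_PER_PAGE: usize = PAGE_BYTES / std::mem::size_of::<u64>();

/// Reads one u64 per page-sized stride of `buf` and folds the values into a
/// checksum. `black_box` keeps the optimizer from discarding the loads, so
/// each stride is actually read and its backing page must be resident.
fn touch_pages(buf: &[u64]) -> u64 {
    let mut acc = 0u64;
    for word in buf.iter().step_by(WORDS_PER_PAGE) {
        acc = acc.wrapping_add(std::hint::black_box(*word));
    }
    acc
}

/// Number of reads `touch_pages` performs for a buffer of `len` words.
fn pages_touched(len: usize) -> usize {
    len.div_ceil(WORDS_PER_PAGE)
}
```

For a 2 MiB chunk (262,144 words) this performs 512 reads, one per 4 KiB stride, which is far cheaper than scanning the whole chunk while still forcing every page to be faulted in.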
Builds two chains of 2 MiB chunks then performs a merge pass that reads every cache line of both inputs and emits a new chain. Designed to be run under systemd-run with MemoryMax to simulate a working set that exceeds RAM, exposing real swap-eviction or disk-I/O cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `prefetch(&Handle)` and `prefetch_at(&Handle, offset, len)` to let callers overlap the next chunk's I/O with processing of the current chunk. The swap backend issues `MADV_WILLNEED`; the file backend opens the scratch file briefly and issues `posix_fadvise(POSIX_FADV_WILLNEED)`. Both calls kick off asynchronous kernel work and return promptly. The merge example now prefetches one chunk ahead. With a 32 GiB working set and a 16 GiB cap on ext4, file-backend merge drops from 47.7 s to 45.2 s. Swap-backend merge is unchanged at ~141 s because under that much pressure the kernel is reclaim-bound, not stall-bound.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per project policy, raw `as` is disallowed in favor of mz_ore::cast::CastFrom, mz_ore::cast::CastLossy, or std::convert::TryFrom. The pager's pointer-arithmetic paths now use stable `*const T::addr()` and `byte_add` to keep provenance, with `cast::<U>()` and `cast_mut()` replacing pointer-type `as` casts. FFI integer arguments now go through `try_from` with explicit panics on overflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
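The cast-free style described above might look like the sketch below. It uses only standard-library pointer methods (`byte_add`, `cast`, `addr`) and `i64::try_from`; the function names are hypothetical and `i64` stands in for whatever integer type a given FFI argument actually needs.

```rust
/// Returns a pointer `offset` bytes past `base`, keeping `base`'s provenance.
///
/// # Safety
/// `base + offset` must stay within, or one past the end of, the allocation
/// that `base` points into.
unsafe fn advance(base: *const u64, offset: usize) -> *const u8 {
    // `byte_add` offsets in bytes without a detour through an integer, and
    // `cast` changes the pointee type without a pointer-type `as` cast.
    unsafe { base.byte_add(offset).cast::<u8>() }
}

/// Whether the pointer's address is a multiple of `align`.
/// `addr()` exposes the address only, so no pointer-to-integer `as` is needed.
fn is_aligned_to(ptr: *const u8, align: usize) -> bool {
    ptr.addr() % align == 0
}

/// Converts a length to a signed 64-bit FFI argument (hypothetical stand-in
/// for an `off_t`-like parameter), panicking explicitly on overflow.
fn ffi_len(len: usize) -> i64 {
    i64::try_from(len).expect("length exceeds i64::MAX")
}
```

The explicit `expect` makes an out-of-range length fail loudly at the call site instead of silently truncating the way `len as i64` could.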
Each worker gets a 1/threads share of the total chain, so the working set stays constant across thread counts. With cap = 16 GiB and total chain = 32 GiB: the file backend speeds up at 2 threads (64 -> 46 s, ~1.4x), regresses at 4, and recovers at 8; the swap backend cuts wall time by about 40% at 4 threads (215 -> 127 s) because kernel reclaim overlaps with other workers' compute.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oughput and perf data
Adds a section that captures the swap-vs-file trade-off as actually observed: file saturates the disk (1.47 GiB/s on encrypted NVMe), while swap floors at ~0.36 GiB/s regardless of cap or parallelism. `perf stat` plus `/proc/vmstat` deltas show swap spends ~7x more sys-time than file because every 4 KiB readback page-faults synchronously on the user thread (5.2M minor-faults vs 4K, 2.1M pswpin vs 2.2K). Operational guidance: swap when resident, file when spilling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without `required-features` cargo tries to build the example with the default feature set, where `#![cfg(feature = "pager")]` strips the entire file and leaves no `main`. Declare the feature requirement so the example is skipped when the feature is off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measured ~5% improvement on the file path at depth 16, within run-to-run variance, and zero on the swap path. Not worth the API surface for v1. Kernel readahead handles the file path adequately; swap is reclaim-bound under pressure and prefetch can't help. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…less-of-parallelism claim
The earlier "swap caps at ~0.36 GiB/s regardless of cap or parallelism" headline holds only at low thread counts. On a 64 vCPU box with two striped local NVMes, swap-backend merge scales 13× from 1 → 64 threads and reaches ~75% of file-backend throughput, because enough independent direct-reclaim contexts run in parallel to keep the swap stripe nearly busy.
Reorganize the operational characteristics section into two benches, encrypted NVMe (1.4 GB/s ceiling) and r8gd.16xlarge with striped instance NVMe (~7 GB/s ceiling), and add file-backend (1 TiB / cap 256G) and swap-backend (128 GiB / cap 32G) thread-scaling tables for the second. Operational guidance now distinguishes low-thread (file wins ~3–5×) from high-thread (within ~25%) regimes and calls out the multi-tenant RSS argument as a separate reason to prefer file regardless of throughput.
Drop the dead `--prefetch-depth 4` reference; that flag was removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `pageout_with(backend, chunks)` alongside `pageout`. Lets callers select the backend per call instead of going through the global atomic, so layered consumers (next commit's column-pager) can route swap and file pageouts independently without racing other writers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
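The relationship between the two entry points can be sketched as below. Only the names `pageout` and `pageout_with` come from the commit message; the `Backend` enum, the `Handle` fields, the chunk type, and the global's encoding are assumptions for illustration, and no bytes are actually written anywhere.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical backend selector.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Swap,
    File,
}

/// Process-wide default backend, encoded as 0 = Swap, 1 = File (assumed).
static GLOBAL_BACKEND: AtomicU8 = AtomicU8::new(0);

/// Stand-in handle that records where the bytes went.
struct Handle {
    backend: Backend,
    len_bytes: usize,
}

/// Pages out `chunks` through an explicitly chosen backend. The global is
/// never read, so concurrent writers to it cannot affect this call.
fn pageout_with(backend: Backend, chunks: &mut Vec<Vec<u64>>) -> Handle {
    let len_bytes = chunks.iter().map(|c| c.len() * 8).sum();
    chunks.clear(); // the real pager would hand the bytes to the backend here
    Handle { backend, len_bytes }
}

/// Pages out `chunks` through whatever backend the global currently names.
fn pageout(chunks: &mut Vec<Vec<u64>>) -> Handle {
    let backend = match GLOBAL_BACKEND.load(Ordering::Relaxed) {
        0 => Backend::Swap,
        _ => Backend::File,
    };
    pageout_with(backend, chunks)
}
```

In this shape a layered consumer can call `pageout_with(Backend::File, ...)` for one column and `pageout_with(Backend::Swap, ...)` for the next without ever touching `GLOBAL_BACKEND`.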
Bridges `mz_ore::pager` to typed `Column<C>` via `ContainerBytes`.
Callers drain a column into a `PagedColumn<C>` and rehydrate on demand;
backend and compression are decided per call by an injected
`PagingPolicy`, not the pager's global atomic.
Three resting variants cover the matrix:
* `Resident(Column<C>)` — policy returned `Skip`.
* `Paged { handle, meta }` — raw u64-aligned bytes via `pager::Handle`.
* `Compressed { inner, meta }` — lz4 frame; bytes live either in memory
or in a `pager::Handle` (padded to u64).
Fast paths:
* `Column::Align(Vec<u64>)` uncompressed — moves the body Vec into the
handle, no copy on the swap backend.
* Compressed — `FrameEncoder` wraps the target so `into_bytes` streams
serialized bytes straight through lz4 with no uncompressed staging.
* Compressed file — the frame trailer self-delimits, so no
`compressed_len` field and no unpad on read.
Tests cover skip, swap/file × uncompressed/lz4 round trips, and the
align-variant fast path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
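The three resting variants in the commit above can be modeled with a small enum. This is a simplified sketch, not the crate's types: columns are plain `Vec<u64>`, the paged handle is just the moved `Vec`, and the "compressed" variant stores little-endian bytes as a stand-in for an lz4 frame (no compression is performed). `Decision`, `Paged`, `page`, and `take` are hypothetical names.

```rust
/// Hypothetical policy verdict for one column.
enum Decision {
    Skip,
    Page,
    Compress,
}

/// Simplified resting states for a drained column.
enum Paged {
    /// Policy returned `Skip`; the column stays in memory untouched.
    Resident(Vec<u64>),
    /// Raw u64-aligned body owned by a (stand-in) pager handle.
    Paged { handle: Vec<u64>, len_bytes: usize },
    /// Encoded bytes; a real implementation would hold an lz4 frame here.
    Compressed { inner: Vec<u8>, len_bytes: usize },
}

/// Drains `column` into its resting state according to `decision`.
fn page(column: Vec<u64>, decision: Decision) -> Paged {
    let len_bytes = column.len() * 8;
    match decision {
        Decision::Skip => Paged::Resident(column),
        // Fast path: the body Vec is moved into the handle, not copied.
        Decision::Page => Paged::Paged { handle: column, len_bytes },
        Decision::Compress => {
            let inner = column.iter().flat_map(|w| w.to_le_bytes()).collect();
            Paged::Compressed { inner, len_bytes }
        }
    }
}

/// Rehydrates the original column from any resting state.
fn take(paged: Paged) -> Vec<u64> {
    match paged {
        Paged::Resident(column) => column,
        Paged::Paged { handle, .. } => handle,
        Paged::Compressed { inner, .. } => inner
            .chunks_exact(8)
            .map(|chunk| {
                let mut word = [0u8; 8];
                word.copy_from_slice(chunk);
                u64::from_le_bytes(word)
            })
            .collect(),
    }
}
```

Whatever the decision, `take(page(column, decision))` returns the original column, which is the round-trip property the tests in the commit exercise across the backend and codec matrix.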
Adds `ResidentTicket`, a drop guard carried inside `PagedColumn::Resident`
that fires a new `PageEvent::ResidentReleased { bytes }` when the resident
column is consumed via `ColumnPager::take` or dropped without being
taken. Lets policies track outstanding resident memory without leaking
budget if a caller drops a column unexpectedly.
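The drop-guard idea can be sketched as follows. Only the name `ResidentTicket` comes from the commit message; `ResidentLedger`, `ResidentColumn`, and `take` are hypothetical, and the ledger update stands in for firing `PageEvent::ResidentReleased { bytes }`.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical policy-side count of outstanding resident bytes.
#[derive(Default)]
struct ResidentLedger {
    outstanding: Cell<usize>,
}

/// Drop guard: charges the ledger on creation and refunds it on drop.
struct ResidentTicket {
    ledger: Rc<ResidentLedger>,
    bytes: usize,
}

impl ResidentTicket {
    fn new(ledger: &Rc<ResidentLedger>, bytes: usize) -> Self {
        ledger.outstanding.set(ledger.outstanding.get() + bytes);
        ResidentTicket { ledger: Rc::clone(ledger), bytes }
    }
}

impl Drop for ResidentTicket {
    fn drop(&mut self) {
        // Stand-in for emitting `PageEvent::ResidentReleased { bytes }`.
        let left = self.ledger.outstanding.get() - self.bytes;
        self.ledger.outstanding.set(left);
    }
}

/// A resident column that carries its ticket alongside the data.
struct ResidentColumn {
    data: Vec<u64>,
    _ticket: ResidentTicket,
}

/// Consumes the column. Moving `data` out leaves the ticket behind, and the
/// ticket is dropped when `column` goes out of scope, refunding the budget.
fn take(column: ResidentColumn) -> Vec<u64> {
    column.data
}
```

Because the refund lives in `Drop`, both paths named above (consumed via `take`, or dropped without being taken) release the same number of bytes exactly once.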
Introduces `TieredPolicy` in `column_pager::policy`. Each Timely worker
draws from a fixed per-worker byte budget; once exhausted it falls back
to a process-wide shared pool, and only when both are full does it page
out via a configured backend and codec. Per-worker state lives in a
`thread_local!` static so worker threads see independent counters. This
limits the design to one `TieredPolicy` per process — sufficient for the
expected configuration, and the constraint is documented.
Release order returns budget to the shared pool first so other workers
unblock sooner. The shared pool is a single `AtomicUsize` consumed via a
CAS loop; only the cold fallback path touches it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
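The tiered admission described in the commit above can be sketched as follows. This is a minimal model under stated assumptions, not the `TieredPolicy` implementation: the budgets are tiny constants, and `admit`, `release`, `Tier`, and the static names are hypothetical. It keeps the three properties the message names: per-worker state in `thread_local!` storage, a shared `AtomicUsize` pool consumed via a CAS loop on the fallback path only, and release refunding the shared pool first.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Assumed per-worker budget in bytes (tiny, for illustration).
const PER_WORKER_BUDGET: usize = 1024;
/// Assumed process-wide shared pool: bytes still available.
static SHARED_REMAINING: AtomicUsize = AtomicUsize::new(2048);

thread_local! {
    /// Bytes this worker holds against its own budget.
    static LOCAL_USED: Cell<usize> = Cell::new(0);
    /// Bytes this worker has borrowed from the shared pool.
    static SHARED_BORROWED: Cell<usize> = Cell::new(0);
}

#[derive(Debug, PartialEq, Eq)]
enum Tier {
    Worker,
    Shared,
    PageOut,
}

/// CAS loop: take `bytes` from the shared pool only if enough remain.
fn try_reserve_shared(bytes: usize) -> bool {
    let mut current = SHARED_REMAINING.load(Ordering::Relaxed);
    loop {
        if current < bytes {
            return false;
        }
        match SHARED_REMAINING.compare_exchange_weak(
            current,
            current - bytes,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return true,
            Err(observed) => current = observed,
        }
    }
}

/// Decides where `bytes` may live: worker budget, shared pool, or paged out.
fn admit(bytes: usize) -> Tier {
    let fits_local = LOCAL_USED.with(|used| match used.get().checked_add(bytes) {
        Some(next) if next <= PER_WORKER_BUDGET => {
            used.set(next);
            true
        }
        _ => false,
    });
    if fits_local {
        return Tier::Worker;
    }
    // Cold fallback path: the only place that touches the shared atomic.
    if try_reserve_shared(bytes) {
        SHARED_BORROWED.with(|b| b.set(b.get() + bytes));
        return Tier::Shared;
    }
    Tier::PageOut
}

/// Returns `bytes` of budget, refunding the shared pool before the worker's.
fn release(bytes: usize) {
    let to_shared = SHARED_BORROWED.with(|b| {
        let give = bytes.min(b.get());
        b.set(b.get() - give);
        give
    });
    if to_shared > 0 {
        SHARED_REMAINING.fetch_add(to_shared, Ordering::AcqRel);
    }
    LOCAL_USED.with(|u| u.set(u.get().saturating_sub(bytes - to_shared)));
}
```

Because the worker counters are thread-local statics, every policy instance in the process would see the same counters, which is the "one `TieredPolicy` per process" constraint the commit message documents.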
Measures round-trip (`page` + `take`) and operator-loop (`page` with
column reuse) throughput across three axes: column size (4 KiB, 256 KiB,
4 MiB), pager backend (Swap, File), and codec (uncompressed, lz4). 24
cases total, throughput reported in bytes/sec via Criterion's `Throughput`.
Run with:
cargo bench -p mz-timely-util --bench column_pager
The bench uses an `AlwaysPage` stub policy so every iteration exercises
the paging path rather than the resident fast path. Smoke-tested at
4 KiB/swap/raw at ~8.6 GiB/s on a development laptop, which is close to
the underlying pager's memcpy ceiling and confirms the column-pager
layer adds no measurable overhead at that size.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pager's swap backend keeps the body Vec resident and hints MADV_COLD; the kernel evicts only under memory pressure. The column_pager bench round-trips one column at a time and never builds enough working set to trigger eviction, so swap-backend numbers measure the in-memory fast path (Vec move + bookkeeping), not the cost of a page-in from disk. Relabel the axis as `swap-warm` to make the distinction visible in every measurement name, and add a module-level caveat explaining what the numbers do and don't represent. A follow-up `column_pager_pressure` bench under `systemd-run --user --scope -p MemoryMax=...` will exercise the real eviction path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Declares lz4_flex in `[workspace.dependencies]` so `mz-timely-util`'s `lz4_flex.workspace = true` resolves. Required by the column-paged merge batcher's optional lz4 codec; the dep was referenced before being declared and broke workspace loading.
Adds a Materialize-private merge-batcher that routes per-chunk transient state through `ColumnPager`, bounding the resident-bytes peak under memory pressure. Behind `enable_column_paged_batcher` (default off).
Three building blocks in `mz-timely-util`:
* `ColumnMergeBatcher` + `merge_chains` + `extract_chain` in `columnar/merge_batcher.rs`: chains hold `PagedColumn` entries that resolve to disk on demand. Reuses the existing `Column::merge_from` / `Column::extract` building blocks.
* `BuilderInput for Column<((K, V), T, R)>` so DD `OrdValBuilder` can consume the batcher's output without a container conversion.
* `column_pager` gains a process-global pager singleton (matching the lower-level pager's global-atomic design) and a per-decision skip/page counter for diagnostics.
Compute integration:
* `RowRowColPagedBuilder` alias + `PartialEq<&RowRef> for DatumSeq` / `PushInto<&RowRef> for DatumContainer` so the Row-keyed arrange path type-checks.
* Worker init in `apply_worker_config` reads three new dyncfgs and installs the process-global pager: `enable_column_paged_batcher` (on/off), `column_paged_batcher_backend` (`swap` | `file`), `column_paged_batcher_budget_fraction` (fraction of replica memory, default 5%). Per-worker / shared pool sizes derive from `memory_limiter::get_memory_limit` with sensible floors and caps.
* Two arrange call sites switched to the paged path: `render/context.rs::arrange_collection` (central ArrangeBy) and `render/join/linear_join.rs::JoinStage`. Other arrange sites (logging) left on the legacy `ColInternalMerger` path.
Also extends `Materialized` and `Clusterd` mzcompose services to accept `memory_swap` and `mem_swappiness`, so callers can configure container-level swap behavior independent of the batcher.
Adds three pieces of validation tooling for the column-paged merge batcher: a criterion microbench, an end-to-end timely example, and feature-benchmark scenarios.
Criterion bench (`src/timely-util/benches/columnar_merge_batcher.rs`): compares the legacy `ColumnMerger` against the new path with disabled / swap / lz4 pagers across three input regimes (mixed, collisions, disjoint) and four cache-tier sizes. Prints a throughput summary table when the group finishes. Good for per-chunk-merge perf comparisons; doesn't exercise the dataflow operator graph.
End-to-end example (`src/timely-util/examples/column_paged_spill.rs`): drives `arrange_core` over a cancellation workload (positives + negatives at the same time, so the spine stays empty and all pressure lives in the merge-batcher). Configurable workers / records / budget; back-to-back baseline + spill modes; optional RSS sampler thread via `ps`. Modeled on `differential-dataflow/examples/columnar_spill.rs` but uses our `Col2ValPagedBatcher` + `ColumnPager` + `TieredPolicy` directly instead of DD's `SpillBatcher`/`Threshold`/`FileSpill` plumbing. `cargo run --release --example column_paged_spill` for a smoke test; see `--help` for sweep options.
Feature-benchmark scenarios (`misc/python/.../scenarios/benchmark_main.py`):
* `DifferentialJoinColumnPaged`: same query shape as `DifferentialJoin`, paged batcher enabled. Measures steady-state overhead vs the legacy path.
* `DifferentialJoinHydrationBaseline` / `DifferentialJoinHydrationFile`: sister leaves of a non-runnable `DifferentialJoinHydration` parent. Each measures the time to re-hydrate a linear-join arrangement after `REPLICATION FACTOR 0 -> 1` toggling. Baseline has the paged batcher off; File enables it with the file backend and `budget_fraction = 0.01` so chunks spill rather than competing with the spine for RAM. Compare under `--this-memory` + `--this-memory-swap` to evaluate user-space spill vs OS swap.
Feature-benchmark CLI plumbing (`test/feature-benchmark/mzcompose.py`): adds `--this-memory`, `--this-memory-swap`, `--this-mem-swappiness` (and `--other-*` companions) so memory caps and swap behavior are configurable per side, plus `--skip-other` for iterating on `this` without the comparison round trip. The benchmark-result evaluator tolerates the single-side case by returning `None` ratios instead of indexing past the end of `_points`.
Stacked on #36552 which in turn is stacked on #36391