DNM: envd scalability work #36634

Draft
aljoscha wants to merge 102 commits into
MaterializeInc:main from
aljoscha:envd-ddl-scalability

@aljoscha
Contributor

For running spec sheet

aljoscha and others added 30 commits May 13, 2026 11:30
The existing scenarios scale cluster size or envd CPU cores -- nothing
measures how adapter/envd latency moves as the catalog itself grows. Add
two scenarios under a new `envd_scalability` group that fix the
measurement cluster and vary the number of catalog objects.

`envd_scalability_tables` puts N empty tables in the catalog -- pure
catalog/adapter pressure, no controller load. `envd_scalability_mvs`
does N materialized views over a single 1-row base table -- same
catalog footprint, plus controller load proportional to N. The MV
scenario shards across single-replica pad clusters at 10000 MVs per
cluster (so 100k MVs spans 10 clusters), since one cluster can't
reasonably host that many dataflows.

For each N in {1, 10, 100, 1k, 3k, 5k, 10k, 20k, 30k, 50k, 100k} we run
10 reps each of `CREATE TABLE` (DDL through the coordinator) and
`SELECT * FROM <1-row table>` (a simple peek on a fixed 100cc cluster).
The catalog is built incrementally across size points, so going from
N=k to the next size point only adds (next - k) objects -- otherwise
we'd pay an O(sizes * N) build cost. The size list is overridable via
`--envd-scalability-sizes` for scaffolding runs.
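
The incremental build described above can be sketched as follows. This is a minimal illustration rather than the harness code: `incremental_build_plan` is a hypothetical helper, and the size list is the one quoted in this message.

```python
from typing import Iterable


def incremental_build_plan(sizes: Iterable[int]) -> list[tuple[int, int]]:
    """Return (target_n, objects_to_add) for each size point.

    Hypothetical helper: the catalog is assumed to still hold the objects
    from the previous size point, so only the difference is created.
    """
    plan = []
    current = 0
    for target in sorted(sizes):
        plan.append((target, target - current))
        current = target
    return plan


SIZES = [1, 10, 100, 1_000, 3_000, 5_000, 10_000, 20_000, 30_000, 50_000, 100_000]
plan = incremental_build_plan(SIZES)

# Incremental build: total creates equal the largest size point.
incremental_cost = sum(added for _, added in plan)
# Rebuilding from scratch at every size point pays the sum of all sizes.
rebuild_cost = sum(SIZES)
```

With this size list the incremental plan issues 100,000 creates in total, while a rebuild per size point would issue more than twice as many.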

Results land in a third CSV (`*.envd_scalability.csv`) reusing the
cluster CSV schema; `mode='envd_scalability'` distinguishes the rows.
Test analytics rides on the existing `cluster_spec_sheet_result` table
-- no schema change needed. The analyzer plots `time_ms` vs N per
(scenario, category, test_name).

This is going to be long-running, especially the MV scenario where each
create exercises the controller -- expect hours for the full size
range.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two new scenarios -- cluster_object_limits_indexes and
cluster_object_limits_mvs -- that find, per cluster size, the maximum
number of idle materializations one cluster can keep fresh.

The materializations are derived from a one-row, never-updated base
table so the only work the cluster has to do is keep advancing each
materialization's write_frontier in step with the upstream table. Once
the cluster can't keep up, freshness collapses; the driver records the
largest N at which `max(local_lag) < 2s` was still achievable, with the
unhealthy data point recorded too so the cliff is visible.

Staging-only (rejects --target=cloud-production), to avoid burning
production resources on long object-limit searches.
…lability default at 50k

When a materialization stalls completely (write_frontier never advances
past the minimum timestamp), `mz_internal.mz_materialization_lag` reports
`now() - 0` = current unix time in ms (~1.78e12). Recorded as-is this
crushes every healthy data point to ~0 on the plot. Cap the recorded
value at 10x the healthy threshold (= 20 s), preserve the underlying
truth via the `healthy` column, and label the plot to make the cap and
healthy threshold explicit.
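
A minimal sketch of the capping rule, assuming the 2 s healthy threshold mentioned earlier. The constant and function names are illustrative, not the names used in the harness.

```python
HEALTHY_THRESHOLD_MS = 2_000             # max(local_lag) below 2 s counts as healthy
LAG_CAP_MS = 10 * HEALTHY_THRESHOLD_MS   # recorded values are capped at 20 s


def record_lag(lag_ms: float) -> tuple[float, bool]:
    """Return (value to record, healthy flag) for one measured lag.

    The healthy flag is computed from the uncapped value, so capping the
    recorded number never hides an unhealthy probe.
    """
    healthy = lag_ms < HEALTHY_THRESHOLD_MS
    return min(lag_ms, LAG_CAP_MS), healthy
```

A fully stalled materialization (lag around 1.78e12 ms) is then recorded as 20,000 ms with `healthy = False`, which keeps the healthy points readable on the same axis.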

Also drop 100_000 from the envd_scalability default size list: 50_000 is
a more sensible default ceiling for staging. The full size list is still
overridable via --envd-scalability-sizes for ad-hoc runs.
…tion

The release-qualification pipeline already runs three cluster-spec-sheet
groups (cluster_compute on production, source_ingestion on production,
environmentd on staging). Add two more groups -- envd_scalability and
cluster_object_limits -- both running against staging, since both push
the catalog / cluster to limits we don't want to exercise on production.

The three "envd / cluster" groups in the cluster-spec-sheet were named
inconsistently. Settle on the three concept names the cluster-spec-sheet
effort uses verbally:

  environmentd          -> envd_qps_scalability     (QPS vs envd CPU)
  envd_scalability      -> envd_objects_scalability (latency vs catalog N)
  cluster_object_limits -> cluster_object_limits    (unchanged)

Renames apply to: scenario constants, scenario-name string values, group
keys in SCENARIO_GROUPS, class names, the run/analyze function names,
the --envd-scalability-sizes CLI flag, the result CSV suffix, and the
`mode` field written into CSV rows. The pre-existing QPS scenarios keep
their individual `*_envd_strong_scaling` names since only the group is
renamed.

Also updates the release-qualification pipeline step ids/args and the
README to match.
…w start

When debugging cluster-spec-sheet runs on staging it's hard to tell which
environment we're actually talking to and whether the system parameter
defaults we expect (lifted via LaunchDarkly or similar) are actually
applied. Add a one-shot diagnostic right after target.initialize() that
prints mz_environment_id() and SHOWs the limits the test depends on
(max_tables, max_materialized_views, max_objects_per_schema, max_clusters,
max_credit_consumption_rate, memory_limiter_interval).

Best-effort: any probe error is logged and swallowed so a transient
failure does not abort the workflow.

psycopg3's execute() requires a LiteralString, so the f-string SHOW
query tripped pyright in CI. Compose the statement with
psycopg.sql.SQL/Identifier instead, matching the pattern already used in
test/orchestratord/mzcompose.py.

A staging run of `envd_objects_scalability_mvs` (release-qualification
build 1219) aborted at N≈19800 with:

    Retryable error: consuming input failed: SSL error: unexpected eof
    while reading, reconnecting...
    psycopg.errors.InternalError_: materialized view
    "materialize.pad_schema.pad_mv_19805" already exists

The TLS connection dropped mid-statement; envd had already committed the
CREATE but the response was lost. ConnectionHandler.retryable reconnects
and replays the same statement, which then fails with "already exists".

Use ``CREATE ... IF NOT EXISTS`` for every CREATE issued via _bulk_run so
the retry is a no-op. Affects the bulk-creation paths in both
envd_objects_scalability scenarios (tables, MVs) and both
cluster_object_limits scenarios (indexed views, MVs). Add a docstring on
_bulk_run spelling out the idempotency requirement so future CREATEs
don't reintroduce the hazard.
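
The failure mode and the fix can be reproduced in miniature. The sketch below uses an in-memory SQLite database as a stand-in for Materialize (an assumption made so the example is self-contained), and the table names are hypothetical. It replays each statement once, the way a reconnecting client would after losing the response to a committed CREATE.

```python
import sqlite3


def run_with_lost_response(conn: sqlite3.Connection, statement: str) -> None:
    """Execute `statement`, then replay it as a reconnecting client would.

    The first execution stands in for a CREATE that the server committed
    before the connection dropped; the second is the retry.
    """
    conn.execute(statement)  # committed, but the response is "lost"
    conn.execute(statement)  # replay after reconnect


conn = sqlite3.connect(":memory:")

# A plain CREATE is not idempotent: the replay fails.
try:
    run_with_lost_response(conn, "CREATE TABLE pad_t_1 (a INT)")
    plain_create_error = None
except sqlite3.OperationalError as e:
    plain_create_error = str(e)

# With IF NOT EXISTS the replay is a no-op.
run_with_lost_response(conn, "CREATE TABLE IF NOT EXISTS pad_t_2 (a INT)")

tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)}
```

Both tables exist afterwards; only the plain CREATE surfaced an error on replay.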

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 50k scale point pushes a single Envd Objects Scalability run past
the 13-hour mark on staging — adapter latency degrades so much by then
that each measurement repetition takes several seconds, and the catalog
build itself runs at <1/s. 30k is where the interesting signal already
lives. Drop 50k from the default list; ad-hoc runs that want it can
still pass --envd-objects-scalability-sizes explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits N-list defaults to a +1k linear step past
N=1000, which is too coarse: a run on staging showed the cliff sits in
(1000, 2000] for cluster_object_limits_indexes across every cluster
size 100cc..1600cc, and we can't tell from that whether the limit is
1100 or 1900.

After the coarse N-walk hits its first unhealthy point, bisect the
(last_healthy, first_unhealthy) interval
--cluster-object-limits-bisect-steps times (default 4) and probe each
midpoint. The bisection step adds or drops objects in place and never
rebuilds the catalog, so the cost is only ~bisect_steps extra
hydrate-and-probe rounds per cluster size. With the default 4 steps, a
1000-object interval narrows the cliff to within ~60 objects
(1000 / 2^4 ≈ 62).

Adds:
- `remove_objects(target_n)` symmetric to `add_objects(target_n)` on
  both ClusterObjectLimitsScenario subclasses. Indexes scenario drops
  via DROP VIEW ... CASCADE (cascades to the default index); MVs
  scenario drops via DROP MATERIALIZED VIEW.
- `--cluster-object-limits-bisect-steps` CLI flag plumbed through to
  `run_scenario_cluster_object_limits`.
- Bisection block in the per-cluster-size loop that calls add+remove
  (one is a no-op) and records each probe under the same CSV schema, so
  the existing freshness-lag-vs-N plot just gets denser near the cliff.
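
A minimal sketch of the bisection, under the assumption that health is monotonic in N (healthy below the cliff, unhealthy above). `bisect_cliff` and its `is_healthy` callback are hypothetical names; in the harness each probe would add or drop objects and then run a hydrate-and-probe round.

```python
from typing import Callable


def bisect_cliff(
    last_healthy: int,
    first_unhealthy: int,
    steps: int,
    is_healthy: Callable[[int], bool],
) -> tuple[int, int]:
    """Narrow (last_healthy, first_unhealthy) by probing midpoints."""
    lo, hi = last_healthy, first_unhealthy
    for _ in range(steps):
        if hi - lo <= 1:
            break  # nothing left to bisect
        mid = (lo + hi) // 2
        if is_healthy(mid):
            lo = mid
        else:
            hi = mid
    return lo, hi


# Hypothetical cluster whose true limit is 1437 objects.
lo, hi = bisect_cliff(1000, 2000, steps=4, is_healthy=lambda n: n <= 1437)
```

Starting from (1000, 2000), four steps probe 1500, 1250, 1375 and 1437 and end with the interval (1437, 1500).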

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If CREATE CLUSTER fails for a cluster_object_limits size — because the
target region doesn't expose that replica size, or because allocating
the cluster would exceed max_credit_consumption_rate — today the
scenario either aborts with a traceback or (when the cluster is created
but then can't actually keep up) reports a confusing "unhealthy at
N=100" data point.

Catch psycopg.errors.DatabaseError around the CREATE CLUSTER, log a
clear "size unavailable" line (with the underlying error class +
message), and `continue` to the next cluster size. OperationalError is
re-raised so that genuine connection failures (which run_query's retry
loop has already given up on) aren't silently masked as a size problem.
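
The exception handling can be sketched like this. To keep the example runnable without a database driver, it uses the `sqlite3` exception classes as stand-ins; they follow the same DB-API hierarchy, in which `OperationalError` is a subclass of `DatabaseError`. `usable_sizes` and the two fake `create_cluster` callbacks are hypothetical.

```python
import sqlite3
from typing import Callable, Iterable

# Stand-ins for the driver's DB-API exception classes.
DatabaseError = sqlite3.DatabaseError
OperationalError = sqlite3.OperationalError


def usable_sizes(
    sizes: Iterable[str], create_cluster: Callable[[str], None]
) -> list[str]:
    """Return the sizes for which CREATE CLUSTER succeeded."""
    created = []
    for size in sizes:
        try:
            create_cluster(size)
        except OperationalError:
            # Connection-level failure: do not mask it as a size problem.
            raise
        except DatabaseError as e:
            print(f"size {size} unavailable: {type(e).__name__}: {e}")
            continue
        created.append(size)
    return created


def fake_create_missing_size(size: str) -> None:
    if size == "3200cc":
        raise sqlite3.ProgrammingError("unknown cluster replica size")


def fake_create_connection_lost(size: str) -> None:
    raise sqlite3.OperationalError("connection lost")
```

The `OperationalError` clause has to come first: because it is a subclass, a bare `except DatabaseError` would swallow it too.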

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits plots (max-healthy-N bars + lag-vs-N legend)
ordered cluster sizes alphanumerically — "100cc, 1600cc, 200cc, 3200cc,
400cc, 800cc" — making the small→large progression unreadable.

Lift the existing `extract_cluster_size` helper to module scope and use
it to reindex the bar plot's index and reorder the line plot's columns.
The cluster-results path was already using it for its x-axis, so the
extraction is just hoisted, not duplicated.
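
For illustration, a numeric sort key along these lines fixes the ordering. The body of `extract_cluster_size` below is an assumption about what the existing helper does, not a copy of it.

```python
import re


def extract_cluster_size(name: str) -> int:
    """Pull the numeric part out of a size such as '1600cc' (assumed format)."""
    match = re.search(r"(\d+)cc", name)
    if match is None:
        raise ValueError(f"no cluster size in {name!r}")
    return int(match.group(1))


sizes = ["100cc", "1600cc", "200cc", "3200cc", "400cc", "800cc"]
ordered = sorted(sizes, key=extract_cluster_size)
```

The same key can drive a DataFrame `reindex` for the bar plot or a column reorder for the line plot.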

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In build 1223 (release-qualification), 1600cc/3200cc showed implausibly
small max-healthy-N — 1600cc reported 0 healthy indexes / 93 healthy MVs
where 100cc–800cc routinely handled 1500+ indexes and 687+ MVs. Probing
the first N (=100) on a freshly-created 1600cc cluster returned
local_lag values of 90+ seconds for indexes and ~unix-epoch-ms for MVs
(i.e. write_frontier stuck at zero). Once that first probe declared the
cluster unhealthy, the bisect could not recover: each successive sample
measured more accumulated lag, not less, because the cluster never got
a chance to settle.

Likely cause is cold-start: bigger replicas take longer to begin
serving frontiers after CREATE CLUSTER + bulk DDL, and the 60s
hydration window expires before steady state. Bump it to 300s as a
first diagnostic — if 1600cc/3200cc now look healthy at reasonable N,
this confirms the hypothesis and we can keep the higher timeout (or
make it size-dependent). If they still look broken, the issue is
elsewhere (provisioning, multi-process replica semantics, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ult cap

The envd_objects_scalability default cap was reduced from 100k to 30k in
bdb6607 but two comments still referenced the older shape. Update the
system-parameter rationale to describe the headroom relationship between
the lifted ceilings (200k) and the user-configurable cap (default 30k),
and update the analyzer docstring to match.
…base_t

EnvdObjectsScalabilityMvsScenario called its one-row base table
pad_schema.pad_base while ClusterObjectLimitsScenario called the same
shape table pad_schema.base_t. Use BASE_TABLE = "base_t" consistently.

The new envd_objects_scalability and cluster_object_limits teardown
paths each open-coded a try/except + print("WARNING: failed to drop ...")
block around their DROP statements. Pull the pattern into a single
helper used by all four call sites.

The four DictWriter blocks in workflow_default repeated almost the same
10-field fieldname list. envd_objects_scalability claimed in a comment
to "reuse the cluster-focused schema" but spelled it out anyway, and
cluster_object_limits redeclared the same list plus a single extra
column. Hoist CLUSTER_FIELDNAMES + ENVD_FIELDNAMES and build all four
writers from them via a small _make_csv_writer helper.

The four analyze_*_results_file functions repeated the same six-line
header: print banner, read CSV, empty check, derive base_name, build
plot_dir, mkdir. Pull it into a helper that returns (df, plot_dir) or
None when the file is empty.

hydrate_and_sample's inner probe_once helper wrapped each probe with
SET cluster=<probe>; <select>; SET cluster=c — three round-trips per
probe. Over a 300s hydration window plus 5 steady-state samples that
adds up to ~900 redundant SETs per N. Move the two SETs to a single
try/finally around the whole polling window; the per-probe work is
now just the lag SELECT.
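
The difference in round-trips can be sketched with a fake connection that only counts statements. The cluster name `probe`, the query text, and both function names are placeholders for illustration.

```python
class CountingConnection:
    """Fake connection that only records the statements it is sent."""

    def __init__(self) -> None:
        self.statements: list[str] = []

    def execute(self, statement: str) -> None:
        self.statements.append(statement)


# Illustrative query text, not the harness's actual probe.
LAG_QUERY = "SELECT max(local_lag) FROM mz_internal.mz_materialization_lag"


def poll_per_probe_sets(conn: CountingConnection, probes: int) -> None:
    # Before: every probe switches cluster, queries, and switches back.
    for _ in range(probes):
        conn.execute("SET cluster = probe")
        conn.execute(LAG_QUERY)
        conn.execute("SET cluster = c")


def poll_hoisted_sets(conn: CountingConnection, probes: int) -> None:
    # After: one SET pair brackets the whole polling window.
    conn.execute("SET cluster = probe")
    try:
        for _ in range(probes):
            conn.execute(LAG_QUERY)
    finally:
        conn.execute("SET cluster = c")


before, after = CountingConnection(), CountingConnection()
poll_per_probe_sets(before, probes=10)
poll_hoisted_sets(after, probes=10)
```

The `finally` keeps the session pointed back at `c` even if a probe raises mid-window.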
The "DROP CLUSTER IF EXISTS c CASCADE; CREATE CLUSTER c SIZE ...; SET
cluster = 'c'" sequence appeared verbatim in five run_scenario_*
functions (strong, envd_strong_scaling, envd_objects_scalability,
cluster_object_limits, weak), and the one-row probe-table prep in
three of them. Move both into helpers; the cluster_object_limits
"skip if size unavailable" branch becomes a parameter on the helper
rather than an open-coded try/except at the call site.
…stry

Four workflow_plot_* functions were structurally identical (parser arg,
parse, glob, call one analyzer); the multi-kind workflow_plot did the
same plus a 4-way if/elif suffix dispatch. Replace both shapes with a
shared _plot_files helper that takes either a fixed analyzer
(per-kind workflows) or dispatches via the new SUFFIX_ANALYZERS
registry (the multi-kind workflow_plot). The five workflow_*
functions are now one-call wrappers.
…ario subclasses

The two ClusterObjectLimits scenario subclasses differed only in (a) which
DDL statements create/drop one materialization (CREATE VIEW + CREATE
DEFAULT INDEX vs CREATE MATERIALIZED VIEW), and (b) which catalog table
mz_materialization_lag is joined against. Lift the differences into a
frozen ClusterObjectLimitsKind dataclass carrying create/drop SQL
templates plus the lag-filter join, and have a single
ClusterObjectLimitsScenario class read its kind to drive add_objects /
remove_objects / probe_lag_ms. The two scenarios are now constructed as
ClusterObjectLimitsScenario(CLUSTER_OBJECT_LIMITS_{INDEXES,MVS}_KIND).
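
A rough sketch of what such a kind object could look like. The field names, the SQL templates, and the `lag_join_table` values are assumptions made for illustration; the real dataclass may be shaped differently, and the idempotency clauses discussed earlier are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterObjectLimitsKind:
    name: str
    create_sql: tuple[str, ...]  # statements that create materialization i
    drop_sql: str                # statement that drops materialization i
    lag_join_table: str          # assumed: catalog table joined to the lag view

    def create_statements(self, i: int) -> list[str]:
        return [stmt.format(i=i) for stmt in self.create_sql]

    def drop_statement(self, i: int) -> str:
        return self.drop_sql.format(i=i)


INDEXES_KIND = ClusterObjectLimitsKind(
    name="indexes",
    create_sql=(
        "CREATE VIEW v_{i} AS SELECT * FROM base_t",
        "CREATE DEFAULT INDEX ON v_{i}",
    ),
    drop_sql="DROP VIEW v_{i} CASCADE",
    lag_join_table="mz_indexes",
)

MVS_KIND = ClusterObjectLimitsKind(
    name="mvs",
    create_sql=("CREATE MATERIALIZED VIEW mv_{i} AS SELECT * FROM base_t",),
    drop_sql="DROP MATERIALIZED VIEW mv_{i}",
    lag_join_table="mz_materialized_views",
)
```

A single scenario class can then call `kind.create_statements(i)` and `kind.drop_statement(i)` without branching on the kind.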

EnvdObjectsScalability{Tables,Mvs} are left as separate subclasses: the
MV variant carries pad-cluster sharding state and a distinct init/teardown,
so collapsing them would just hide the structural difference behind
conditionals rather than remove duplication.
…ion_statuses

The freshness probe previously combined "is the dataflow running yet?"
with "is the cluster keeping up?" into a single predicate:

    reporting == N AND max_local_lag_ms < lag_threshold_ms

That meant every unhealthy probe burned the full hydration timeout
(300s in build 1226, see ace6b0f) before declaring failure: the lag
on an overloaded cluster never falls under 2s, so the loop polls to
the deadline and only then captures the lag. In 1226 the 100cc
N=2000, 200cc N=3000, and 1600cc N=2000 probes each sat for ~301s
before recording lag values of 654s–675s. Bisecting an unhealthy
region pays this cost again at every step.

Split into two phases:

  1. Poll `mz_internal.mz_hydration_statuses` until every test object
     on `c` reports `hydrated = true`. This is a definitive per-object
     signal — the dataflow has finished initial snapshotting — and
     converges quickly even on cold-started 1600cc/3200cc replicas.
     Timeout here means the replica is genuinely wedged.

  2. Once hydrated, take the existing `CLUSTER_OBJECT_LIMITS_SAMPLES`
     steady-state lag samples. Unhealthy now means "hydrated but
     can't keep up", which is the property we actually want to
     measure; an overloaded cluster trips the threshold in
     `samples * sample_interval` (~10s) instead of in 300s.

With this decoupling the cold-start argument for the 300s timeout no
longer applies, so drop it back to 60s.
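
The two phases can be sketched as below. The callables, parameter names, and default intervals are assumptions (the 2 s sample interval is inferred from "~10s" for 5 samples); the real `hydrate_and_sample` queries `mz_internal.mz_hydration_statuses` and the lag view directly.

```python
import time
from typing import Callable, Optional


def hydrate_and_sample(
    count_hydrated: Callable[[], int],
    probe_lag_ms: Callable[[], float],
    expected: int,
    timeout_s: float = 60.0,
    samples: int = 5,
    sample_interval_s: float = 2.0,
    poll_interval_s: float = 1.0,
) -> Optional[float]:
    """Return the worst steady-state lag in ms, or None if hydration timed out."""
    # Phase 1: wait until every test object reports hydrated.
    deadline = time.monotonic() + timeout_s
    while count_hydrated() < expected:
        if time.monotonic() >= deadline:
            return None  # the replica never hydrated within the budget
        time.sleep(poll_interval_s)

    # Phase 2: take a fixed number of lag samples. An overloaded cluster
    # shows up here immediately, without waiting for the lag to recover.
    worst = 0.0
    for _ in range(samples):
        worst = max(worst, probe_lag_ms())
        time.sleep(sample_interval_s)
    return worst
```

The caller compares the returned lag against the healthy threshold; `None` is kept distinct so a wedged replica is not confused with an overloaded one.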

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dration budget

In build 1227, 3200cc indexes and 1600cc MVs both produced false
cliffs at N=100: the very first probe after CREATE CLUSTER timed out
with 0/100 hydrated, but every subsequent bisect step (N=50/75/87/93)
hydrated cleanly with lag=0.0. The cluster works fine — the replica
just isn't reporting introspection within 60s of being created on
multi-process sizes.

Thread a per-call `timeout_s` into `hydrate_and_sample` and let
`probe_and_record` pick between the regular and a longer "first
probe" budget. The coarse N-walk passes `first_probe=True` only on
its first iteration, so big-replica cold start gets headroom while
every other probe keeps the tight 60s budget that makes unhealthy
points cheap to record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No behavior changes; just shorter explanations where the original
prose restated the code or expanded beyond what a reader needs:

- MATERIALIZED_ADDITIONAL_SYSTEM_PARAMETER_DEFAULTS rationale: 7 → 3 lines
- _bulk_run docstring: 12 → 5 lines (keep the idempotency warning)
- hydrate_and_sample docstring: 25 → 13 lines (keep the "why split
  the phases" justification)
- probe_lag_ms / probe_hydrated docstrings: drop the tuple-field
  enumeration that duplicates the return type
- collapse the two "framework setup/drop unused" comments
- drop pure-label comments ("Snapshot of cluster sizes", "Outer loop")
  and the init/teardown lines that restate the next statement

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`workflow_default` carried a 150-line if-ladder mapping each `SCENARIO_*`
string to a `run_scenario_*` invocation, plus four parallel sections that
open / upload / archive / analyze one CSV each. Adding a new scenario or
result kind meant touching every one of those.

Replace both with two data registries:

  - `ScenarioSpec` + `SCENARIOS`: name, log label, family, factory lambda,
    groups. `SCENARIOS_BY_NAME` and `SCENARIO_GROUPS` are derived from it,
    so the hand-written `SCENARIOS_CLUSTERD` / `_COMPUTE` / etc. lists go
    away. A `Family` literal + `FAMILY_TO_STREAM` table selects the
    driver, and a small `run_spec` match replaces the if-ladder.

  - `ResultStreamSpec` + `RESULT_STREAMS`: suffix, fieldnames, analyzer,
    uploader. `workflow_default` opens one CSV per spec and the four
    parallel close/upload/artifact/analyze blocks become single loops.
    The old `SUFFIX_ANALYZERS` is now a derived alias.

The scenarios themselves, the four `run_scenario_*` drivers, and the
`Scenario` ABC are unchanged.
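
The derived-registry idea, sketched with four of the scenario names from this PR. The exact field names, the labels, and the placeholder factories are assumptions; only the shape (one list, everything else derived) is the point.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScenarioSpec:
    name: str
    label: str
    family: str
    factory: Callable[[], object]  # placeholder; the real factory builds a scenario
    groups: tuple[str, ...]


SCENARIOS = [
    ScenarioSpec("envd_objects_scalability_tables", "Envd objects (tables)",
                 "envd_objects_scalability", object, ("envd_objects_scalability",)),
    ScenarioSpec("envd_objects_scalability_mvs", "Envd objects (MVs)",
                 "envd_objects_scalability", object, ("envd_objects_scalability",)),
    ScenarioSpec("cluster_object_limits_indexes", "Object limits (indexes)",
                 "cluster_object_limits", object, ("cluster_object_limits",)),
    ScenarioSpec("cluster_object_limits_mvs", "Object limits (MVs)",
                 "cluster_object_limits", object, ("cluster_object_limits",)),
]

# Both lookups are derived, so adding a scenario means adding one list entry.
SCENARIOS_BY_NAME = {spec.name: spec for spec in SCENARIOS}

SCENARIO_GROUPS: dict[str, list[str]] = {}
for spec in SCENARIOS:
    for group in spec.groups:
        SCENARIO_GROUPS.setdefault(group, []).append(spec.name)
```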

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous `Scenario` ABC pretended that all four scenario families
shared a single setup/drop/materialize_views/run lifecycle, but only
strong/weak/envd_qps actually used it. EnvdObjectsScalability and
ClusterObjectLimits returned [] from every ABC method and were driven
through entirely different protocols (init/add_objects/teardown and
reset_for_cluster_size/probe_*/teardown respectively), with comments
apologising that "framework-level setup/drop are unused".

Replace the single ABC with three real shapes:

* `ClusterScalingScenario` (renamed from `Scenario`) for the
  strong/weak/envd_qps families. `drop()` and `materialize_views()` now
  default to `[]` so the envd_qps subclasses no longer need empty
  overrides.
* `EnvdObjectsScalabilityScenario` becomes its own ABC with the methods
  it actually exposes; the unused `replica_size` constructor parameter
  is dropped.
* `ClusterObjectLimitsScenario` becomes a plain class (no inheritance)
  and the unused `replica_size` parameter is dropped.

`ScenarioSpec.factory` now returns the union `AnyScenario`, and
`run_spec` narrows it per-family with isinstance asserts.

With ClusterObjectLimitsScenario no longer pretending to be a generic
Scenario, the 220-line `run_scenario_cluster_object_limits` collapses:
the `hydrate_and_sample` and `probe_and_record` closures and the
N-walk + bisect loop move onto the scenario as `_hydrate_and_sample`,
`_probe_and_record`, and `run_for_cluster_size`. The driver shrinks to
the outer cluster-size loop plus ScenarioRunner construction and
`_recreate_cluster_c` / `reset_for_cluster_size` / `teardown`
bookkeeping.

Behaviour is unchanged: same 13 scenarios, same groups, same CSV
output. Verified by direct module import and pyright/ruff/black.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse three parallel scenario hierarchies (ClusterScalingScenario,
EnvdObjectsScalabilityScenario, ClusterObjectLimitsScenario) and five
run_scenario_* drivers onto one Scenario ABC with a
prepare/scale_points/apply/measure/cleanup_point/teardown lifecycle.
Strong/weak/envd-cpu/envd-objects become thin sweep wrappers around
the existing inner workloads; ClusterObjectLimitsScenario implements
Scenario directly. The result-stream choice now lives on each scenario
via stream_key(), so Family, FAMILY_TO_STREAM, AnyScenario, RunContext
and run_spec all go away.

Also: shared _extend_incremental/_shrink_incremental helpers used by
both envd_objects and cluster_object_limits scenarios, and the four
per-kind workflow_plot_* entries collapse to one workflow_plot that
dispatches by filename suffix.
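
One way to picture the unified lifecycle is the sketch below. The method names come from this message, but the signatures, the driver function `run_scenario`, and the toy subclass are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable


class Scenario(ABC):
    """Lifecycle sketch; method signatures are assumptions."""

    def prepare(self) -> None:
        """One-time setup before the sweep."""

    @abstractmethod
    def scale_points(self) -> Iterable[Any]:
        """Points to sweep, e.g. cluster sizes or object counts."""

    @abstractmethod
    def apply(self, point: Any) -> None:
        """Bring the environment to the given scale point."""

    @abstractmethod
    def measure(self, point: Any) -> list[dict]:
        """Return result rows for this scale point."""

    def cleanup_point(self, point: Any) -> None:
        """Per-point cleanup; default is a no-op."""

    def teardown(self) -> None:
        """One-time cleanup after the sweep."""

    @abstractmethod
    def stream_key(self) -> str:
        """Name of the result stream the rows belong to."""


def run_scenario(scenario: Scenario, streams: dict[str, list[dict]]) -> None:
    scenario.prepare()
    try:
        for point in scenario.scale_points():
            scenario.apply(point)
            try:
                rows = scenario.measure(point)
                streams.setdefault(scenario.stream_key(), []).extend(rows)
            finally:
                scenario.cleanup_point(point)
    finally:
        scenario.teardown()


class ToyObjectsSweep(Scenario):
    """Toy scenario that tracks a counter instead of issuing DDL."""

    def __init__(self) -> None:
        self.objects = 0

    def scale_points(self) -> Iterable[int]:
        return [1, 10, 100]

    def apply(self, point: int) -> None:
        self.objects = point  # stand-in for creating the missing objects

    def measure(self, point: int) -> list[dict]:
        return [{"n": point, "objects": self.objects}]

    def stream_key(self) -> str:
        return "envd_objects_scalability"


streams: dict[str, list[dict]] = {}
run_scenario(ToyObjectsSweep(), streams)
```

Because the stream choice lives on the scenario, the driver needs no family-to-stream table.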

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rop dead replica_size param

ClusterObjectLimitsScenario._probe_and_record was the only caller that
bypassed runner.add_result and wrote rows directly via
results_writer.writerow(...), because add_result didn't know about the
extra `healthy` column. Extend add_result with optional `time_ms` (for
values already in ms) and `healthy` kwargs so the cluster_object_limits
path matches every other scenario. The `healthy` column is silently
dropped on streams whose schema doesn't include it via the existing
extrasaction="ignore".

Also drop the `replica_size` constructor parameter from ScenarioRunner:
every sweep wrapper passes None and mutates `runner.replica_size`
in `apply()`. The param was dead weight. _probe_and_record's
`replica_size` argument is dropped for the same reason.
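
The `extrasaction="ignore"` behaviour is standard `csv.DictWriter` functionality and can be seen in isolation. The field lists below are shortened placeholders, not the real fieldname lists.

```python
import csv
import io

FIELDNAMES = ["scenario", "cluster_size", "time_ms"]
FIELDNAMES_WITH_HEALTHY = FIELDNAMES + ["healthy"]


def write_row(fieldnames: list[str], row: dict) -> list[str]:
    """Write a header plus one row and return the resulting CSV lines."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue().splitlines()


row = {
    "scenario": "cluster_object_limits_mvs",
    "cluster_size": "100cc",
    "time_ms": 150.0,
    "healthy": True,
}

without_healthy = write_row(FIELDNAMES, row)
with_healthy = write_row(FIELDNAMES_WITH_HEALTHY, row)
```

The same row dict can be handed to every stream; only the streams that declare `healthy` write it.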

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aljoscha and others added 30 commits May 20, 2026 17:34
Add mz_persist_state_apply_calls_by_source_shard_kind IntCounterVec
with labels [source, shard_kind]. Plumb a &'static str `source`
parameter through State::apply_encoded_diffs / apply_diffs /
apply_diff so each call site is attributed. Four runtime sources:

* cas_update  — apply.rs Applier::fetch_and_update_state fast path
* slow_refetch — state_versions.rs::fetch_current_state full replay
* pubsub_push — cache.rs PubSub broadcast intake
* state_iter  — state_versions.rs StateVersionsIter::next walks (GC,
                usage audit, admin inspect)

Used to identify the source of the catalog/txns apply_diff count
explosion at N=10k in the envd-ddl-scalability investigation.

Ran the bench three times with the new source-labeled counter.
Every catalog apply_diff invocation we observe is attributable to
state_iter (StateVersionsIter::next, used by GC). The original
"335/DDL catalog apply_diff" mystery resolves cleanly: exactly one
GC fires during the 100-rep N=10k window and walks ~17,000 live
diffs accumulated on the catalog shard since the previous GC.

Cross-check confirms it: shard_gc_finished{catalog} delta is +1 in
the run-3 N=10k window, shard_gc_live_diffs gauge reads 17,229 at
that GC, and the state_iter source counter delta is also 17,229.
Same number, three independent counters, one event.

Slope impact: ~1.7 ms/DDL of catalog state-apply work (17k calls
times ~10 us each, divided across the 100-rep window). Background
work, so its critical-path contribution is just the Tokio-scheduler
tax — third-tier behind the two CAS-RPC slopes (catalog +7 ms,
txns +4 ms in this run).

Run-to-run variance is high: across three runs, catalog apply_diff
at N=10k ranged from ~0 (no GC during the window) to 172/DDL (one
big GC). When GC didn't fire, this whole component was zero. The
previously-reported "+3.4 ms catalog state_apply slope" was from
a run that happened to coincide with a GC firing — real, but not
a persistent slope component.

The dominant slope remains catalog consensus_cas RPC time
(+1.0-1.9 ms per call x 5.6 calls = +6-11 ms/DDL). Microbench
already ruled out per-shard state size as the cause; next move is
to factor the CAS RPC into Tokio-scheduler-wait vs wire-time, to
confirm shared-runtime contention from the 10k user_data shards.

Add a new mz_persist_consensus_wire_seconds_by_shard_kind histogram
recorded inside BogoConsensus around the inner bogo gRPC client call.
It uses the same axes (op + shard_kind, identical buckets) as the
existing mz_persist_external_op_latency_by_shard_kind, but is measured
one layer deeper.

The existing external_op_latency_by_kind is recorded by
MetricsConsensus around its run_op wrapper. That sits *inside* the
Tasked spawn and *outside* the bogo gRPC client adapter. Subtracting
the new wire metric from it gives "post-spawn wrapper overhead",
leaving wire = client tonic send + HTTP/2 + server processing +
return.

Wiring: BogoConsensusConfig grows an optional wire_timer callback
(no-op for non-bogo backends, no impact on other consensus impls).
PersistClientCache::open_consensus attaches a closure that derives
shard_kind via the existing classifier and observes into the new
HistogramVec. Other Consensus methods (head/scan/truncate) are
wired up for consistency.

`update_state_metrics` was iterating every `Vec.len()` in the
in-memory BTreeMap on every CAS to recompute `versions_total`. At
10k shards that's ~100 µs of work held under the same `std::sync::
Mutex` that serializes every operation. With ~100 concurrent CAS
in flight from envd's background work, the lock queue depth
explodes and every CAS — catalog, txns, user_data — pays the
queueing cost.

This was eating the entire catalog `consensus_cas` slope we'd been
chasing. With the fix:

  catalog wire mean: 0.77 ms (N=5k) -> 0.29 ms (N=10k)  (was +1.02)
  user_data wire mean: 6.68 ms       -> 0.92 ms          (was +28.3)

Both flat across the scale jump. The "catalog CAS grows with N"
finding was a bench artifact, not a Materialize scaling problem.

Replace with incremental counters: bump `shards_total` when a CAS
creates a new key, bump `versions_total` by +1 on a successful CAS,
decrement by the deletion count on truncate. Mutex is released
before bumping so even the IntGauge.add isn't on the contended
path. Constant time per op, no scan.
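
The change in accounting can be illustrated with a small Python analogue (the actual code is Rust). `ShardStore`, `Gauge`, and the method names are hypothetical; the point is that the totals are bumped incrementally after the lock is released, and a full scan is only used here to cross-check them.

```python
import threading
from typing import Optional


class Gauge:
    """Tiny thread-safe counter standing in for a metrics gauge."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value = 0

    def add(self, delta: int) -> None:
        with self._lock:
            self._value += delta

    def get(self) -> int:
        with self._lock:
            return self._value


class ShardStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._shards: dict[str, list[int]] = {}
        self.shards_total = Gauge()
        self.versions_total = Gauge()

    def compare_and_set(self, shard: str, expected: Optional[int], new: int) -> bool:
        with self._lock:
            versions = self._shards.get(shard)
            current = versions[-1] if versions else None
            if current != expected:
                return False
            created = versions is None
            if created:
                self._shards[shard] = [new]
            else:
                versions.append(new)
        # Lock released: metric updates stay off the contended path.
        if created:
            self.shards_total.add(1)
        self.versions_total.add(1)
        return True

    def truncate(self, shard: str, below: int) -> int:
        with self._lock:
            versions = self._shards.get(shard, [])
            kept = [v for v in versions if v >= below]
            deleted = len(versions) - len(kept)
            if shard in self._shards:
                self._shards[shard] = kept
        self.versions_total.add(-deleted)
        return deleted

    def scan_versions_total(self) -> int:
        """O(number of shards) recount, the work the fix removes from every CAS."""
        with self._lock:
            return sum(len(v) for v in self._shards.values())


store = ShardStore()
results = [
    store.compare_and_set("a", None, 1),
    store.compare_and_set("a", 1, 2),
    store.compare_and_set("a", 1, 3),   # stale expectation, rejected
    store.compare_and_set("b", None, 1),
]
deleted = store.truncate("a", below=2)
```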
Re-ran the bench against CRDB consensus instead of bogo to confirm the
post-fix bogo conclusion holds on a real backend. It does: the
Materialize-side slope (~+15 ms/+5k tables on create p50) reproduces
nearly identically on CRDB. Bogo's flat CAS-mean wasn't hiding a
backend-only problem.

CRDB adds two things on top:
 - a flat ~+28 ms baseline tax from real consensus RPCs (~2 ms/CAS vs
   <0.5 ms on bogo) × ~12 CAS/DDL.
 - a mild secondary CAS-mean slope (catalog 1.88 → 2.11 → 3.80 ms
   across 5k → 10k → 15k); counts are stable so this is per-RPC
   slowdown, plausibly index/plan growth on the consensus table.

Resource ceiling: envd hit 2.9 GiB RSS at N=15k, CRDB stayed under 1
GiB — plenty of headroom on this box, can extend the curve if needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add `mz_catalog_transact_phase_seconds{phase=...}` histogram that
splits each Catalog::transact call into:

  - transact_inner       (super-timer for the inner method)
  - op_loop              (per-op transact_op + preliminary apply)
  - final_apply_updates  (post-loop combined apply on final state)
  - prepare_state        (storage_collections.prepare_state)
  - post_prepare_apply_updates (final apply after prepare_state)
  - tx_commit            (durable tx.commit, the persist CAS path)
  - assign_state         (self.state = new_state)

The metric is owned by the `Coordinator`'s `Metrics` struct and
plumbed into `Catalog` as an `Option<HistogramVec>` via a new
`set_transact_phase_metrics()` setter, called once at coordinator
startup. `transact_incremental_dry_run` doesn't get the metric --
dry-run DDL-txn paths are deliberately excluded so they don't
pollute the foreground numbers.
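
As a language-neutral illustration of the pattern (the real instrumentation is a Rust `HistogramVec`), a phase timer can be sketched in Python as below. `PhaseTimer` and the `transact` stub are hypothetical, and the nesting of phases under `transact_inner` is an assumption.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Iterator


class PhaseTimer:
    """Collects per-phase durations, one list of observations per label."""

    def __init__(self) -> None:
        self.observations: dict[str, list[float]] = defaultdict(list)

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.observations[name].append(time.perf_counter() - start)


def transact(timer: PhaseTimer) -> None:
    # Stub bodies; only the timing structure matters here.
    with timer.phase("transact_inner"):
        with timer.phase("op_loop"):
            pass
        with timer.phase("tx_commit"):
            pass
    with timer.phase("assign_state"):
        pass


timer = PhaseTimer()
transact(timer)
```

Recording in `finally` means a phase that raises still contributes an observation, which keeps the super-timer and its parts comparable.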

Motivation: instrument the +9.87 ms/+5k slope on
`catalog_transact_with_ddl_transaction` so we can attribute it to
specific phases. See test/envd-ddl-scalability/NOTES.md (next
commit) for the breakdown.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-ran bogo bench at N=5k/10k/15k with the new
mz_catalog_transact_phase_seconds metric. Time inside Catalog::transact
accounts for only ~2.7 ms of the +9.0 ms 5k->10k slope; the other
+6.3 ms is in the Coordinator wrapper layer (Arc::make_mut(catalog),
builtin_table_updates execute, apply_catalog_implications, finalize).
At 10k->15k it's even more lopsided: +4.7 inside vs +10.3 outside.

Inside the inner method, tx_commit is the biggest mover (~half of
the inside-transact slope). prepare_state shows a 7x hockey-stick
from 10k to 15k (0.19 -> 1.41 ms) -- the storage_controller side
warrants its own look. op_loop and final_apply_updates grow modestly,
matching the state-apply attribution from the prior iteration.

Next: instrument the outside wrapper layer
(coord_arc_make_mut, coord_post_transact, etc.) to split that
+6-10 ms/+5k outside slope into named pieces.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends mz_catalog_transact_phase_seconds with six wrapper-layer
labels so we can split the +6-10 ms/+5k "outside Catalog::transact"
slope:

  - coord_inner_total       (entire method, cross-check)
  - coord_pre_transact      (entry -> just before catalog.transact)
  - coord_arc_make_mut      (just Arc::make_mut(catalog))
  - coord_post_transact     (just after catalog.transact -> return)
  - coord_builtin_table_execute (builtin_table_update().execute())
  - coord_finalize          (the config/tracing finalize block)

`metrics.catalog_transact_phase_seconds` is cloned out before the
existing destructure of `self` so we can keep using `self` for
`builtin_table_update()` etc. afterwards.

Motivation: the prior commit showed +6-10 ms/+5k of slope lives
outside Catalog::transact in the Coordinator wrapper. See
test/envd-ddl-scalability/NOTES.md (next commit) for what the
breakdown actually shows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-ran the bogo bench at N=5k/10k/15k with wrapper-layer phase
instrumentation. Three confirmations and one ruling-out:

  - coord_builtin_table_execute is the single biggest slope
    component: +4.14 then +4.45 ms per +5k. At N=15k it's 16.85
    ms/DDL, ~29% of total DDL latency.
  - coord_pre_transact is flat (~3.2 ms) across scales -- the op
    pre-walk + validate_resource_limits + write-ts grab don't
    scale.
  - apply_catalog_implications is roughly flat (~11-13 ms) -- big
    but not the slope owner.
  - Arc::make_mut(catalog) is ~0 ms at all scales: the catalog Arc
    is uniquely held in the hot path, so the make_mut clone never
    fires. Earlier hypothesis ruled out.

builtin_table_update().execute() is essentially a call to
Coordinator::group_commit(None).await. The size of
builtin_table_updates per DDL is small (a few rows in mz_objects
etc.), so the growth has to be inside group_commit itself -- next
target is to instrument the upper-advancement / table-append
phases there.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove the loop in `group_commit()` that iterated ALL catalog entries on
every group commit to add empty advancement entries for every table. With
~20k+ tables, this was the dominant DDL bottleneck at ~23% of DDL time.

The txn-wal protocol makes explicit per-table advancement unnecessary:
when any transaction commits to the txns shard, the logical upper of ALL
registered data shards advances automatically, including those not involved
in the transaction. The empty advancement entries were performing no useful
work on the storage side.

Also removed the early-return optimization in PersistTableWriteWorker::append
that skipped the txn-wal commit for empty updates. This ensures periodic
group commits (with no actual data writes) still advance the txns shard
upper, maintaining the property that table logical uppers advance even when
no writes are happening.
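
A toy model of the property relied on above, as a sketch only. `TxnsShard`,
`register`, `commit`, and `logical_upper` are hypothetical names for
illustration, not the txn-wal API. The point it tries to show: if a data
shard's logical upper is derived from the shared txns upper, a commit touches
only the shards it writes to, and an empty commit still advances every
registered shard.

```rust
use std::collections::BTreeSet;

/// Toy txns shard: one shared upper plus the set of registered data shards.
/// Hypothetical model, not the real txn-wal types.
struct TxnsShard {
    upper: u64,
    registered: BTreeSet<u32>,
}

impl TxnsShard {
    fn new() -> Self {
        TxnsShard { upper: 0, registered: BTreeSet::new() }
    }

    fn register(&mut self, data_shard: u32) {
        self.registered.insert(data_shard);
    }

    /// Commits at `ts`. `writes` lists only the shards that actually receive
    /// data and may be empty (a periodic group commit). Returns how many
    /// per-shard entries were touched: O(writes), not O(registered).
    fn commit(&mut self, ts: u64, writes: &[u32]) -> usize {
        assert!(ts >= self.upper, "commit timestamp must not regress");
        assert!(writes.iter().all(|s| self.registered.contains(s)));
        self.upper = ts + 1;
        writes.len()
    }

    /// The logical upper of a registered shard is derived from the shared
    /// upper, so no per-shard advancement entry is needed.
    fn logical_upper(&self, data_shard: u32) -> Option<u64> {
        self.registered.contains(&data_shard).then_some(self.upper)
    }
}
```

In this model, dropping the per-table loop changes nothing observable:
`logical_upper` returns the same value for written and unwritten shards.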

Results at ~28k objects (optimized build):
- CREATE TABLE: 374ms → 131ms (65% faster)
- CREATE VIEW: ~96ms
- DROP TABLE: ~97ms
- DROP VIEW: ~86ms

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cute is now flat

Validation bench at N=5k/10k/15k after 5d2d138 (remove O(n) table
advancement loop from group_commit). create_p50 dropped by 5.6 / 9.5 /
14.3 ms across the three scales; per-+5k slope reduction is 38% / 29%.

coord_builtin_table_execute itself is now essentially flat: mean per
call goes 3.80 -> 3.91 -> 4.59 ms (was 8.26 -> 12.40 -> 16.85 pre-fix).
Slope per +5k inside the timer drops from +4.14 / +4.45 ms to
+0.11 / +0.68 ms.

The remaining ~+11 ms-per-+5k headline slope is now spread across
transact_inner, tx_commit, and apply_catalog_implications. Largest
single absolute cost per DDL is apply_catalog_implications at ~11.6 ms
(flat). That's the next investigation target: a sub-phase split to
attribute where the constant 25%-of-DDL cost lives.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a new HistogramVec `mz_apply_catalog_implications_phase_seconds`
labeled by phase. apply_catalog_implications had only a single outer
timer (~11.6 ms per call regardless of N) so we could not tell what
fraction of that constant cost lives in each region.

Phases captured:
* absorb_updates — the implication batching loop at the top of
  apply_catalog_implications
* inner_total — the whole call into apply_catalog_implications_inner
  (split below)
* inner_item_loop — the `for (catalog_id, implication) in implications`
  loop that walks per-item implications
* inner_cluster_loops — the cluster + cluster_replica command loops
* inner_controller_setup — the post-loop calls into the controllers:
  create_source_collections, create_table_collections,
  initialize_storage_collections, vpc_endpoints, alter_*
* inner_dependency_scan — active-sink/peek/copy cleanup + timeline
  association rebuilds
* inner_finalize — the "no error returns allowed" async block: drops,
  retires, background secret/replication-slot cleanup

Same pattern as `mz_catalog_transact_phase_seconds`: histogram cloned
locally, RAII timers via `start_timer()`. No behavioral change.
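
A std-only sketch of the RAII timer pattern, for readers unfamiliar with it.
`PhaseHistogram` and `PhaseTimer` are hypothetical stand-ins for a
label-keyed histogram and its timer guard; the real code uses the metrics
registry, not this type. `apply_implications` is likewise a made-up caller.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Stand-in for a histogram vector keyed by a `phase` label.
#[derive(Default)]
struct PhaseHistogram {
    samples: RefCell<BTreeMap<&'static str, Vec<Duration>>>,
}

/// Guard that records its elapsed time into the histogram when dropped.
struct PhaseTimer<'a> {
    hist: &'a PhaseHistogram,
    phase: &'static str,
    start: Instant,
}

impl PhaseHistogram {
    fn start_timer(&self, phase: &'static str) -> PhaseTimer<'_> {
        PhaseTimer { hist: self, phase, start: Instant::now() }
    }

    fn sample_count(&self, phase: &str) -> usize {
        self.samples.borrow().get(phase).map_or(0, |v| v.len())
    }
}

impl Drop for PhaseTimer<'_> {
    fn drop(&mut self) {
        self.hist
            .samples
            .borrow_mut()
            .entry(self.phase)
            .or_default()
            .push(self.start.elapsed());
    }
}

/// Hypothetical caller: each scope ends a phase, so early returns and `?`
/// would still record a sample without extra bookkeeping.
fn apply_implications(hist: &PhaseHistogram, items: &[u64]) -> u64 {
    let batched: Vec<u64> = {
        let _t = hist.start_timer("absorb_updates");
        items.to_vec()
    };
    // Outer timer: recorded when the function returns.
    let _total = hist.start_timer("inner_total");
    let sum = {
        let _t = hist.start_timer("inner_item_loop");
        batched.iter().sum::<u64>()
    };
    sum
}
```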

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cations

Sub-phase split of mz_apply_catalog_implications_seconds at
N=5k/10k/15k. inner_controller_setup (i.e. create_table_collections
+ initialize_storage_collections, etc.) is 84% of inner_total and
carries ~93% of the slope.

CREATE-only inner_controller_setup is 16.32 / 17.13 / 19.70 ms
across scales. That's the single call into
controller.storage.create_collections opening a fresh persist
WriteHandle + SinceHandle for the new table shard, then a
compare_and_downgrade_since on it.

DROP-only inner_finalize is ~3.6 ms — drop_tables -> txn-wal
append. Mostly flat.

Everything else (absorb_updates, item loop, cluster loops,
dependency scan) is microseconds in this workload and will not
matter until we have real user-cluster activity.

Next: split create_table_collections / create_collections into
phases (write-ts grab, advance_upper, open handles,
downgrade-since, install collection state) to attribute the 20 ms
CREATE cost.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add four sub-phase labels to mz_apply_catalog_implications_phase_seconds:
- create_table_write_ts (get_local_write_ts)
- create_table_advance_upper (catalog.advance_upper)
- create_table_storage_create_collections (controller.storage.create_collections)
- create_table_apply_local_write (apply_local_write)

inner_controller_setup is 16+ ms per CREATE at N=15k and carries the
dominant slope inside apply_catalog_implications. This split tells us
which of the four steps is responsible.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sub-phase split of create_table_collections at N=5k/10k/15k. Inside
apply_catalog_implications's inner_controller_setup (16-22 ms per
CREATE), the four named steps break down as:

- storage.create_collections: 8.16 -> 8.98 -> 11.57 ms (slope +0.82 / +2.59)
- get_local_write_ts:         3.27 -> 3.62 -> 3.61 ms (flat ~3.6)
- apply_local_write:          2.88 -> 3.12 -> 3.29 ms (flat ~3)
- catalog.advance_upper:      0.56 -> 0.61 -> 0.71 ms (small slope)

storage.create_collections is both the dominant absolute cost AND the
dominant slope owner (74% of inner_controller_setup slope). Next
target: split storage_collections::create_collections_for_bootstrap
into open_data_handles, compare_and_downgrade_since, and the
collection-install loop. Existing info_span!s mark the boundaries.

Secondary finding: CREATE TABLE makes two get_local_write_ts calls
(one in catalog_transact_inner, one in create_table_collections),
each ~3 ms. Worth a design review of whether the second is required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…strap

Split storage.create_collections (the slope owner inside CREATE TABLE
controller setup, identified in commit 7491526) into two layers of
sub-phases:

storage_collections.create_collections_for_bootstrap:
 - validate_and_enrich
 - open_persist_client
 - open_data_handles_concurrent (buffer_unordered 50)
 - sort
 - install_collection_states (under collections mutex)
 - synchronize_finalized_shards

storage_controller.create_collections_for_bootstrap:
 - storage_collections_call (inner call to the above)
 - validate_and_enrich
 - open_persist_client
 - open_data_handles_concurrent (the second buffer_unordered, controller's
   own per-collection write handle)
 - register_loop (per-collection for loop with acquire_read_holds)
 - init_source_statistics
 - table_register (persist_table_worker.register batched call)
 - append_shard_mappings
 - run_to_execute

Two new HistogramVecs registered:
 - mz_storage_collections_create_collections_phase_seconds
 - mz_storage_controller_create_collections_phase_seconds

Both keyed on phase label. Will use to attribute the 11.57 ms / +2.59 ms
per +5k slope inside controller.storage.create_collections at N=15k.
… is in Catalog::transact

Phase 7 ran the new mz_storage_collections / mz_storage_controller
create_collections_phase_seconds histograms at N=5k/10k/15k. Findings:

1. storage.create_collections is flat at ~9 ms per CREATE TABLE across
   all three scales. The phase 6 measurement (11.57 ms at N=15k with
   +2.59 ms per +5k slope) was run-to-run variance on a single data
   point, not a real slope.
2. Inside storage_collections::create_collections_for_bootstrap the
   only sub-phase with a visible slope is install_collection_states
   (0.19 -> 0.53 -> 1.54 ms). That's the post-stream loop running under
   the collections mutex; the work is BTreeMap inserts plus channel
   sends. Probably not worth fixing at current N.
3. open_data_handles_concurrent (the buffer_unordered stream that opens
   the SinceHandle + WriteHandle pair) is the biggest absolute cost at
   ~5 ms/CREATE but it's flat. The per-shard persist work doesn't grow
   with the number of already-registered shards.
4. apply_catalog_implications.inner_controller_setup is also flat now
   (9.43 -> 9.44 -> 9.72 ms). Phase 5's slope claim was either fixed
   between then and now, or also variance on a single N=15k point.

Where the slope actually lives now: Catalog::transact, specifically
transact_inner (+2.5 ms/+5k), tx_commit (+1.5 ms/+5k), op_loop
(+0.8 ms/+5k), and the apply_updates family (+1.8 ms/+5k combined).
coord_inner_total is +5.48 ms per +5k tables per call; doubled across
CREATE+DROP that's +11 ms/+5k per DDL, which fully accounts for the
observed create_p50 slope of +9-10 ms/+5k.

Next iteration target: instrument tx_commit and the StateUpdate
appliers inside Catalog::transact.
Adds two HistogramVecs to drill into tx_commit, which phase 7 attributed
+1.51 ms / +5k slope to:

  mz_catalog_commit_transaction_phase_seconds{phase}
    - caa_fence_check
    - caa_encode
    - caa_persist_caa_inner    (the inner compare_and_append_inner wrapper)
    - caa_persist_compare_and_append (the actual persist write_handle.CaA)
    - caa_since_downgrade      (since handle maybe_compare_and_downgrade_since)
    - caa_post_sync            (the sync(next_upper) call after CaA)

  mz_catalog_sync_phase_seconds{phase}
    - listen_fetch     (listen.fetch_next, summed across iterations)
    - apply_updates    (apply_updates, summed across timestamps)
    - consolidate      (maybe_consolidate + final consolidate)

sync_phase_seconds aggregates across all iterations within a single
sync_inner call, so each tx_commit produces one sample per phase, not
one per listen event.
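
A sketch of the "sum across iterations, observe once" shape described above.
`SyncPhaseTotals`, `timed`, and `sync_once` are hypothetical names; the
batches stand in for listen events and the `Vec` stands in for the histogram.

```rust
use std::time::{Duration, Instant};

/// Per-phase totals for one hypothetical sync call.
#[derive(Default, Debug)]
struct SyncPhaseTotals {
    listen_fetch: Duration,
    apply_updates: Duration,
}

/// Runs `f`, adds its wall time to `slot`, and returns its result.
fn timed<T>(slot: &mut Duration, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    *slot += start.elapsed();
    out
}

/// Drains all batches, accumulating time per phase, then pushes exactly one
/// sample for the whole call regardless of how many batches were seen.
fn sync_once(batches: Vec<Vec<i64>>, samples: &mut Vec<SyncPhaseTotals>) -> i64 {
    let mut totals = SyncPhaseTotals::default();
    let mut pending = batches.into_iter();
    let mut state = 0i64;
    loop {
        let next = timed(&mut totals.listen_fetch, || pending.next());
        let Some(batch) = next else { break };
        timed(&mut totals.apply_updates, || state += batch.iter().sum::<i64>());
    }
    samples.push(totals);
    state
}
```

With this shape the sample count tracks the number of sync calls, so a mean
per sample reads directly as cost per tx_commit.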

Will pin down whether the +1.51 ms / +5k tx_commit slope lives in the
persist CaA, the post-CaA sync, or the in-memory consolidate.
Phase 8 attributed the +7 ms/+5k slope in tx_commit at N=5k -> N=10k
to mz_catalog_sync_phase_seconds{phase="consolidate"}:

  N=5k:   0.64 ms/call x 6 calls/DDL = 3.82 ms/DDL
  N=10k:  1.87 ms/call x 6 calls/DDL = 11.20 ms/DDL
  N=15k:  2.39 ms/call x 6 calls/DDL = 14.37 ms/DDL

Root cause: sync_inner ended with an unconditional consolidate(), which
does O(N log N) work on the entire snapshot. For the steady-state case
(one timestamp per sync, ~5-10 new entries added to a 15k-entry snapshot)
this ran on every DDL commit, paying the full sort + dedup cost for a
trivial delta.

The doubling-threshold maybe_consolidate (added in MaterializeInc#36233) was supposed
to amortize this, but it never triggered: sync_inner reset
size_at_last_consolidation to None at the top of every call, so the
threshold was always re-baselined against the current snapshot size.
A single DDL never grows the snapshot by 2x, so maybe_consolidate
inside the loop did nothing — and then the unconditional consolidate
at the end paid the full O(N log N) cost.

Two changes:
 1. Drop the size_at_last_consolidation = None reset at the top of
    sync_inner. The doubling threshold is meant to amortize across the
    snapshot's lifetime, not per sync_inner invocation.
 2. Replace the unconditional self.consolidate() at the end with
    self.maybe_consolidate(). Combined with the per-ts maybe_consolidate
    inside the loop, this keeps memory bounded at 2x the last
    consolidated size while making per-call cost amortized O(log N)
    instead of O(N log N).

Verified existing tests still pass:
  test_persist_sync_consolidation_not_quadratic     ok
  test_persist_sync_snapshot_stays_bounded_under_churn  ok

The "stays bounded under churn" test (200 renames of one DB) still passes
because the persistent threshold + per-ts maybe_consolidate fires every
~log(N) steps. The "not quadratic" test still passes because total
consolidations during a 100-ts sync stay well under the test's bound of 10.
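
A minimal sketch of the doubling threshold with the baseline kept across
sync calls, under simplifying assumptions: `Snapshot` here is a toy list of
(key, diff) pairs, and the field and method names only mirror the ones
mentioned above. It is not the catalog's implementation.

```rust
/// Toy catalog snapshot: a list of (key, diff) updates.
struct Snapshot {
    updates: Vec<(u64, i64)>,
    /// Length right after the last consolidation. Kept across sync calls
    /// rather than reset per call.
    size_at_last_consolidation: Option<usize>,
    /// Number of full consolidations performed, for illustration.
    consolidations: usize,
}

impl Snapshot {
    fn new() -> Self {
        Snapshot { updates: Vec::new(), size_at_last_consolidation: None, consolidations: 0 }
    }

    /// Full O(N log N) pass: sort by key, sum diffs per key, drop zeros.
    fn consolidate(&mut self) {
        self.updates.sort_unstable_by_key(|&(key, _)| key);
        let mut merged: Vec<(u64, i64)> = Vec::with_capacity(self.updates.len());
        for &(key, diff) in &self.updates {
            match merged.last_mut() {
                Some(last) if last.0 == key => last.1 += diff,
                _ => merged.push((key, diff)),
            }
        }
        merged.retain(|&(_, diff)| diff != 0);
        self.updates = merged;
        self.size_at_last_consolidation = Some(self.updates.len());
        self.consolidations += 1;
    }

    /// Consolidates only once the snapshot has outgrown twice its size at
    /// the previous consolidation.
    fn maybe_consolidate(&mut self) {
        let threshold = self
            .size_at_last_consolidation
            .map_or(0, |n| n.saturating_mul(2));
        if self.updates.len() > threshold {
            self.consolidate();
        }
    }

    /// One sync call: append the delta, then maybe consolidate. No reset of
    /// the baseline and no unconditional consolidate at the end.
    fn sync(&mut self, new_updates: &[(u64, i64)]) {
        self.updates.extend_from_slice(new_updates);
        self.maybe_consolidate();
    }
}
```

In this toy, 100 small syncs against a 15k-entry snapshot trigger no further
consolidation, while the unconsolidated length stays under twice the baseline.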
Phase 8 split tx_commit into commit_transaction phases + sync phases.
Found mz_catalog_sync_phase_seconds{phase="consolidate"} at:
  N=5k:  3.82 ms/DDL
  N=10k: 11.20 ms/DDL
  N=15k: 14.37 ms/DDL

Phase 9 (post-fix in commit 00d31c5) shows consolidate flat at 0 ms
across all N, tx_commit per-call flat at ~1.13 ms (was 2.92 -> 5.54).
create_p50 at N=15k dropped from 48.93 to 44.51 ms (-4.42 ms);
slope at 10k->15k dropped from +9.11 to +5.81 ms/+5k (35% reduction).

The residual slope has moved entirely to in-memory state-apply paths
(transact_inner outer, op_loop, apply_updates family). Next iteration
target: profile CatalogState::apply_updates for the per-update walk
that still scales with N.
Phase 9 confirmed the catalog `consolidate` slope is gone. The residual
+5.81 ms/+5k slope at 10k→15k is still inside Catalog::transact_inner,
specifically in `apply_updates` and family. We have no visibility into
which sub-step of apply_updates owns that slope.

Add two new histograms:

  * mz_catalog_apply_updates_phase_seconds{phase}
      One observation per apply_updates call, per sub-phase:
        - consolidate_initial (the per-call consolidate_updates)
        - sort_per_group       (sort_updates per timestamp group)
        - apply_updates_inner  (the kind-dispatch loop)
        - cleanup_notices      (drop_optimizer_notices + pack)

  * mz_catalog_apply_update_kind_seconds{kind}
      One observation per StateUpdate inside apply_updates_inner,
      labeled by StateUpdateKind variant (item, schema,
      storage_collection_metadata, etc.). The 1e-5 lower bucket lets us
      see individual updates down to roughly 10 us, so per-kind contributions
      add up correctly across hundreds of updates per DDL.

Strong working hypothesis for the slope: `Arc::make_mut` on
`storage_metadata` (whose inner `collection_metadata` is a non-persistent
`BTreeMap<GlobalId, ShardId>`) does a full O(N) clone every time the
Arc is shared between `preliminary_state` and `state` Cow's — which
happens twice per DDL through the op_loop + final_apply_updates path.
At N=15k that's ~5 ms of pure clone work, matching the observed slope.

The next phase will run the bench and confirm or refute this from the
per-kind data.
Phase 10's per-kind apply_updates instrumentation pointed at
`StateUpdateKind::Item` as the dominant slope owner (+850 us per call
per +10k tables, on 4 calls per DDL). The work that scales with N is
inside `apply_item_update` / `insert_entry` / `drop_item`, which call
`self.get_schema_mut(...)` to walk
`database_by_id.get_mut(...).schemas_by_id.get_mut(...)`.

`schemas_by_id` is `imbl::OrdMap<SchemaId, Schema>`, which is shared
between `preliminary_state` and `state` Cow's in `transact_inner`.
`get_mut` on a shared `imbl::OrdMap` path-copies the affected B-tree
leaf and **clones every value in the leaf**, not just the targeted
one. Schema embedded three non-persistent maps:

  pub items: BTreeMap<String, CatalogItemId>,
  pub functions: BTreeMap<String, CatalogItemId>,
  pub types: BTreeMap<String, CatalogItemId>,

At N=15k the audit_pad schema's `items` has 15k entries. The B-tree
leaf containing the materialize-database schemas almost certainly fits
in one imbl chunk, so any `apply_item_update` (even mutating
audit_meas, not audit_pad) leaf-copies audit_pad's Schema and clones
its 15k-entry BTreeMap. That's the O(N) memcpy+tree-build per call
that the phase 10 per-kind metric attributed to `item`.

Switching items/functions/types to `imbl::OrdMap<String, CatalogItemId>`
makes the per-leaf-clone path O(1) (refcount bump on a persistent
tree root). All call sites use only operations common to both types
(get / insert / remove / contains_key / is_empty / len / values /
iter), so it's a drop-in. The fn pointer signature in
`CatalogState::resolve` is updated to match.

Phase 11 bench follow-up will validate that `item` mean per call
flattens with N.
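
A std-only sketch of why an inline BTreeMap inside a leaf value makes a
leaf copy O(N). All types here are hypothetical stand-ins: `ItemId` counts its
own clones, `InlineSchema` models the old field layout, and `SharedSchema`
uses `Arc<BTreeMap<..>>` purely to model a cheap clone. The `Arc` does not
model imbl's structural sharing on mutation; it only illustrates the clone
cost difference.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

static ITEM_CLONES: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an item id that counts how often it is cloned.
#[derive(Debug, PartialEq)]
struct ItemId(u64);

impl Clone for ItemId {
    fn clone(&self) -> Self {
        ITEM_CLONES.fetch_add(1, Ordering::Relaxed);
        ItemId(self.0)
    }
}

/// Schema with an inline, non-persistent map: cloning it clones every entry.
#[derive(Clone)]
struct InlineSchema {
    items: BTreeMap<String, ItemId>,
}

/// Schema whose map sits behind a refcount: cloning it bumps a counter.
#[derive(Clone)]
struct SharedSchema {
    items: Arc<BTreeMap<String, ItemId>>,
}

fn build_items(n: u64) -> BTreeMap<String, ItemId> {
    (0..n).map(|i| (format!("t{i}"), ItemId(i))).collect()
}

/// Clones a "leaf" of sibling schemas, as a path-copy would, and returns how
/// many `ItemId` clones that cost.
fn clones_for_leaf_copy<S: Clone>(leaf: &[S]) -> usize {
    let before = ITEM_CLONES.load(Ordering::Relaxed);
    let _copy: Vec<S> = leaf.to_vec();
    ITEM_CLONES.load(Ordering::Relaxed) - before
}
```

Copying a leaf that holds a 15k-item schema next to a 3-item one costs 15,003
item clones in the inline layout and none in the shared layout, even if only
the small schema was the target of the mutation.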
…ections

Phase 10's per-kind apply_updates instrumentation showed
`StateUpdateKind::StorageCollectionMetadata` had a clean
+200 us/call slope at +5k tables (2 calls per DDL, ~+0.4 ms/DDL).

`StorageMetadata` lives behind `Arc<StorageMetadata>` on `CatalogState`.
The `preliminary_state`/`state` Cow pattern in `Catalog::transact_inner`
shares this Arc, so the first `Arc::make_mut(storage_metadata)` per
independently-owned `CatalogState` deep-clones its fields:

  pub collection_metadata: BTreeMap<GlobalId, ShardId>,
  pub unfinalized_shards: BTreeSet<ShardId>,

With N=15k tables, `collection_metadata` has ~15k entries, and the
`BTreeMap` clone is O(N). At 2 make_mut'd `CatalogState`s per DDL,
that's two full clones per DDL.

Switch both to imbl::OrdMap / imbl::OrdSet so the clone is O(1)
(persistent tree refcount bump). All external callers use only
operations common to both (.get(), .contains(), .iter()). The only
local API divergence: `imbl::OrdSet::insert/remove` return
`Option<T>`, not `bool` — `apply_unfinalized_shard_update` is
adjusted accordingly.

Same reasoning as ad197b0 (Schema.items/functions/types), applied
to the storage-side analogue.
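
A small sketch of the `Arc::make_mut` behavior described above, using only
std. The struct names mirror the ones in this message but the definitions are
simplified stand-ins (u64 ids, one field), and `insert_collection` is a
hypothetical helper that reports whether the write had to copy.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Simplified stand-in: the real struct has more fields and id types.
#[derive(Clone, Default)]
struct StorageMetadata {
    collection_metadata: BTreeMap<u64, u64>,
}

#[derive(Clone, Default)]
struct CatalogState {
    storage_metadata: Arc<StorageMetadata>,
}

impl CatalogState {
    /// Inserts a mapping and returns true if `Arc::make_mut` had to move the
    /// metadata to a fresh allocation, i.e. the Arc was shared and the whole
    /// BTreeMap was cloned.
    fn insert_collection(&mut self, id: u64, shard: u64) -> bool {
        let before = Arc::as_ptr(&self.storage_metadata);
        Arc::make_mut(&mut self.storage_metadata)
            .collection_metadata
            .insert(id, shard);
        before != Arc::as_ptr(&self.storage_metadata)
    }
}
```

Only the first write after a `CatalogState` clone pays the copy; later writes
to the same state, and writes to the original once it is unique again, mutate
in place. With a persistent map inside, that first copy would be cheap too.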
Phase 10 instrumentation (00025cb) split apply_updates into
sub-phases + per-StateUpdateKind, attributing the +1.94 ms/+5k
apply_updates_inner slope at 10k→15k to two non-persistent
collections living behind shared imbl/Arc handles:

  1. Schema.items/functions/types: BTreeMap<String, CatalogItemId>
     — cloned via imbl::OrdMap leaf path-copy in get_schema_mut.
     At N=15k the audit_pad schema's items had 15k entries; every
     apply_item_update clone cost O(N).
  2. StorageMetadata.{collection_metadata, unfinalized_shards}:
     BTreeMap/BTreeSet behind Arc<StorageMetadata>. The
     preliminary_state/state Cow split in transact_inner forces
     Arc::make_mut to deep-clone these on each owned CatalogState.

Phase 11 (ad197b0) and phase 12 (4b6f5d1) swap both to
their imbl persistent counterparts.

Results at N=15k:
 - item kind: 1436 -> 242 us/call (-83%)
 - storage_collection_metadata kind: 567 -> 5.12 us/call (-99%)
 - apply_updates_inner slope (10->15k): +1.94 -> +0.18 ms/DDL
 - create_p50 (15k): 48.93 (phase 7) -> 42.84 (phase 12), -6.09 ms

NOTES.md captures the per-phase tables and writes up the
generalizable pattern: inline value types stored inside an
imbl::OrdMap (or behind shared Arc) silently lose the O(1)
clone property to any non-persistent sub-collection field.
…ue pattern

Read-only audit of every imbl::OrdMap in the catalog/controller hot
paths, looking for the same "outer is persistent, inner is not"
shape that drove phases 11 and 12.

Findings, ranked:

HIGH (same shape, same hot path, drop-in fix):
 - Database.{schemas_by_id, schemas_by_name}: BTreeMap inside a
   Database value held in
   imbl::OrdMap<DatabaseId, Database>. get_schema_mut is on the
   every-apply_item_update path. Worth landing next.

MEDIUM (workload-dependent, not exercised by the audit_pad bench):
 - Cluster.bound_objects: BTreeSet<CatalogItemId> grows with every
   object bound to that cluster. Invisible in this bench (plain
   tables don't bind to a cluster), but expected to slope under MV /
   index workloads on a single cluster.
 - Cluster.replica_id_by_name_ / replicas_by_id_ / log_indexes:
   small in practice.

LOW (small or workload-specific):
 - Role.{vars.map, membership.map}: per-role counts are small.
 - SourceReferences.references: grows with one source's refs, not N.
 - CatalogEntry.{referenced_by, used_by}: small per entry, but the
   16-entry imbl leaf clone of entry_by_id also re-clones
   CatalogItem (incl. optimized/physical plans for MV/Index/CT) for
   sibling entries; matters under MV scale, not table scale.
 - notices_by_dep_id: value is Vec<Arc<_>>, shallow clone is cheap.

Recommended next step: land the Database BTreeMap -> imbl::OrdMap
swap, then design a real MV/index scale bench before considering
the MEDIUM tier. The existing
mz_catalog_apply_update_kind_seconds{kind} histogram is the
canonical signal for whether each tier is worth fixing.
…istent collections

Phase 11+12 fixed the two non-persistent inner collections that the
audit_pad bench exposed (Schema.items and StorageMetadata.collection_metadata).
This commit applies the same fix to the remaining sites identified
by the read-only sweep in 351ddb9:

  Database.{schemas_by_id, schemas_by_name}: BTreeMap -> imbl::OrdMap
  Cluster.{bound_objects, replica_id_by_name_, replicas_by_id_}:
    BTreeMap/BTreeSet -> imbl::OrdMap/imbl::OrdSet
  RoleMembership.map: BTreeMap -> imbl::OrdMap
  RoleVars.map: BTreeMap -> imbl::OrdMap
  SourceReferences.references: Vec -> imbl::Vector

All of these live inside imbl::OrdMap<K, V> on CatalogState (or
transitively inside such a value type), so the
preliminary_state/state Cow split in Catalog::transact_inner forces
imbl leaf path-copies to deep-clone them. The audit_pad bench
doesn't exercise these paths (plain tables don't touch clusters /
roles / sources), but the same shape that produced the +5 ms/DDL
slopes in phases 11+12 would surface on cluster-heavy / role-heavy
workloads.

Two intentional non-changes:

  * Cluster.log_indexes stays BTreeMap<LogVariant, GlobalId>: tight
    API contract with the compute controller (arranged_logs:
    BTreeMap<...>) and bounded by ~10 log variants.
  * CatalogEntry.{referenced_by, used_by} stay Vec<CatalogItemId>:
    the mz_sql::catalog::CatalogItem trait returns &[CatalogItemId]
    for them, which imbl::Vector can't satisfy (no Deref<Target = [T]>).
    Per-entry counts are small and the dominant per-entry clone
    cost during entry_by_id leaf path-copy is item: CatalogItem,
    not these vectors.

Trait signatures in mz_sql::catalog::{CatalogDatabase, CatalogRole,
CatalogCluster} change to match (&imbl::OrdMap / &imbl::OrdSet).

The per-field "why imbl" comments on Schema and StorageMetadata are
removed and consolidated into a single rule-block doc comment on
CatalogState explaining the pattern, the two slopes that motivated
it, and the rule for new fields. Two intentional holdouts are
called out there as well.
…on tables

The phase 13 sweep (1a8446c) switched five more inline-in-imbl::OrdMap
collections off of BTreeMap/BTreeSet/Vec, but the audit_pad bench
(plain CREATE/DROP TABLE) doesn't exercise the cluster, role, or
source paths — so the expected outcome was "no measurable change,
no regression." That's what we got.

item @ N=15k: 242 -> 238 us/call. storage_collection_metadata
unchanged at ~5 us/call. apply_updates_inner total at N=15k flat
at 1.48 ms/DDL. Run-to-run noise dominates.

The sweep is landmine-prevention for cluster-heavy / role-heavy /
source-heavy workloads where the same leaf-clone pattern would
surface a slope on those DDL kinds. A future scale bench targeting
CREATE INDEX / MV / GRANT is the right way to actually measure
those payoffs; the existing per-kind histogram is the canonical
signal.
Adds mz_storage_collections_prepare_state_phase_seconds{phase} so we can
attribute the prepare_state slope (currently 2.4 ms/call at N=15k user
tables, growing roughly linearly) to its sub-phases.

Sub-phases:
 - insert_add, insert_register, delete: txn mutations.
 - dropped_shard_lookup: self.collections.lock() acquire + the loop
   over dropped_mappings.
 - insert_unfinalized: txn.insert_unfinalized_shards.
 - mark_finalized: self.finalized_shards.lock() + txn.mark_shards_as_finalized.

Companion to mz_catalog_transact_phase_seconds (outer) and
mz_apply_catalog_implications_phase_seconds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the txns shard upper advances, BackgroundTask::run propagates the
new upper to every txns-backed user table by calling
self.update_write_frontiers(...) with one entry per table. The previous
implementation held self.collections (the single global mutex shared
with every DDL path, including catalog prepare_state) for the full O(N)
walk.

At N=15k user tables that meant one txns-upper tick = ~10+ ms of held
mutex, plus the BackgroundTask staying CPU-bound while it held it.

Phase-level bench (mz_storage_collections_prepare_state_phase_seconds)
attributed the slope to exactly this: dropped_shard_lookup (which only
does self.collections.lock() + an empty for-loop in our CREATE TABLE
workload) went from 1.6 µs/call at N=5k to 599 µs/call at N=10k,
matching the OUTER prepare_state mean (24 -> 625 µs).

Process updates in chunks of 256, releasing the lock and yielding the
task between chunks. The fix mostly helps other phases that compete
with the BackgroundTask for CPU and lock access: at N=10k the
end-to-end create_p50 drops by 11 ms (63.18 -> 52.04), driven by:
 - coord_builtin_table_execute  -3.58 ms/DDL
 - coord_pre_transact           -1.87 ms/DDL
 - tx_commit                    -1.56 ms/DDL
prepare_state itself gets slightly worse (the per-chunk lock dance
costs a few µs and the BackgroundTask gets to re-win the lock more
often), but the BackgroundTask no longer monopolizes either lock or
CPU for the whole walk, and the DDL pipeline catches up faster.

A future architectural fix would store the shared txns upper in one
place on StorageCollectionsImpl and skip the O(N) propagation
entirely; that's larger and touches every reader of write_frontier
on a txns-backed collection.
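
A synchronous sketch of the chunked walk, assuming a plain
`Mutex<BTreeMap<u64, u64>>` from collection id to write frontier. The real
code is async and yields the task; here `thread::yield_now` stands in for
that, and the function name and return value are for illustration only.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;
use std::thread;

const CHUNK_SIZE: usize = 256;

/// Applies (collection id, new upper) updates, taking the lock once per
/// chunk instead of once for the whole walk. Returns the number of lock
/// acquisitions so the chunking is observable.
fn update_write_frontiers_chunked(
    collections: &Mutex<BTreeMap<u64, u64>>,
    updates: &[(u64, u64)],
) -> usize {
    let mut acquisitions = 0;
    for chunk in updates.chunks(CHUNK_SIZE) {
        {
            let mut guard = collections.lock().unwrap();
            acquisitions += 1;
            for &(id, upper) in chunk {
                // Unknown ids are skipped; frontiers never move backwards.
                if let Some(frontier) = guard.get_mut(&id) {
                    *frontier = (*frontier).max(upper);
                }
            }
        } // Lock released here, before yielding.
        thread::yield_now();
    }
    acquisitions
}
```

The total work is unchanged; what changes is the longest stretch the lock is
held (at most 256 entries) and the chance other threads get to run between
chunks.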

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
NOTES writeup for phase 14:
- mz_storage_collections_prepare_state_phase_seconds attributed the
  prepare_state slope to dropped_shard_lookup (self.collections.lock
  acquire), confirming BackgroundTask::update_write_frontiers as the
  contention source.
- The chunked unlock in update_write_frontiers ended up being a CPU
  yield win, not the lock-wait win the source comment suggested:
  prepare_state's own contention got slightly worse (lock barging
  by the BackgroundTask), but other phases that compete with it for
  CPU caught up much faster.
- Net same-run effect at N=10k: create_p50 63.18 -> 52.04 (-11 ms),
  coord_inner_total 37.59 -> 29.52 (-8 ms).
- prepare_state slope is not flat; a future architectural pass needs
  to store the shared txns upper in one field rather than fanning it
  out to every txns-backed collection's write_frontier on every tick.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>