DNM: envd scalability work#36634
Draft
aljoscha wants to merge 102 commits into
The existing scenarios scale cluster size or envd CPU cores -- nothing
measures how adapter/envd latency moves as the catalog itself grows. Add
two scenarios under a new `envd_scalability` group that fix the
measurement cluster and vary the number of catalog objects.
`envd_scalability_tables` puts N empty tables in the catalog -- pure
catalog/adapter pressure, no controller load. `envd_scalability_mvs`
does N materialized views over a single 1-row base table -- same
catalog footprint, plus controller load proportional to N. The MV
scenario shards across single-replica pad clusters at 10000 MVs per
cluster (so 100k MVs spans 10 clusters), since one cluster can't
reasonably host that many dataflows.
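The sharding rule above can be sketched as follows. This is a minimal illustration, assuming MVs are numbered from 0 and assigned to pad clusters in contiguous blocks; the helper names `pad_cluster_index` and `pad_clusters_needed` are hypothetical and not taken from the harness.

```python
import math

# Per-cluster cap stated in the commit message.
MVS_PER_PAD_CLUSTER = 10_000


def pad_cluster_index(mv_index: int) -> int:
    """Return the 0-based pad cluster hosting the MV with this 0-based index."""
    return mv_index // MVS_PER_PAD_CLUSTER


def pad_clusters_needed(total_mvs: int) -> int:
    """Return how many pad clusters are needed to host total_mvs MVs."""
    return math.ceil(total_mvs / MVS_PER_PAD_CLUSTER)


# 100k MVs span 10 pad clusters, matching the example in the text.
print(pad_clusters_needed(100_000))
```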
For each N in {1, 10, 100, 1k, 3k, 5k, 10k, 20k, 30k, 50k, 100k} we run
10 reps each of `CREATE TABLE` (DDL through the coordinator) and
`SELECT * FROM <1-row table>` (a simple peek on a fixed 100cc cluster).
The catalog is built incrementally across size points, so going from
N=k to the next size point only adds (next - k) objects -- otherwise
we'd pay an O(sizes * N) build cost. The size list is overridable via
`--envd-scalability-sizes` for scaffolding runs.
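A rough sketch of the incremental build, assuming the size points are visited in ascending order. The function name `incremental_steps` is made up for illustration; the real driver may structure this differently.

```python
def incremental_steps(sizes):
    """Yield (target_n, objects_to_add) for each size point, in ascending order.

    Each step only adds the difference to the previous size point instead of
    rebuilding the catalog from scratch.
    """
    current = 0
    for target in sorted(sizes):
        yield target, target - current
        current = target


sizes = [1, 10, 100, 1_000, 3_000]
steps = list(incremental_steps(sizes))

# Objects created in total with the incremental build ...
total_created = sum(added for _, added in steps)
# ... versus rebuilding the whole catalog at every size point.
naive_total = sum(sizes)
```

With the incremental build the total number of created objects equals the largest size point, rather than the sum of all size points.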
Results land in a third CSV (`*.envd_scalability.csv`) reusing the
cluster CSV schema; `mode='envd_scalability'` distinguishes the rows.
Test analytics rides on the existing `cluster_spec_sheet_result` table
-- no schema change needed. The analyzer plots `time_ms` vs N per
(scenario, category, test_name).
This is going to be long-running, especially the MV scenario where each
create exercises the controller -- expect hours for the full size
range.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two new scenarios -- cluster_object_limits_indexes and cluster_object_limits_mvs -- that find, per cluster size, the maximum number of idle materializations one cluster can keep fresh. The materializations are derived from a one-row, never-updated base table so the only work the cluster has to do is keep advancing each materialization's write_frontier in step with the upstream table. Once the cluster can't keep up, freshness collapses; the driver records the largest N at which `max(local_lag) < 2s` was still achievable, with the unhealthy data point recorded too so the cliff is visible. Staging-only (rejects --target=cloud-production), to avoid burning production resources on long object-limit searches.
…lability default at 50k
When a materialization stalls completely (write_frontier never advances past the minimum timestamp), `mz_internal.mz_materialization_lag` reports `now() - 0` = current unix time in ms (~1.78e12). Recorded as-is, this crushes every healthy data point to ~0 on the plot. Cap the recorded value at 10x the healthy threshold (= 20 s), preserve the underlying truth via the `healthy` column, and label the plot to make the cap and healthy threshold explicit.
Also drop 100_000 from the envd_scalability default size list: 50_000 is a more sensible default ceiling for staging. The full size list is still overridable via --envd-scalability-sizes for ad-hoc runs.
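The capping rule for stalled materializations can be sketched like this. The constant and function names are hypothetical; only the 2 s healthy threshold, the 10x cap, and the `healthy` column come from the commit messages.

```python
HEALTHY_THRESHOLD_MS = 2_000            # max(local_lag) < 2s counts as healthy
LAG_CAP_MS = 10 * HEALTHY_THRESHOLD_MS  # recorded values are capped at 20 s


def record_lag(raw_lag_ms: float) -> dict:
    """Return the row fragment to record for one lag measurement.

    The plotted value is capped, while `healthy` is derived from the raw lag,
    so the cap never hides an unhealthy point.
    """
    return {
        "time_ms": min(raw_lag_ms, LAG_CAP_MS),
        "healthy": raw_lag_ms < HEALTHY_THRESHOLD_MS,
    }


# A fully stalled materialization reports roughly the current unix time in ms.
stalled = record_lag(1.78e12)
fresh = record_lag(150.0)
```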
…tion
The release-qualification pipeline already runs three cluster-spec-sheet groups (cluster_compute on production, source_ingestion on production, environmentd on staging). Add two more groups -- envd_scalability and cluster_object_limits -- both running against staging, since both push the catalog / cluster to limits we don't want to exercise on production.
The three "envd / cluster" groups in the cluster-spec-sheet were named inconsistently. Settle on the three concept names the cluster-spec-sheet effort uses verbally:
environmentd -> envd_qps_scalability (QPS vs envd CPU)
envd_scalability -> envd_objects_scalability (latency vs catalog N)
cluster_object_limits -> cluster_object_limits (unchanged)
Renames apply to: scenario constants, scenario-name string values, group keys in SCENARIO_GROUPS, class names, the run/analyze function names, the --envd-scalability-sizes CLI flag, the result CSV suffix, and the `mode` field written into CSV rows. The pre-existing QPS scenarios keep their individual `*_envd_strong_scaling` names since only the group is renamed. Also updates the release-qualification pipeline step ids/args and the README to match.
…w start
When debugging cluster-spec-sheet runs on staging it's hard to tell which environment we're actually talking to and whether the system parameter defaults we expect (lifted via LaunchDarkly or similar) are actually applied. Add a one-shot diagnostic right after target.initialize() that prints mz_environment_id() and SHOWs the limits the test depends on (max_tables, max_materialized_views, max_objects_per_schema, max_clusters, max_credit_consumption_rate, memory_limiter_interval). Best-effort: any probe error is logged and swallowed so a transient failure does not abort the workflow.
psycopg3's execute() requires a LiteralString, so the f-string SHOW query tripped pyright in CI. Compose the statement with psycopg.sql.SQL/Identifier instead, matching the pattern already used in test/orchestratord/mzcompose.py.
A staging run of `envd_objects_scalability_mvs` (release-qualification
build 1219) aborted at N≈19800 with:
Retryable error: consuming input failed: SSL error: unexpected eof
while reading, reconnecting...
psycopg.errors.InternalError_: materialized view
"materialize.pad_schema.pad_mv_19805" already exists
The TLS connection dropped mid-statement; envd had already committed the
CREATE but the response was lost. ConnectionHandler.retryable reconnects
and replays the same statement, which then fails with "already exists".
Use ``CREATE ... IF NOT EXISTS`` for every CREATE issued via _bulk_run so
the retry is a no-op. Affects the bulk-creation paths in both
envd_objects_scalability scenarios (tables, MVs) and both
cluster_object_limits scenarios (indexed views, MVs). Add a docstring on
_bulk_run spelling out the idempotency requirement so future CREATEs
don't reintroduce the hazard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
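The retry hazard described above can be reproduced in miniature. The sketch below uses SQLite from the Python standard library as a stand-in for Materialize, and models the lost response by simply replaying the statement. The helper name `run_with_lost_response` is hypothetical, and the error class differs from the psycopg error in the real run.

```python
import sqlite3


def run_with_lost_response(conn, statement):
    """Execute a statement, then replay it as a reconnecting client would."""
    conn.execute(statement)  # the server commits the CREATE ...
    # ... but the reply is lost, so the retry logic replays the same statement.
    conn.execute(statement)


conn = sqlite3.connect(":memory:")

# A plain CREATE fails on replay because the object already exists.
try:
    run_with_lost_response(conn, "CREATE TABLE pad_t_1 (a INT)")
    plain_create_survived = True
except sqlite3.OperationalError:
    plain_create_survived = False

# With IF NOT EXISTS the replay is a no-op, so the retry is safe.
run_with_lost_response(conn, "CREATE TABLE IF NOT EXISTS pad_t_2 (a INT)")
idempotent_create_survived = True
```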
The 50k scale point pushes a single Envd Objects Scalability run past the 13-hour mark on staging — adapter latency degrades so much by then that each measurement repetition takes several seconds, and the catalog build itself runs at <1/s. 30k is where the interesting signal already lives. Drop 50k from the default list; ad-hoc runs that want it can still pass --envd-objects-scalability-sizes explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits N-list defaults to a +1k linear step past N=1000, which is too coarse: a run on staging showed the cliff sits in (1000, 2000] for cluster_object_limits_indexes across every cluster size 100cc..1600cc, and we can't tell from that whether the limit is 1100 or 1900.
After the coarse N-walk hits its first unhealthy point, bisect the (last_healthy, first_unhealthy) interval --cluster-object-limits-bisect-steps times (default 4) and probe each midpoint. The bisection step adds or drops objects in place -- it never rebuilds the catalog -- so the cost is only ~bisect_steps extra hydrate-and-probe rounds per cluster size. With the default 4 steps, the cliff narrows to ±~60 objects.
Adds:
- `remove_objects(target_n)`, symmetric to `add_objects(target_n)`, on both ClusterObjectLimitsScenario subclasses. The indexes scenario drops via DROP VIEW ... CASCADE (cascades to the default index); the MVs scenario drops via DROP MATERIALIZED VIEW.
- `--cluster-object-limits-bisect-steps` CLI flag plumbed through to `run_scenario_cluster_object_limits`.
- Bisection block in the per-cluster-size loop that calls add+remove (one is a no-op) and records each probe under the same CSV schema, so the existing freshness-lag-vs-N plot just gets denser near the cliff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
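A minimal sketch of the bisection over the (last_healthy, first_unhealthy) interval. The `is_healthy` callback stands in for a full add/remove-objects-then-probe round, and `TRUE_LIMIT` is a made-up cliff used only to exercise the code; neither name comes from the harness.

```python
def bisect_cliff(last_healthy, first_unhealthy, is_healthy, steps=4):
    """Narrow the interval around the cliff by probing midpoints.

    Returns the final (last_healthy, first_unhealthy) pair plus every
    (n, healthy) probe taken along the way.
    """
    lo, hi = last_healthy, first_unhealthy
    probes = []
    for _ in range(steps):
        if hi - lo <= 1:
            break  # interval cannot be narrowed further
        mid = (lo + hi) // 2
        healthy = is_healthy(mid)
        probes.append((mid, healthy))
        if healthy:
            lo = mid
        else:
            hi = mid
    return lo, hi, probes


TRUE_LIMIT = 1_370  # hypothetical cliff, for illustration only
lo, hi, probes = bisect_cliff(1_000, 2_000, lambda n: n <= TRUE_LIMIT)
```

Starting from a 1000-wide interval, four halvings leave an interval of about 62 objects, consistent with the ±~60 figure above.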
If CREATE CLUSTER fails for a cluster_object_limits size — because the target region doesn't expose that replica size, or because allocating the cluster would exceed max_credit_consumption_rate — today the scenario either aborts with a traceback or (when the cluster is created but then can't actually keep up) reports a confusing "unhealthy at N=100" data point. Catch psycopg.errors.DatabaseError around the CREATE CLUSTER, log a clear "size unavailable" line (with the underlying error class + message), and `continue` to the next cluster size. OperationalError is re-raised so that genuine connection failures (which run_query's retry loop has already given up on) aren't silently masked as a size problem. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits plots (max-healthy-N bars + lag-vs-N legend) ordered cluster sizes alphanumerically — "100cc, 1600cc, 200cc, 3200cc, 400cc, 800cc" — making the small→large progression unreadable. Lift the existing `extract_cluster_size` helper to module scope and use it to reindex the bar plot's index and reorder the line plot's columns. The cluster-results path was already using it for its x-axis, so the extraction is just hoisted, not duplicated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
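The ordering fix amounts to sorting by the numeric part of the size name. The sketch below assumes sizes look like `<digits>cc`; the actual signature and behavior of `extract_cluster_size` are not shown in this description, so treat this version as an approximation.

```python
import re


def extract_cluster_size(name: str) -> int:
    """Return the numeric part of a cluster size such as '1600cc'."""
    match = re.search(r"(\d+)cc", name)
    if match is None:
        raise ValueError(f"not a recognized cluster size: {name!r}")
    return int(match.group(1))


# Alphanumeric order, as the plots showed before the fix.
sizes = ["100cc", "1600cc", "200cc", "3200cc", "400cc", "800cc"]

# Numeric order, small to large.
ordered = sorted(sizes, key=extract_cluster_size)
```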
In build 1223 (release-qualification), 1600cc/3200cc showed implausibly small max-healthy-N — 1600cc reported 0 healthy indexes / 93 healthy MVs where 100cc–800cc routinely handled 1500+ indexes and 687+ MVs. Probing the first N (=100) on a freshly-created 1600cc cluster returned local_lag values of 90+ seconds for indexes and ~unix-epoch-ms for MVs (i.e. write_frontier stuck at zero). Once that first probe declared the cluster unhealthy, the bisect could not recover: each successive sample measured more accumulated lag, not less, because the cluster never got a chance to settle. Likely cause is cold-start: bigger replicas take longer to begin serving frontiers after CREATE CLUSTER + bulk DDL, and the 60s hydration window expires before steady state. Bump it to 300s as a first diagnostic — if 1600cc/3200cc now look healthy at reasonable N, this confirms the hypothesis and we can keep the higher timeout (or make it size-dependent). If they still look broken, the issue is elsewhere (provisioning, multi-process replica semantics, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ult cap
The envd_objects_scalability default cap was reduced from 100k to 30k in bdb6607 but two comments still referenced the older shape. Update the system-parameter rationale to describe the headroom relationship between the lifted ceilings (200k) and the user-configurable cap (default 30k), and update the analyzer docstring to match.
…base_t
EnvdObjectsScalabilityMvsScenario called its one-row base table pad_schema.pad_base while ClusterObjectLimitsScenario called the same-shaped table pad_schema.base_t. Use BASE_TABLE = "base_t" consistently.
The new envd_objects_scalability and cluster_object_limits teardown
paths each open-coded a try/except + print("WARNING: failed to drop ...")
block around their DROP statements. Pull the pattern into a single
helper used by all four call sites.
The four DictWriter blocks in workflow_default repeated almost the same 10-field fieldname list. envd_objects_scalability claimed in a comment to "reuse the cluster-focused schema" but spelled it out anyway, and cluster_object_limits redeclared the same list plus a single extra column. Hoist CLUSTER_FIELDNAMES + ENVD_FIELDNAMES and build all four writers from them via a small _make_csv_writer helper.
The four analyze_*_results_file functions repeated the same six-line header: print banner, read CSV, empty check, derive base_name, build plot_dir, mkdir. Pull it into a helper that returns (df, plot_dir) or None when the file is empty.
hydrate_and_sample's inner probe_once helper wrapped each probe with SET cluster=<probe>; <select>; SET cluster=c — three round-trips per probe. Over a 300s hydration window plus 5 steady-state samples that adds up to ~900 redundant SETs per N. Move the two SETs to a single try/finally around the whole polling window; the per-probe work is now just the lag SELECT.
The "DROP CLUSTER IF EXISTS c CASCADE; CREATE CLUSTER c SIZE ...; SET cluster = 'c'" sequence appeared verbatim in five run_scenario_* functions (strong, envd_strong_scaling, envd_objects_scalability, cluster_object_limits, weak), and the one-row probe-table prep in three of them. Move both into helpers; the cluster_object_limits "skip if size unavailable" branch becomes a parameter on the helper rather than an open-coded try/except at the call site.
…stry
Four workflow_plot_* functions were structurally identical (parser arg, parse, glob, call one analyzer); the multi-kind workflow_plot did the same plus a 4-way if/elif suffix dispatch. Replace both shapes with a shared _plot_files helper that takes either a fixed analyzer (per-kind workflows) or dispatches via the new SUFFIX_ANALYZERS registry (the multi-kind workflow_plot). The five workflow_* functions are now one-call wrappers.
…ario subclasses
The two ClusterObjectLimits scenario subclasses differed only in (a) which
DDL statements create/drop one materialization (CREATE VIEW + CREATE
DEFAULT INDEX vs CREATE MATERIALIZED VIEW), and (b) which catalog table
mz_materialization_lag is joined against. Lift the differences into a
frozen ClusterObjectLimitsKind dataclass carrying create/drop SQL
templates plus the lag-filter join, and have a single
ClusterObjectLimitsScenario class read its kind to drive add_objects /
remove_objects / probe_lag_ms. The two scenarios are now constructed as
ClusterObjectLimitsScenario(CLUSTER_OBJECT_LIMITS_{INDEXES,MVS}_KIND).
EnvdObjectsScalability{Tables,Mvs} are left as separate subclasses: the
MV variant carries pad-cluster sharding state and a distinct init/teardown,
so collapsing them would just hide the structural difference behind
conditionals rather than remove duplication.
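The kind dataclass described above might look roughly like the following. The field names, SQL templates, object names, and the joined catalog tables (`mz_indexes`, `mz_materialized_views`) are assumptions for illustration; the commit only states that the kind carries create/drop SQL templates plus the lag-filter join.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterObjectLimitsKind:
    """Everything that differs between the indexes and MVs scenarios."""

    name: str
    create_sql: tuple[str, ...]  # templates, formatted with the object index i
    drop_sql: tuple[str, ...]
    lag_join_table: str          # assumed: table joined to mz_materialization_lag


INDEXES_KIND = ClusterObjectLimitsKind(
    name="indexes",
    create_sql=(
        "CREATE VIEW pad_v_{i} AS SELECT * FROM base_t",
        "CREATE DEFAULT INDEX ON pad_v_{i}",
    ),
    drop_sql=("DROP VIEW pad_v_{i} CASCADE",),
    lag_join_table="mz_indexes",
)

MVS_KIND = ClusterObjectLimitsKind(
    name="mvs",
    create_sql=("CREATE MATERIALIZED VIEW pad_mv_{i} AS SELECT * FROM base_t",),
    drop_sql=("DROP MATERIALIZED VIEW pad_mv_{i}",),
    lag_join_table="mz_materialized_views",
)


def create_statements(kind, start, stop):
    """Expand the kind's create templates for objects start..stop-1."""
    return [t.format(i=i) for i in range(start, stop) for t in kind.create_sql]
```

A single scenario class can then be parameterized by `INDEXES_KIND` or `MVS_KIND` instead of being subclassed.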
…ion_statuses
The freshness probe previously combined "is the dataflow running yet?"
with "is the cluster keeping up?" into a single predicate:
reporting == N AND max_local_lag_ms < lag_threshold_ms
That meant every unhealthy probe burned the full hydration timeout
(300s in build 1226, see ace6b0f) before declaring failure: the lag
on an overloaded cluster never falls under 2s, so the loop polls to
the deadline and only then captures the lag. In 1226 the 100cc
N=2000, 200cc N=3000, and 1600cc N=2000 probes each sat for ~301s
before recording lag values of 654s–675s. Bisecting an unhealthy
region pays this cost again at every step.
Split into two phases:
1. Poll `mz_internal.mz_hydration_statuses` until every test object
on `c` reports `hydrated = true`. This is a definitive per-object
signal — the dataflow has finished initial snapshotting — and
converges quickly even on cold-started 1600cc/3200cc replicas.
Timeout here means the replica is genuinely wedged.
2. Once hydrated, take the existing `CLUSTER_OBJECT_LIMITS_SAMPLES`
steady-state lag samples. Unhealthy now means "hydrated but
can't keep up", which is the property we actually want to
measure; an overloaded cluster trips the threshold in
`samples * sample_interval` (~10s) instead of in 300s.
With this decoupling the cold-start argument for the 300s timeout no
longer applies, so drop it back to 60s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
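The two-phase probe can be sketched as below. The callbacks `probe_hydrated` and `probe_lag_ms` stand in for the SQL probes, and the 1 s poll interval, the 2 s sample interval, and the shape of the returned dict are assumptions; the injectable `sleep` and `clock` parameters exist only so the sketch can be exercised without waiting.

```python
import time


def hydrate_and_sample(probe_hydrated, probe_lag_ms, n, *, timeout_s=60.0,
                       samples=5, sample_interval_s=2.0, lag_threshold_ms=2_000,
                       sleep=time.sleep, clock=time.monotonic):
    """Phase 1: wait until all n objects are hydrated. Phase 2: sample lag."""
    deadline = clock() + timeout_s
    # Phase 1: a timeout here means the replica never finished hydrating.
    while probe_hydrated() < n:
        if clock() >= deadline:
            return {"hydrated": False, "healthy": False, "max_lag_ms": None}
        sleep(1.0)

    # Phase 2: a fixed number of steady-state samples, so an overloaded
    # cluster is detected in about samples * sample_interval_s seconds.
    lags = []
    for _ in range(samples):
        lags.append(probe_lag_ms())
        sleep(sample_interval_s)
    max_lag = max(lags)
    return {"hydrated": True, "healthy": max_lag < lag_threshold_ms,
            "max_lag_ms": max_lag}
```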
…dration budget
In build 1227, 3200cc indexes and 1600cc MVs both produced false cliffs at N=100: the very first probe after CREATE CLUSTER timed out with 0/100 hydrated, but every subsequent bisect step (N=50/75/87/93) hydrated cleanly with lag=0.0. The cluster works fine -- the replica just isn't reporting introspection within 60s of being created on multi-process sizes.
Thread a per-call `timeout_s` into `hydrate_and_sample` and let `probe_and_record` pick between the regular and a longer "first probe" budget. The coarse N-walk passes `first_probe=True` only on its first iteration, so big-replica cold start gets headroom while every other probe keeps the tight 60s budget that makes unhealthy points cheap to record.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No behavior changes; just shorter explanations where the original
prose restated the code or expanded beyond what a reader needs:
- MATERIALIZED_ADDITIONAL_SYSTEM_PARAMETER_DEFAULTS rationale: 7 → 3 lines
- _bulk_run docstring: 12 → 5 lines (keep the idempotency warning)
- hydrate_and_sample docstring: 25 → 13 lines (keep the "why split
the phases" justification)
- probe_lag_ms / probe_hydrated docstrings: drop the tuple-field
enumeration that duplicates the return type
- collapse the two "framework setup/drop unused" comments
- drop pure-label comments ("Snapshot of cluster sizes", "Outer loop")
and the init/teardown lines that restate the next statement
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`workflow_default` carried a 150-line if-ladder mapping each `SCENARIO_*`
string to a `run_scenario_*` invocation, plus four parallel sections that
open / upload / archive / analyze one CSV each. Adding a new scenario or
result kind meant touching every one of those.
Replace both with two data registries:
- `ScenarioSpec` + `SCENARIOS`: name, log label, family, factory lambda,
groups. `SCENARIOS_BY_NAME` and `SCENARIO_GROUPS` are derived from it,
so the hand-written `SCENARIOS_CLUSTERD` / `_COMPUTE` / etc. lists go
away. A `Family` literal + `FAMILY_TO_STREAM` table selects the
driver, and a small `run_spec` match replaces the if-ladder.
- `ResultStreamSpec` + `RESULT_STREAMS`: suffix, fieldnames, analyzer,
uploader. `workflow_default` opens one CSV per spec and the four
parallel close/upload/artifact/analyze blocks become single loops.
The old `SUFFIX_ANALYZERS` is now a derived alias.
The scenarios themselves, the four `run_scenario_*` drivers, and the
`Scenario` ABC are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
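A minimal sketch of the scenario registry idea, with the lookup tables derived from one list. The field types, the example labels, family strings, and the placeholder factories are assumptions; only the field list (name, log label, family, factory, groups) and the derived `SCENARIOS_BY_NAME` / `SCENARIO_GROUPS` come from the description.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScenarioSpec:
    name: str
    label: str
    family: str
    factory: Callable[[], object]  # placeholder; real factories build scenarios
    groups: tuple[str, ...]


SCENARIOS = [
    ScenarioSpec("envd_objects_scalability_tables", "Envd objects (tables)",
                 "envd_objects", lambda: object(), ("envd_objects_scalability",)),
    ScenarioSpec("envd_objects_scalability_mvs", "Envd objects (MVs)",
                 "envd_objects", lambda: object(), ("envd_objects_scalability",)),
    ScenarioSpec("cluster_object_limits_indexes", "Object limits (indexes)",
                 "cluster_object_limits", lambda: object(), ("cluster_object_limits",)),
]

# Both lookup tables are derived, so adding a scenario touches one list only.
SCENARIOS_BY_NAME = {spec.name: spec for spec in SCENARIOS}

SCENARIO_GROUPS: dict[str, list[str]] = {}
for spec in SCENARIOS:
    for group in spec.groups:
        SCENARIO_GROUPS.setdefault(group, []).append(spec.name)
```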
The previous `Scenario` ABC pretended that all four scenario families shared a single setup/drop/materialize_views/run lifecycle, but only strong/weak/envd_qps actually used it. EnvdObjectsScalability and ClusterObjectLimits returned [] from every ABC method and were driven through entirely different protocols (init/add_objects/teardown and reset_for_cluster_size/probe_*/teardown respectively), with comments apologising that "framework-level setup/drop are unused".
Replace the single ABC with three real shapes:
* `ClusterScalingScenario` (renamed from `Scenario`) for the strong/weak/envd_qps families. `drop()` and `materialize_views()` now default to `[]` so the envd_qps subclasses no longer need empty overrides.
* `EnvdObjectsScalabilityScenario` becomes its own ABC with the methods it actually exposes; the unused `replica_size` constructor parameter is dropped.
* `ClusterObjectLimitsScenario` becomes a plain class (no inheritance) and the unused `replica_size` parameter is dropped.
`ScenarioSpec.factory` now returns the union `AnyScenario`, and `run_spec` narrows it per-family with isinstance asserts.
With ClusterObjectLimitsScenario no longer pretending to be a generic Scenario, the 220-line `run_scenario_cluster_object_limits` collapses: the `hydrate_and_sample` and `probe_and_record` closures and the N-walk + bisect loop move onto the scenario as `_hydrate_and_sample`, `_probe_and_record`, and `run_for_cluster_size`. The driver shrinks to the outer cluster-size loop plus ScenarioRunner construction and `_recreate_cluster_c` / `reset_for_cluster_size` / `teardown` bookkeeping.
Behaviour is unchanged: same 13 scenarios, same groups, same CSV output. Verified by direct module import and pyright/ruff/black.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three parallel scenario hierarchies (ClusterScalingScenario, EnvdObjectsScalabilityScenario, ClusterObjectLimitsScenario) and five run_scenario_* drivers collapsed onto one Scenario ABC with a prepare/scale_points/apply/measure/cleanup_point/teardown lifecycle. Strong/weak/envd-cpu/envd-objects become thin sweep wrappers around the existing inner workloads; ClusterObjectLimitsScenario implements Scenario directly. The result-stream choice now lives on each scenario via stream_key(), so Family, FAMILY_TO_STREAM, AnyScenario, RunContext and run_spec all go away. Also: shared _extend_incremental/_shrink_incremental helpers used by both envd_objects and cluster_object_limits scenarios, and the four per-kind workflow_plot_* entries collapse to one workflow_plot that dispatches by filename suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rop dead replica_size param
ClusterObjectLimitsScenario._probe_and_record was the only caller that bypassed runner.add_result and wrote rows directly via results_writer.writerow(...), because add_result didn't know about the extra `healthy` column. Extend add_result with optional `time_ms` (for values already in ms) and `healthy` kwargs so the cluster_object_limits path matches every other scenario. On streams whose schema doesn't include it, the `healthy` column is silently dropped by the existing extrasaction="ignore".
Also drop the `replica_size` constructor parameter from ScenarioRunner: every sweep wrapper passes None and mutates `runner.replica_size` in `apply()`. The param was dead weight. _probe_and_record's `replica_size` argument is dropped for the same reason.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
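The `extrasaction="ignore"` behavior that makes the shared `add_result` path safe can be seen with the standard library's `csv.DictWriter`. The fieldname lists below are shortened, hypothetical stand-ins for the real schemas, and `write_row` is a throwaway helper for this illustration.

```python
import csv
import io

CLUSTER_FIELDNAMES = ["scenario", "test_name", "time_ms"]  # shortened stand-in
LIMITS_FIELDNAMES = CLUSTER_FIELDNAMES + ["healthy"]


def write_row(fieldnames, row):
    """Write one header plus one row and return the resulting CSV lines."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue().splitlines()


row = {"scenario": "cluster_object_limits_mvs", "test_name": "lag",
       "time_ms": 12.5, "healthy": True}

# A stream without the `healthy` column silently drops the extra key ...
plain = write_row(CLUSTER_FIELDNAMES, row)
# ... while a stream whose schema includes it keeps the value.
limits = write_row(LIMITS_FIELDNAMES, row)
```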
Add mz_persist_state_apply_calls_by_source_shard_kind IntCounterVec
with labels [source, shard_kind]. Plumb a &'static str `source`
parameter through State::apply_encoded_diffs / apply_diffs /
apply_diff so each call site is attributed. Four runtime sources:
* cas_update — apply.rs Applier::fetch_and_update_state fast path
* slow_refetch — state_versions.rs::fetch_current_state full replay
* pubsub_push — cache.rs PubSub broadcast intake
* state_iter — state_versions.rs StateVersionsIter::next walks (GC,
usage audit, admin inspect)
Used to identify the source of the catalog/txns apply_diff count
explosion at N=10k in the envd-ddl-scalability investigation.
Ran the bench three times with the new source-labeled counter.
Every catalog apply_diff invocation we observe is attributable to
state_iter (StateVersionsIter::next, used by GC). The original
"335/DDL catalog apply_diff" mystery resolves cleanly: exactly one
GC fires during the 100-rep N=10k window and walks ~17,000 live
diffs accumulated on the catalog shard since the previous GC.
Cross-check confirms it: shard_gc_finished{catalog} delta is +1 in
the run-3 N=10k window, shard_gc_live_diffs gauge reads 17,229 at
that GC, and the state_iter source counter delta is also 17,229.
Same number, three independent counters, one event.
Slope impact: ~1.7 ms/DDL of catalog state-apply work (17k calls
times ~10 us each, divided across the 100-rep window). Background
work, so its critical-path contribution is just the Tokio-scheduler
tax — third-tier behind the two CAS-RPC slopes (catalog +7 ms,
txns +4 ms in this run).
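The slope estimate above is simple arithmetic; spelled out with the numbers quoted in this message (the ~10 us per-call cost is the approximation given above, not a separate measurement):

```python
live_diffs = 17_229     # state_iter counter delta at the single GC
cost_per_apply_us = 10  # approximate cost of one apply_diff call
reps = 100              # DDL repetitions in the N=10k window

# apply_diff calls attributed to each DDL when amortized over the window.
per_ddl_calls = live_diffs / reps

# Amortized state-apply time per DDL, in milliseconds.
per_ddl_ms = live_diffs * cost_per_apply_us / reps / 1_000

print(f"{per_ddl_calls:.0f} calls/DDL, {per_ddl_ms:.1f} ms/DDL")
```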
Run-to-run variance is high: across three runs, catalog apply_diff
at N=10k ranged from ~0 (no GC during the window) to 172/DDL (one
big GC). When GC didn't fire, this whole component was zero. The
previously-reported "+3.4 ms catalog state_apply slope" was from
a run that happened to coincide with a GC firing — real, but not
a persistent slope component.
The dominant slope remains catalog consensus_cas RPC time
(+1.0-1.9 ms per call x 5.6 calls = +6-11 ms/DDL). Microbench
already ruled out per-shard state size as the cause; next move is
to factor the CAS RPC into Tokio-scheduler-wait vs wire-time, to
confirm shared-runtime contention from the 10k user_data shards.
Add a new mz_persist_consensus_wire_seconds_by_shard_kind histogram recorded inside BogoConsensus around the inner bogo gRPC client call. It uses the same axes (op + shard_kind, identical buckets) as the existing mz_persist_external_op_latency_by_shard_kind, but is measured one layer deeper.
The existing external_op_latency_by_kind is recorded by MetricsConsensus around its run_op wrapper. That sits *inside* the Tasked spawn and *outside* the bogo gRPC client adapter. Subtracting the new wire metric from it gives "post-spawn wrapper overhead", leaving wire = client tonic send + HTTP/2 + server processing + return.
Wiring: BogoConsensusConfig grows an optional wire_timer callback (no-op for non-bogo backends, no impact on other consensus impls). PersistClientCache::open_consensus attaches a closure that derives shard_kind via the existing classifier and observes into the new HistogramVec. Other Consensus methods (head/scan/truncate) are wired up for consistency.
`update_state_metrics` was iterating every `Vec.len()` in the in-memory BTreeMap on every CAS to recompute `versions_total`. At 10k shards that's ~100 µs of work held under the same `std::sync::Mutex` that serializes every operation. With ~100 concurrent CAS in flight from envd's background work, the lock queue depth explodes and every CAS -- catalog, txns, user_data -- pays the queueing cost. This was eating the entire catalog `consensus_cas` slope we'd been chasing.
With the fix:
catalog wire mean: 0.77 ms (N=5k) -> 0.29 ms (N=10k) (was +1.02)
user_data wire mean: 6.68 ms -> 0.92 ms (was +28.3)
Both flat across the scale jump. The "catalog CAS grows with N" finding was a bench artifact, not a Materialize scaling problem.
Replace the scan with incremental counters: bump `shards_total` when a CAS creates a new key, bump `versions_total` by +1 on a successful CAS, decrement by the deletion count on truncate. The mutex is released before bumping so even the IntGauge.add isn't on the contended path. Constant time per op, no scan.
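The incremental-counter idea, transposed to Python for illustration. The real code is Rust and uses atomic Prometheus gauges; here the counters are plain ints, so this sketch is only safe single-threaded. The class name and the simplified `compare_and_set` (no expected-version check) are assumptions made to keep the example short.

```python
import threading


class ConsensusMetricsSketch:
    """In-memory consensus stand-in that maintains counters incrementally."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> list of versions
        self.shards_total = 0
        self.versions_total = 0

    def compare_and_set(self, key, value):
        with self._lock:
            is_new = key not in self._data
            self._data.setdefault(key, []).append(value)
        # Counters are bumped after the lock is released: O(1), no scan.
        if is_new:
            self.shards_total += 1
        self.versions_total += 1

    def truncate(self, key, keep_last):
        with self._lock:
            versions = self._data.get(key, [])
            deleted = max(len(versions) - keep_last, 0)
            if deleted:
                del versions[:deleted]
        self.versions_total -= deleted
        return deleted

    def scan_versions_total(self):
        """The O(shards) recomputation being removed; kept as a cross-check."""
        with self._lock:
            return sum(len(v) for v in self._data.values())
```

The scan-based method is retained here only so the incremental counter can be checked against it.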
Re-ran the bench against CRDB consensus instead of bogo to confirm the post-fix bogo conclusion holds on a real backend. It does: the Materialize-side slope (~+15 ms/+5k tables on create p50) reproduces nearly identically on CRDB. Bogo's flat CAS-mean wasn't hiding a backend-only problem.
CRDB adds two things on top:
- a flat ~+28 ms baseline tax from real consensus RPCs (~2 ms/CAS vs <0.5 ms on bogo) × ~12 CAS/DDL.
- a mild secondary CAS-mean slope (catalog 1.88 → 2.11 → 3.80 ms across 5k → 10k → 15k); counts are stable so this is per-RPC slowdown, plausibly index/plan growth on the consensus table.
Resource ceiling: envd hit 2.9 GiB RSS at N=15k, CRDB stayed under 1 GiB -- plenty of headroom on this box, can extend the curve if needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add `mz_catalog_transact_phase_seconds{phase=...}` histogram that
splits each Catalog::transact call into:
- transact_inner (super-timer for the inner method)
- op_loop (per-op transact_op + preliminary apply)
- final_apply_updates (post-loop combined apply on final state)
- prepare_state (storage_collections.prepare_state)
- post_prepare_apply_updates (final apply after prepare_state)
- tx_commit (durable tx.commit, the persist CAS path)
- assign_state (self.state = new_state)
The metric is owned by the `Coordinator`'s `Metrics` struct and
plumbed into `Catalog` as an `Option<HistogramVec>` via a new
`set_transact_phase_metrics()` setter, called once at coordinator
startup. `transact_incremental_dry_run` doesn't get the metric --
dry-run DDL-txn paths are deliberately excluded so they don't
pollute the foreground numbers.
Motivation: instrument the +9.87 ms/+5k slope on
`catalog_transact_with_ddl_transaction` so we can attribute it to
specific phases. See test/envd-ddl-scalability/NOTES.md (next
commit) for the breakdown.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-ran bogo bench at N=5k/10k/15k with the new mz_catalog_transact_phase_seconds metric. Time inside Catalog::transact accounts for only ~2.7 ms of the +9.0 ms 5k->10k slope; the other +6.3 ms is in the Coordinator wrapper layer (Arc::make_mut(catalog), builtin_table_updates execute, apply_catalog_implications, finalize). At 10k->15k it's even more lopsided: +4.7 inside vs +10.3 outside.
Inside the inner method, tx_commit is the biggest mover (~half of the inside-transact slope). prepare_state shows a 7x hockey-stick from 10k to 15k (0.19 -> 1.41 ms) -- the storage_controller side warrants its own look. op_loop and final_apply_updates grow modestly, matching the state-apply attribution from the prior iteration.
Next: instrument the outside wrapper layer (coord_arc_make_mut, coord_post_transact, etc.) to split that +6-10 ms/+5k outside slope into named pieces.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extends mz_catalog_transact_phase_seconds with six wrapper-layer labels so we can split the +6-10 ms/+5k "outside Catalog::transact" slope:
- coord_inner_total (entire method, cross-check)
- coord_pre_transact (entry -> just before catalog.transact)
- coord_arc_make_mut (just Arc::make_mut(catalog))
- coord_post_transact (just after catalog.transact -> return)
- coord_builtin_table_execute (builtin_table_update().execute())
- coord_finalize (the config/tracing finalize block)
`metrics.catalog_transact_phase_seconds` is cloned out before the existing destructure of `self` so we can keep using `self` for `builtin_table_update()` etc. afterwards.
Motivation: the prior commit showed +6-10 ms/+5k of slope lives outside Catalog::transact in the Coordinator wrapper. See test/envd-ddl-scalability/NOTES.md (next commit) for what the breakdown actually shows.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-ran the bogo bench at N=5k/10k/15k with wrapper-layer phase
instrumentation. Three confirmations and one ruling-out:
- coord_builtin_table_execute is the single biggest slope
component: +4.14 then +4.45 ms per +5k. At N=15k it's 16.85
ms/DDL, ~29% of total DDL latency.
- coord_pre_transact is flat (~3.2 ms) across scales -- the op
pre-walk + validate_resource_limits + write-ts grab don't
scale.
- apply_catalog_implications is roughly flat (~11-13 ms) -- big
but not the slope owner.
- Arc::make_mut(catalog) is ~0 ms at all scales: the catalog Arc
is uniquely held in the hot path, so the make_mut clone never
fires. Earlier hypothesis ruled out.
builtin_table_update().execute() is essentially a call to
Coordinator::group_commit(None).await. The size of
builtin_table_updates per DDL is small (a few rows in mz_objects
etc.), so the growth has to be inside group_commit itself -- next
target is to instrument the upper-advancement / table-append
phases there.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove the loop in `group_commit()` that iterated ALL catalog entries on every group commit to add empty advancement entries for every table. With ~20k+ tables, this was the dominant DDL bottleneck at ~23% of DDL time.
The txn-wal protocol makes explicit per-table advancement unnecessary: when any transaction commits to the txns shard, the logical upper of ALL registered data shards advances automatically, including those not involved in the transaction. The empty advancement entries were performing no useful work on the storage side.
Also removed the early-return optimization in PersistTableWriteWorker::append that skipped the txn-wal commit for empty updates. This ensures periodic group commits (with no actual data writes) still advance the txns shard upper, maintaining the property that table logical uppers advance even when no writes are happening.
Results at ~28k objects (optimized build):
- CREATE TABLE: 374ms → 131ms (65% faster)
- CREATE VIEW: ~96ms
- DROP TABLE: ~97ms
- DROP VIEW: ~86ms
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cute is now flat
Validation bench at N=5k/10k/15k after 5d2d138 (remove O(n) table advancement loop from group_commit). create_p50 dropped by 5.6 / 9.5 / 14.3 ms across the three scales; per-+5k slope reduction is 38% / 29%.
coord_builtin_table_execute itself is now essentially flat: mean per call goes 3.80 -> 3.91 -> 4.59 ms (was 8.26 -> 12.40 -> 16.85 pre-fix). Slope per +5k inside the timer drops from +4.14 / +4.45 ms to +0.11 / +0.68 ms.
The remaining ~+11 ms-per-+5k headline slope is now spread across transact_inner, tx_commit, and apply_catalog_implications. Largest single absolute cost per DDL is apply_catalog_implications at ~11.6 ms (flat). That's the next investigation target: a sub-phase split to attribute where the constant 25%-of-DDL cost lives.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
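The per-+5k slopes quoted for coord_builtin_table_execute follow directly from the per-call means at the three scales. A small check, with a hypothetical helper name:

```python
def per_step_slopes(means):
    """Return the difference between consecutive means, rounded to 0.01 ms."""
    return [round(b - a, 2) for a, b in zip(means, means[1:])]


# Mean ms per call at N=5k, 10k, 15k, as quoted above.
pre_fix = [8.26, 12.40, 16.85]
post_fix = [3.80, 3.91, 4.59]

pre_slopes = per_step_slopes(pre_fix)
post_slopes = per_step_slopes(post_fix)
```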
Adds a new HistogramVec `mz_apply_catalog_implications_phase_seconds` labeled by phase. apply_catalog_implications had only a single outer timer (~11.6 ms per call regardless of N) so we could not tell what fraction of that constant cost lives in each region.
Phases captured:
* absorb_updates: the implication batching loop at the top of apply_catalog_implications
* inner_total: the whole call into apply_catalog_implications_inner (split below)
* inner_item_loop: the `for (catalog_id, implication) in implications` loop that walks per-item implications
* inner_cluster_loops: the cluster + cluster_replica command loops
* inner_controller_setup: the post-loop calls into the controllers: create_source_collections, create_table_collections, initialize_storage_collections, vpc_endpoints, alter_*
* inner_dependency_scan: active-sink/peek/copy cleanup + timeline association rebuilds
* inner_finalize: the "no error returns allowed" async block: drops, retires, background secret/replication-slot cleanup
Same pattern as `mz_catalog_transact_phase_seconds`: histogram cloned locally, RAII timers via `start_timer()`. No behavioral change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
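For readers unfamiliar with the RAII timer pattern mentioned above, here is a minimal std-only sketch. `PhaseHistogram` and `PhaseTimer` are hypothetical stand-ins for the real HistogramVec and its timer type, and `apply_implications` is an illustrative function, not the coordinator's code. The only idea carried over is that a timer records its elapsed time under a phase label when it goes out of scope.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a HistogramVec keyed by phase label.
#[derive(Clone, Default)]
struct PhaseHistogram {
    samples: Arc<Mutex<BTreeMap<&'static str, Vec<Duration>>>>,
}

impl PhaseHistogram {
    /// Starts a timer that records into `phase` when dropped.
    fn start_timer(&self, phase: &'static str) -> PhaseTimer {
        PhaseTimer { hist: self.clone(), phase, start: Instant::now() }
    }

    /// Number of observations recorded for `phase`.
    fn count(&self, phase: &str) -> usize {
        self.samples.lock().unwrap().get(phase).map_or(0, |v| v.len())
    }
}

struct PhaseTimer {
    hist: PhaseHistogram,
    phase: &'static str,
    start: Instant,
}

impl Drop for PhaseTimer {
    fn drop(&mut self) {
        let elapsed = self.start.elapsed();
        let mut samples = self.hist.samples.lock().unwrap();
        samples.entry(self.phase).or_default().push(elapsed);
    }
}

/// Illustrative function with nested phases; each timer ends at its scope's end.
fn apply_implications(hist: &PhaseHistogram, implications: &[u64]) -> u64 {
    let absorbed: Vec<u64> = {
        let _t = hist.start_timer("absorb_updates");
        implications.to_vec()
    };
    let _total = hist.start_timer("inner_total");
    let mut sum = 0;
    {
        let _t = hist.start_timer("inner_item_loop");
        for i in &absorbed {
            sum += i;
        }
    }
    sum
}

fn main() {
    let hist = PhaseHistogram::default();
    let total = apply_implications(&hist, &[1, 2, 3]);
    println!("total = {total}, inner_total samples = {}", hist.count("inner_total"));
}
```

Because the timer observes on drop, early returns and `?` propagation are still measured, which is one reason the pattern suits code with many exit points.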
…cations Sub-phase split of mz_apply_catalog_implications_seconds at N=5k/10k/15k. inner_controller_setup (i.e. create_table_collections + initialize_storage_collections, etc.) is 84% of inner_total and carries ~93% of the slope. CREATE-only inner_controller_setup is 16.32 / 17.13 / 19.70 ms across scales. That's the single call into controller.storage.create_collections opening a fresh persist WriteHandle + SinceHandle for the new table shard, then a compare_and_downgrade_since on it. DROP-only inner_finalize is ~3.6 ms — drop_tables -> txn-wal append. Mostly flat. Everything else (absorb_updates, item loop, cluster loops, dependency scan) is microseconds in this workload and will not matter until we have real user-cluster activity. Next: split create_table_collections / create_collections into phases (write-ts grab, advance_upper, open handles, downgrade-since, install collection state) to attribute the 20 ms CREATE cost. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add four sub-phase labels to mz_apply_catalog_implications_phase_seconds:
- create_table_write_ts (get_local_write_ts)
- create_table_advance_upper (catalog.advance_upper)
- create_table_storage_create_collections (controller.storage.create_collections)
- create_table_apply_local_write (apply_local_write)
inner_controller_setup is 16+ ms per CREATE at N=15k and carries the dominant slope inside apply_catalog_implications. This split tells us which of the four steps is responsible.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sub-phase split of create_table_collections at N=5k/10k/15k. Inside apply_catalog_implications's inner_controller_setup (16-22 ms per CREATE), the four named steps break down as:
- storage.create_collections: 8.16 -> 8.98 -> 11.57 ms (slope +0.82 / +2.59)
- get_local_write_ts: 3.27 -> 3.62 -> 3.61 ms (flat ~3.6)
- apply_local_write: 2.88 -> 3.12 -> 3.29 ms (flat ~3)
- catalog.advance_upper: 0.56 -> 0.61 -> 0.71 ms (small slope)
storage.create_collections is both the dominant absolute cost AND the dominant slope owner (74% of inner_controller_setup slope). Next target: split storage_collections::create_collections_for_bootstrap into open_data_handles, compare_and_downgrade_since, and the collection-install loop. Existing info_span!s mark the boundaries.
Secondary finding: CREATE TABLE makes two get_local_write_ts calls (one in catalog_transact_inner, one in create_table_collections), each ~3 ms. Worth a design review of whether the second is required.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…strap
Split storage.create_collections (the slope owner inside CREATE TABLE controller setup, identified in commit 7491526) into two layers of sub-phases:
storage_collections.create_collections_for_bootstrap:
- validate_and_enrich
- open_persist_client
- open_data_handles_concurrent (buffer_unordered 50)
- sort
- install_collection_states (under collections mutex)
- synchronize_finalized_shards
storage_controller.create_collections_for_bootstrap:
- storage_collections_call (inner call to the above)
- validate_and_enrich
- open_persist_client
- open_data_handles_concurrent (the second buffer_unordered, controller's own per-collection write handle)
- register_loop (per-collection for loop with acquire_read_holds)
- init_source_statistics
- table_register (persist_table_worker.register batched call)
- append_shard_mappings
- run_to_execute
Two new HistogramVecs registered:
- mz_storage_collections_create_collections_phase_seconds
- mz_storage_controller_create_collections_phase_seconds
Both keyed on phase label. Will use to attribute the 11.57 ms / +2.59 ms per +5k slope inside controller.storage.create_collections at N=15k.
… is in Catalog::transact
Phase 7 ran the new mz_storage_collections / mz_storage_controller create_collections_phase_seconds histograms at N=5k/10k/15k. Findings:
1. storage.create_collections is flat at ~9 ms per CREATE TABLE across all three scales. The phase 6 measurement (11.57 ms at N=15k with +2.59 ms per +5k slope) was run-to-run variance on a single data point, not a real slope.
2. Inside storage_collections::create_collections_for_bootstrap the only sub-phase with a visible slope is install_collection_states (0.19 -> 0.53 -> 1.54 ms). That's the post-stream loop running under the collections mutex; the work is BTreeMap inserts plus channel sends. Probably not worth fixing at current N.
3. open_data_handles_concurrent (the buffer_unordered stream that opens the SinceHandle + WriteHandle pair) is the biggest absolute cost at ~5 ms/CREATE but it's flat. The per-shard persist work doesn't grow with the number of already-registered shards.
4. apply_catalog_implications.inner_controller_setup is also flat now (9.43 -> 9.44 -> 9.72 ms). Phase 5's slope claim was either fixed between then and now, or also variance on a single N=15k point.
Where the slope actually lives now: Catalog::transact, specifically transact_inner (+2.5 ms/+5k), tx_commit (+1.5 ms/+5k), op_loop (+0.8 ms/+5k), and the apply_updates family (+1.8 ms/+5k combined). coord_inner_total is +5.48 ms per +5k tables per call; doubled across CREATE+DROP that's +11 ms/+5k per DDL, which fully accounts for the observed create_p50 slope of +9-10 ms/+5k.
Next iteration target: instrument tx_commit and the StateUpdate appliers inside Catalog::transact.
Adds two HistogramVecs to drill into tx_commit, which phase 7 attributed
+1.51 ms / +5k slope to:
mz_catalog_commit_transaction_phase_seconds{phase}
- caa_fence_check
- caa_encode
- caa_persist_caa_inner (the inner compare_and_append_inner wrapper)
- caa_persist_compare_and_append (the actual persist write_handle.CaA)
- caa_since_downgrade (since handle maybe_compare_and_downgrade_since)
- caa_post_sync (the sync(next_upper) call after CaA)
mz_catalog_sync_phase_seconds{phase}
- listen_fetch (listen.fetch_next, summed across iterations)
- apply_updates (apply_updates, summed across timestamps)
- consolidate (maybe_consolidate + final consolidate)
sync_phase_seconds aggregates across all iterations within a single
sync_inner call, so each tx_commit produces one sample per phase, not
one per listen event.
Will pin down whether the +1.51 ms / +5k tx_commit slope lives in the
persist CaA, the post-CaA sync, or the in-memory consolidate.
Phase 8 attributed the +7 ms/+5k slope in tx_commit at N=5k -> N=10k
to mz_catalog_sync_phase_seconds{phase="consolidate"}:
N=5k: 0.64 ms/call x 6 calls/DDL = 3.82 ms/DDL
N=10k: 1.87 ms/call x 6 calls/DDL = 11.20 ms/DDL
N=15k: 2.39 ms/call x 6 calls/DDL = 14.37 ms/DDL
Root cause: sync_inner ended with an unconditional consolidate(), which
does O(N log N) work on the entire snapshot. For the steady-state case
(one timestamp per sync, ~5-10 new entries added to a 15k-entry snapshot)
this ran on every DDL commit, paying the full sort + dedup cost for a
trivial delta.
The doubling-threshold maybe_consolidate (added in MaterializeInc#36233) was supposed
to amortize this, but it never triggered: sync_inner reset
size_at_last_consolidation to None at the top of every call, so the
threshold was always re-baselined against the current snapshot size.
A single DDL never grows the snapshot by 2x, so maybe_consolidate
inside the loop did nothing — and then the unconditional consolidate
at the end paid the full O(N log N) cost.
Two changes:
1. Drop the size_at_last_consolidation = None reset at the top of
sync_inner. The doubling threshold is meant to amortize across the
snapshot's lifetime, not per sync_inner invocation.
2. Replace the unconditional self.consolidate() at the end with
self.maybe_consolidate(). Combined with the per-ts maybe_consolidate
inside the loop, this keeps memory bounded at 2x the last
consolidated size while making per-call cost amortized O(log N)
instead of O(N log N).
Verified existing tests still pass:
test_persist_sync_consolidation_not_quadratic ok
test_persist_sync_snapshot_stays_bounded_under_churn ok
The "stays bounded under churn" test (200 renames of one DB) still passes
because the persistent threshold + per-ts maybe_consolidate fires every
~log(N) steps. The "not quadratic" test still passes because total
consolidations during a 100-ts sync stay well under the test's bound of 10.
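The doubling-threshold behaviour described above can be sketched on a toy snapshot. This is a simplified illustration under stated assumptions: `Snapshot`, its fields, and `sync` are hypothetical, entries are plain `(key, diff)` pairs, and the real `sync_inner` does far more. The point shown is that when the baseline persists across sync calls, small deltas on a large snapshot do not trigger a full sort.

```rust
/// Hypothetical simplified snapshot of (key, diff) updates.
struct Snapshot {
    entries: Vec<(u64, i64)>,
    /// Persisted across sync calls; NOT reset at the top of each sync.
    size_at_last_consolidation: Option<usize>,
    /// Counts full consolidations, for illustration only.
    consolidations: usize,
}

impl Snapshot {
    fn new() -> Self {
        Snapshot { entries: Vec::new(), size_at_last_consolidation: None, consolidations: 0 }
    }

    /// Full O(N log N) pass: sort, sum diffs per key, drop zero diffs.
    fn consolidate(&mut self) {
        self.entries.sort_by_key(|&(k, _)| k);
        let mut out: Vec<(u64, i64)> = Vec::with_capacity(self.entries.len());
        for &(k, d) in &self.entries {
            if let Some((last_k, last_d)) = out.last_mut() {
                if *last_k == k {
                    *last_d += d;
                    continue;
                }
            }
            out.push((k, d));
        }
        out.retain(|&(_, d)| d != 0);
        self.entries = out;
        self.size_at_last_consolidation = Some(self.entries.len());
        self.consolidations += 1;
    }

    /// Consolidate only once the snapshot exceeds 2x its last consolidated size.
    fn maybe_consolidate(&mut self) {
        let threshold = self.size_at_last_consolidation.map_or(0, |s| s.saturating_mul(2));
        if self.entries.len() > threshold {
            self.consolidate();
        }
    }

    /// One sync call absorbing a small delta.
    fn sync(&mut self, delta: &[(u64, i64)]) {
        self.entries.extend_from_slice(delta);
        self.maybe_consolidate();
    }
}

fn main() {
    let mut snap = Snapshot::new();
    let initial: Vec<(u64, i64)> = (0..15_000u64).map(|k| (k, 1)).collect();
    snap.sync(&initial); // first sync consolidates once and sets the baseline
    for step in 0..100u64 {
        let base = 15_000 + step * 5;
        let delta: Vec<(u64, i64)> = (base..base + 5).map(|k| (k, 1)).collect();
        snap.sync(&delta); // small deltas stay under the 2x threshold
    }
    println!("entries = {}, consolidations = {}", snap.entries.len(), snap.consolidations);
}
```

If `sync` instead set `size_at_last_consolidation` back to `None` on entry, every call would see a threshold of zero and pay the full sort, which mirrors the behaviour the commit removes.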
Phase 8 split tx_commit into commit_transaction phases + sync phases.
Found mz_catalog_sync_phase_seconds{phase="consolidate"} at:
N=5k: 3.82 ms/DDL
N=10k: 11.20 ms/DDL
N=15k: 14.37 ms/DDL
Phase 9 (post-fix in commit 00d31c5) shows consolidate flat at 0 ms
across all N, tx_commit per-call flat at ~1.13 ms (was 2.92 -> 5.54).
create_p50 at N=15k dropped from 48.93 to 44.51 ms (-4.42 ms);
slope at 10k->15k dropped from +9.11 to +5.81 ms/+5k (35% reduction).
The residual slope has moved entirely to in-memory state-apply paths
(transact_inner outer, op_loop, apply_updates family). Next iteration
target: profile CatalogState::apply_updates for the per-update walk
that still scales with N.
Phase 9 confirmed the catalog `consolidate` slope is gone. The residual
+5.81 ms/+5k slope at 10k→15k is still inside Catalog::transact_inner,
specifically in `apply_updates` and family. We have no visibility into
which sub-step of apply_updates owns that slope.
Add two new histograms:
* mz_catalog_apply_updates_phase_seconds{phase}
One observation per apply_updates call, per sub-phase:
- consolidate_initial (the per-call consolidate_updates)
- sort_per_group (sort_updates per timestamp group)
- apply_updates_inner (the kind-dispatch loop)
- cleanup_notices (drop_optimizer_notices + pack)
* mz_catalog_apply_update_kind_seconds{kind}
One observation per StateUpdate inside apply_updates_inner,
labeled by StateUpdateKind variant (item, schema,
storage_collection_metadata, etc.). The 1e-5 s (10 us) lower bucket
lets us see individual updates down to roughly 10 us so per-kind
contributions add up correctly across hundreds of updates per DDL.
Strong working hypothesis for the slope: `Arc::make_mut` on
`storage_metadata` (whose inner `collection_metadata` is a non-persistent
`BTreeMap<GlobalId, ShardId>`) does a full O(N) clone every time the
Arc is shared between `preliminary_state` and `state` Cow's — which
happens twice per DDL through the op_loop + final_apply_updates path.
At N=15k that's ~5 ms of pure clone work, matching the observed slope.
The next phase will run the bench and confirm or refute this from the
per-kind data.
Phase 10's per-kind apply_updates instrumentation pointed at `StateUpdateKind::Item` as the dominant slope owner (+850 us per call per +10k tables, on 4 calls per DDL). The work that scales with N is inside `apply_item_update` / `insert_entry` / `drop_item`, which call `self.get_schema_mut(...)` to walk `database_by_id.get_mut(...).schemas_by_id.get_mut(...)`.
`schemas_by_id` is `imbl::OrdMap<SchemaId, Schema>`, which is shared between `preliminary_state` and `state` Cow's in `transact_inner`. `get_mut` on a shared `imbl::OrdMap` path-copies the affected B-tree leaf and **clones every value in the leaf**, not just the targeted one. Schema embedded three non-persistent maps:
    pub items: BTreeMap<String, CatalogItemId>,
    pub functions: BTreeMap<String, CatalogItemId>,
    pub types: BTreeMap<String, CatalogItemId>,
At N=15k the audit_pad schema's `items` has 15k entries. The B-tree leaf containing the materialize-database schemas almost certainly fits in one imbl chunk, so any `apply_item_update` (even mutating audit_meas, not audit_pad) leaf-copies audit_pad's Schema and clones its 15k-entry BTreeMap. That's the O(N) memcpy+tree-build per call that the phase 10 per-kind metric attributed to `item`.
Switching items/functions/types to `imbl::OrdMap<String, CatalogItemId>` makes the per-leaf-clone path O(1) (refcount bump on a persistent tree root). All call sites use only operations common to both types (get / insert / remove / contains_key / is_empty / len / values / iter), so it's a drop-in. The fn pointer signature in `CatalogState::resolve` is updated to match.
Phase 11 bench follow-up will validate that `item` mean per call flattens with N.
…ections
Phase 10's per-kind apply_updates instrumentation showed `StateUpdateKind::StorageCollectionMetadata` had a clean +200 us/call slope at +5k tables (2 calls per DDL, ~+0.4 ms/DDL).
`StorageMetadata` lives behind `Arc<StorageMetadata>` on `CatalogState`. The `preliminary_state`/`state` Cow pattern in `Catalog::transact_inner` shares this Arc, so the first `Arc::make_mut(storage_metadata)` per independently-owned `CatalogState` deep-clones its fields:
    pub collection_metadata: BTreeMap<GlobalId, ShardId>,
    pub unfinalized_shards: BTreeSet<ShardId>,
With N=15k tables, `collection_metadata` has ~15k entries, and the `BTreeMap` clone is O(N). At 2 make_mut'd `CatalogState`s per DDL, that's two full clones per DDL.
Switch both to imbl::OrdMap / imbl::OrdSet so the clone is O(1) (persistent tree refcount bump). All external callers use only operations common to both (.get(), .contains(), .iter()). The only local API divergence: `imbl::OrdSet::insert/remove` return `Option<T>`, not `bool`; `apply_unfinalized_shard_update` is adjusted accordingly.
Same reasoning as ad197b0 (Schema.items/functions/types), applied to the storage-side analogue.
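The `Arc::make_mut` behaviour behind this slope can be reproduced with the standard library alone. The sketch below uses a hypothetical `StorageMetadataSketch` with `u64` keys and values in place of `GlobalId`/`ShardId`; it only illustrates that the first mutation through a shared `Arc` clones the whole inner value (an O(N) `BTreeMap` clone here), while later mutations through a uniquely owned `Arc` happen in place.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical stand-in for StorageMetadata with a non-persistent inner map.
#[derive(Clone, Default)]
struct StorageMetadataSketch {
    collection_metadata: BTreeMap<u64, u64>,
}

/// Builds metadata with `n` entries.
fn build(n: u64) -> StorageMetadataSketch {
    let mut meta = StorageMetadataSketch::default();
    for id in 0..n {
        meta.collection_metadata.insert(id, id);
    }
    meta
}

/// Inserts through Arc::make_mut and reports whether the Arc had to clone
/// its contents (detected by the allocation address changing).
fn insert_via_make_mut(handle: &mut Arc<StorageMetadataSketch>, id: u64) -> bool {
    let before = Arc::as_ptr(handle);
    Arc::make_mut(handle).collection_metadata.insert(id, id);
    before != Arc::as_ptr(handle)
}

fn main() {
    let state = Arc::new(build(15_000));
    // Sharing is a refcount bump, analogous to the preliminary_state/state split.
    let mut preliminary = Arc::clone(&state);

    // First mutation on the shared handle deep-clones the 15k-entry map.
    let cloned_first = insert_via_make_mut(&mut preliminary, 15_000);
    // The handle is now uniquely owned, so the next mutation is in place.
    let cloned_second = insert_via_make_mut(&mut preliminary, 15_001);

    println!("first mutation cloned: {cloned_first}, second cloned: {cloned_second}");
    println!(
        "state len = {}, preliminary len = {}",
        state.collection_metadata.len(),
        preliminary.collection_metadata.len()
    );
}
```

With a persistent map type in place of `BTreeMap`, the struct clone triggered by `make_mut` would share tree structure rather than copying every entry, which is the change the commit describes.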
Phase 10 instrumentation (00025cb) split apply_updates into sub-phases + per-StateUpdateKind, attributing the +1.94 ms/+5k apply_updates_inner slope at 10k→15k to two non-persistent collections living behind shared imbl/Arc handles:
1. Schema.items/functions/types: BTreeMap<String, CatalogItemId>, cloned via imbl::OrdMap leaf path-copy in get_schema_mut. At N=15k the audit_pad schema's items had 15k entries; every apply_item_update clone cost O(N).
2. StorageMetadata.{collection_metadata, unfinalized_shards}: BTreeMap/BTreeSet behind Arc<StorageMetadata>. The preliminary_state/state Cow split in transact_inner forces Arc::make_mut to deep-clone these on each owned CatalogState.
Phase 11 (ad197b0) and phase 12 (4b6f5d1) swap both to their imbl persistent counterparts. Results at N=15k:
- item kind: 1436 -> 242 us/call (-83%)
- storage_collection_metadata kind: 567 -> 5.12 us/call (-99%)
- apply_updates_inner slope (10->15k): +1.94 -> +0.18 ms/DDL
- create_p50 (15k): 48.93 (phase 7) -> 42.84 (phase 12), -6.09 ms
NOTES.md captures the per-phase tables and writes up the generalizable pattern: inline value types stored inside an imbl::OrdMap (or behind shared Arc) silently lose the O(1) clone property to any non-persistent sub-collection field.
…ue pattern
Read-only audit of every imbl::OrdMap in the catalog/controller hot
paths, looking for the same "outer is persistent, inner is not"
shape that drove phases 11 and 12.
Findings, ranked:
HIGH (same shape, same hot path, drop-in fix):
- Database.{schemas_by_id, schemas_by_name}: BTreeMap inside the
Database value held in imbl::OrdMap<DatabaseId, Database>.
get_schema_mut is on the path of every apply_item_update. Worth
landing next.
MEDIUM (workload-dependent, not exercised by the audit_pad bench):
- Cluster.bound_objects: BTreeSet<CatalogItemId> grows with every
object bound to that cluster. Invisible in this bench (plain
tables don't bind to a cluster), but expected to slope under MV /
index workloads on a single cluster.
- Cluster.replica_id_by_name_ / replicas_by_id_ / log_indexes:
small in practice.
LOW (small or workload-specific):
- Role.{vars.map, membership.map}: per-role counts are small.
- SourceReferences.references: grows with one source's refs, not N.
- CatalogEntry.{referenced_by, used_by}: small per entry, but the
16-entry imbl leaf clone of entry_by_id also re-clones
CatalogItem (incl. optimized/physical plans for MV/Index/CT) for
sibling entries; matters under MV scale, not table scale.
- notices_by_dep_id: value is Vec<Arc<_>>, shallow clone is cheap.
Recommended next step: land the Database BTreeMap -> imbl::OrdMap
swap, then design a real MV/index scale bench before considering
the MEDIUM tier. The existing
mz_catalog_apply_update_kind_seconds{kind} histogram is the
canonical signal for whether each tier is worth fixing.
…istent collections Phase 11+12 fixed the two non-persistent inner collections that the audit_pad bench exposed (Schema.items and StorageMetadata.collection_metadata). This commit applies the same fix to the remaining sites identified by the read-only sweep in 351ddb9: Database.{schemas_by_id, schemas_by_name}: BTreeMap -> imbl::OrdMap Cluster.{bound_objects, replica_id_by_name_, replicas_by_id_}: BTreeMap/BTreeSet -> imbl::OrdMap/imbl::OrdSet RoleMembership.map: BTreeMap -> imbl::OrdMap RoleVars.map: BTreeMap -> imbl::OrdMap SourceReferences.references: Vec -> imbl::Vector All of these live inside imbl::OrdMap<K, V> on CatalogState (or transitively inside such a value type), so the preliminary_state/state Cow split in Catalog::transact_inner forces imbl leaf path-copies to deep-clone them. The audit_pad bench doesn't exercise these paths (plain tables don't touch clusters / roles / sources), but the same shape that produced the +5 ms/DDL slopes in phases 11+12 would surface on cluster-heavy / role-heavy workloads. Two intentional non-changes: * Cluster.log_indexes stays BTreeMap<LogVariant, GlobalId>: tight API contract with the compute controller (arranged_logs: BTreeMap<...>) and bounded by ~10 log variants. * CatalogEntry.{referenced_by, used_by} stay Vec<CatalogItemId>: the mz_sql::catalog::CatalogItem trait returns &[CatalogItemId] for them, which imbl::Vector can't satisfy (no Deref<[T]>). Per-entry counts are small and the dominant per-entry clone cost during entry_by_id leaf path-copy is item: CatalogItem, not these vectors. Trait signatures in mz_sql::catalog::{CatalogDatabase, CatalogRole, CatalogCluster} change to match (&imbl::OrdMap / &imbl::OrdSet). The per-field "why imbl" comments on Schema and StorageMetadata are removed and consolidated into a single rule-block doc comment on CatalogState explaining the pattern, the two slopes that motivated it, and the rule for new fields. Two intentional holdouts are called out there as well.
…on tables The phase 13 sweep (1a8446c) switched five more inline-in-imbl::OrdMap collections off of BTreeMap/BTreeSet/Vec, but the audit_pad bench (plain CREATE/DROP TABLE) doesn't exercise the cluster, role, or source paths — so the expected outcome was "no measurable change, no regression." That's what we got. item @ N=15k: 242 -> 238 us/call. storage_collection_metadata unchanged at ~5 us/call. apply_updates_inner total at N=15k flat at 1.48 ms/DDL. Run-to-run noise dominates. The sweep is landmine-prevention for cluster-heavy / role-heavy / source-heavy workloads where the same leaf-clone pattern would surface a slope on those DDL kinds. A future scale bench targeting CREATE INDEX / MV / GRANT is the right way to actually measure those payoffs; the existing per-kind histogram is the canonical signal.
Adds mz_storage_collections_prepare_state_phase_seconds{phase} so we can
attribute the prepare_state slope (currently 2.4 ms/call at N=15k user
tables, growing roughly linearly) to its sub-phases.
Sub-phases:
- insert_add, insert_register, delete: txn mutations.
- dropped_shard_lookup: self.collections.lock() acquire + the loop
over dropped_mappings.
- insert_unfinalized: txn.insert_unfinalized_shards.
- mark_finalized: self.finalized_shards.lock() + txn.mark_shards_as_finalized.
Companion to mz_catalog_transact_phase_seconds (outer) and
mz_apply_catalog_implications_phase_seconds.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the txns shard upper advances, BackgroundTask::run propagates the new upper to every txns-backed user table by calling self.update_write_frontiers(...) with one entry per table. The previous implementation held self.collections (the single global mutex shared with every DDL path, including catalog prepare_state) for the full O(N) walk. At N=15k user tables that meant one txns-upper tick = ~10+ ms of held mutex, plus the BackgroundTask staying CPU-bound while it held it. Phase-level bench (mz_storage_collections_prepare_state_phase_seconds) attributed the slope to exactly this: dropped_shard_lookup (which only does self.collections.lock() + an empty for-loop in our CREATE TABLE workload) went from 1.6 µs/call at N=5k to 599 µs/call at N=10k, matching the OUTER prepare_state mean (24 -> 625 µs). Process updates in chunks of 256, releasing the lock and yielding the task between chunks. The fix mostly helps other phases that compete with the BackgroundTask for CPU and lock access: at N=10k the end-to-end create_p50 drops by 11 ms (63.18 -> 52.04), driven by: - coord_builtin_table_execute -3.58 ms/DDL - coord_pre_transact -1.87 ms/DDL - tx_commit -1.56 ms/DDL prepare_state itself gets slightly worse (the per-chunk lock dance costs a few µs and the BackgroundTask gets to re-win the lock more often), but the BackgroundTask no longer monopolizes either lock or CPU for the whole walk, and the DDL pipeline catches up faster. A future architectural fix would store the shared txns upper in one place on StorageCollectionsImpl and skip the O(N) propagation entirely; that's larger and touches every reader of write_frontier on a txns-backed collection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
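The chunked propagation described above can be sketched with a plain `std::sync::Mutex`. This is an illustration under assumptions, not the PR's code: the real BackgroundTask is async and yields the task rather than a thread, and `update_write_frontiers` here is a hypothetical free function over a `BTreeMap<u64, u64>` of collection id to write frontier. It shows the lock being taken once per 256-entry chunk instead of once for the whole O(N) walk.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;
use std::thread;

/// Chunk size quoted in the commit message.
const CHUNK: usize = 256;

/// Applies (collection id, new upper) updates in chunks, releasing the lock
/// and yielding between chunks so other lock users can make progress.
/// Returns the number of lock acquisitions, for illustration.
fn update_write_frontiers(
    collections: &Mutex<BTreeMap<u64, u64>>,
    updates: &[(u64, u64)],
) -> usize {
    let mut lock_acquisitions = 0;
    for chunk in updates.chunks(CHUNK) {
        {
            let mut guard = collections.lock().unwrap();
            lock_acquisitions += 1;
            for (id, new_upper) in chunk {
                if let Some(upper) = guard.get_mut(id) {
                    // Frontiers only move forward.
                    *upper = (*upper).max(*new_upper);
                }
            }
        } // guard dropped here, before yielding
        thread::yield_now();
    }
    lock_acquisitions
}

fn main() {
    let collections: Mutex<BTreeMap<u64, u64>> =
        Mutex::new((0..10_000u64).map(|id| (id, 0u64)).collect());
    let updates: Vec<(u64, u64)> = (0..10_000u64).map(|id| (id, 7)).collect();
    let acquisitions = update_write_frontiers(&collections, &updates);
    println!("lock acquired {acquisitions} times for {} updates", updates.len());
}
```

The trade-off matches the one noted in the commit: each chunk pays a small lock re-acquisition cost, in exchange for bounding how long any single hold of the shared mutex lasts.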
NOTES writeup for phase 14:
- mz_storage_collections_prepare_state_phase_seconds attributed the prepare_state slope to dropped_shard_lookup (self.collections.lock acquire), confirming BackgroundTask::update_write_frontiers as the contention source.
- The chunked unlock in update_write_frontiers ended up being a CPU yield win, not the lock-wait win the source comment suggested: prepare_state's own contention got slightly worse (lock barging by the BackgroundTask), but other phases that compete with it for CPU caught up much faster.
- Net same-run effect at N=10k: create_p50 63.18 -> 52.04 (-11 ms), coord_inner_total 37.59 -> 29.52 (-8 ms).
- prepare_state slope is not flat; a future architectural pass needs to store the shared txns upper in one field rather than fanning it out to every txns-backed collection's write_frontier on every tick.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
For running spec sheet