Skip to content

fix(container-runtime): block local submits during stashed-op apply#27297

Open
anthony-murphy wants to merge 19 commits into
microsoft:mainfrom
anthony-murphy-agent:debug-pending-state
Open

fix(container-runtime): block local submits during stashed-op apply#27297
anthony-murphy wants to merge 19 commits into
microsoft:mainfrom
anthony-murphy-agent:debug-pending-state

Conversation

@anthony-murphy
Copy link
Copy Markdown
Contributor

@anthony-murphy anthony-murphy commented May 13, 2026

Summary

  • PendingStateManager exposes a one-way apply lifecycle (notStartedapplyingended). The window opens eagerly in the constructor when stashed state is present and closes the first time applyStashedOpsAt drains initialMessages successfully. An apply error leaves the lifecycle at applying.
  • ContainerRuntime.isReadOnly() aggregates pendingStateManager.isApplyingStashedOps and fans out via notifyReadOnlyState, so compliant DDSes skip submits during stashed-op replay.
  • ContainerRuntime.submit() always emits a SubmitDuringStashedOpApply error event on a bypass during the apply window, then throws a fatal UsageError unless the kill switch Fluid.ContainerRuntime.DisableSubmitDuringStashedApplyThrow is set. BlobAttach and IdAllocation are allowlisted as legitimate runtime-internal traffic during apply. Checking at submit time (rather than at onFlushBatch) closes the gap where a deferred flush would land after the apply window closes and mask the misuse.
  • ContainerRuntime.replayPendingStates asserts that the apply window is not open. The only real caller is setConnectionStateCore's canSendOps edge, and the loader awaits applyStashedOpsAt before transitioning to a write-capable connection — the assert surfaces a contract violation rather than silently swallowing the edge.

Background

When a host calls loadContainer with pendingLocalState, the runtime replays stashed ops via applyStashedOpsAt. If a DDS submits a local op during that replay — typically a realize-time write that doesn't consult readOnly — the new op has no counterpart in the saved-op stream and the next saved-op replay throws "pending local message content mismatch". This change makes the runtime effectively readonly for the duration of the replay and converts the bypass case into an immediate, attributable failure, with a kill switch for production rollout safety.

Risks & tradeoffs

  • baseLogger reshape (incidental). ContainerRuntime.baseLogger is now built via createChildLogger with always-on error properties (inStagingMode, isApplyingStashedOps, isReadOnly). Every error event flowing through baseLogger — including downstream consumers — now carries these tags. Behavior change for telemetry pipelines reading raw baseLogger output.
  • Kill switch present. Fluid.ContainerRuntime.DisableSubmitDuringStashedApplyThrow is default-off (throw + close on bypass). When set, the UsageError is still constructed and surfaced as an error event but is not thrown — preserves attribution while leaving an off-switch if a first- or third-party DDS in production quietly bypasses the readonly gate.

Test plan

  • PSM lifecycle tests cover eager open, no-op without stashed state, full-drain close, stay-open across partial drains, the close hook firing exactly once across repeated applyStashedOpsAt calls, and lifecycle staying "applying" when the final apply throws.
  • ContainerRuntime submit-during-apply tests: FluidDataStoreOp throws + logs error event, BlobAttach allowlisted (no throw), kill switch suppresses throw and still logs, no throw outside the apply window.
  • Existing PSM "has both pending messages and initial messages" test updated to push directly to the private queue (the prior flush-while-applying path is now a usage error).
  • Container-runtime mocha suite: 972 passing locally, biome format clean, eslint clean.

🤖 Generated with Claude Code — signed: Claude (Opus 4.7)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (585 lines, 5 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Comment thread packages/runtime/container-runtime/src/containerRuntime.ts Outdated
Comment thread packages/runtime/container-runtime/src/pendingStateManager.ts
anthony-murphy and others added 2 commits May 12, 2026 19:20
PendingStateManager now exposes a one-way apply lifecycle that opens
eagerly when stashed state is present and closes when initialMessages
drains. ContainerRuntime aggregates this into isReadOnly() and fans it
out via notifyReadOnlyState so compliant DDSes skip submits during
stashed-op replay. If a DDS bypasses the readonly gate and submits
anyway, submit() throws a fatal UsageError at the call site rather than
letting a poison entry race the saved-op replay and mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
isReadOnly() now uses optional-chain on pendingStateManager since the
baseLogger evaluates this lazy property in a constructor window before
the PSM is assigned, which would throw TypeError on error telemetry.

applyStashedOpsAt() now tracks loop completion explicitly; a failure on
the last stashed op no longer flips the lifecycle to "ended" (the
failing message is shifted off initialMessages before apply runs, so
the queue otherwise looks drained). New regression test covers the
failure-on-last-op case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ings

ESLint rule flagged indented continuation lines inside two block comments
on PendingStateManager. Reflowed both: the hooks-interface comment uses
separate paragraphs instead of a bulleted list with indented continuations,
and the lifecycle comment runs as flat prose instead of indented bullets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts Outdated
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
…ger too

The baseLogger now stamps isReadOnly on every error event. During
construction, error events fire (e.g. layer-compat failures via
validateLoaderCompatibility) before _deltaManager is assigned, so the
prior fix's optional chain on pendingStateManager wasn't enough — the
deltaManager.readOnlyInfo read threw TypeError and masked the layer
incompat error, breaking runtimeLayerCompatValidation tests in CI.

Switching to this._deltaManager?.readOnlyInfo lets the lazy property
return false (well-defined) during the partial-construction window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts Outdated
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
- Remove `onBeforeFirstStashedOpApply` hook entirely. It was a no-op at
  the only call site (channelCollection is not yet constructed when the
  PSM constructor runs), and the misleading comment risked masking a
  future regression. PSM still flips `_applyLifecycle` eagerly; no fanout
  fires until the close hook.
- Allowlist `BlobAttach` and `IdAllocation` message types in the
  submit-during-apply guard. `BlobAttach` is produced by
  `sharePendingBlobs`, which runs before `applyStashedOpsAt` resolves.
  `IdAllocation` is already excluded by a downstream assert but is
  allowlisted here defensively.
- Add `Fluid.ContainerRuntime.DisableSubmitDuringStashedApplyThrow` kill
  switch. The UsageError is always constructed; when the kill switch is
  enabled it is sent as a `SubmitDuringStashedOpApply` error event
  instead of thrown, leaving an off-switch for production rollout.
- Enforce the PSM-documented invariant ("never resubmit during apply
  stashed ops") with an explicit early-return in `replayPendingStates`
  when `isApplyingStashedOps` is true. Closes the connection-transition
  mid-partial-drain hole where `reSubmit` for `Rejoin` / `GC` would have
  hit the submit guard.

New tests cover BlobAttach during apply (allowlisted, no throw), the
kill switch (no throw, error event logged), and `replayPendingStates`
no-op during apply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
The `replayPendingStates` apply-window early-return suppresses replay
during apply, but `setConnectionState(true)` only fires
`replayPendingStates` on the `canSendOps` *edge*. If that edge fires
mid-apply (or while stashed entries are still being pushed into
`pendingMessages` by the apply loop), the trigger is consumed without
effect and nothing re-fires it after the window closes.

Have `onAfterStashedOpsApplied` call `replayPendingStates` after
`notifyReadOnlyState`. The method self-gates on `shouldSendOps()`, and
the PSM sets `_applyLifecycle = "ended"` before firing the hook so the
apply-window early-return won't fire either.

New test asserts the runtime's `replayPendingStates` is invoked when
the close hook fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
Revert the `replayPendingStates` apply-window early-return and the
`onAfterStashedOpsApplied` re-trigger. The canSendOps edge in
`setConnectionStateCore` is the only call site for
`replayPendingStates`, and the loader contract awaits
`applyStashedOpsAt` before transitioning the runtime to a write-capable
connection — so the mid-apply edge the defensive code was guarding
against cannot occur.

Replace the early-return with an assert that fails loudly if the
invariant is ever violated by a future loader change, surfacing the
contract rather than silently swallowing the edge.

Remove the two now-obsolete tests (`replayPendingStates is a no-op
during apply`, `close hook re-triggers replayPendingStates after apply
drains`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts Outdated
Comment thread packages/runtime/container-runtime/src/pendingStateManager.ts Outdated
- `submit()` now unconditionally sends `SubmitDuringStashedOpApply`
  as an error event on every bypass; the kill switch
  `DisableSubmitDuringStashedApplyThrow` only suppresses the throw +
  close. Telemetry attribution is identical whether or not the kill
  switch is engaged.
- Existing throw test now also asserts the error event fires, so
  symmetry with the kill-switch test is explicit.
- `PendingStateManager.applyStashedOpsAt` no longer uses a
  `loopCompleted` flag + `try/finally`. The close block (set
  `_applyLifecycle = "ended"`, fire hook) only ran when the loop
  completed without throwing — moving it inline after the loop
  preserves that property and drops the indirection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
anthony-murphy and others added 4 commits May 13, 2026 16:59
End-to-end coverage for the stashed-op apply submit guard: the original
test in `container-runtime` only verified that `submit()` throws when
`isApplyingStashedOps` is true. This drives the throw through a realistic
listener pattern and asserts the resulting load rejection.

Two SharedMaps live in the same data store. A `valueChanged` listener
on the primary map performs a cascading `set` on the secondary map.
Stashed offline ops replay on the primary during `applyStashedOpsAt`,
firing `valueChanged`; the cascading `set` on the secondary reaches
`ContainerRuntime.submit()` (the channel-level `stashedOpMd` capture
only swallows submits from the channel currently in
`applyStashedOp`, not cross-channel writes) and the guard rejects the
load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spell out in the factory's JSDoc that submitting ops from DDS op-event
handlers is bad practice and that any cascading write must gate on both
`IContainer.readOnlyInfo.readonly` and
`IFluidDataStoreRuntime.activeLocalOperationActivity` to be safe during
stashed-op replay and rollback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DDS handlers should gate on `IFluidDataStoreRuntime.isReadOnly()` (with
the `"readonly"` event as the live signal), not on
`IContainer.readOnlyInfo.readonly` / `ContainerRuntime.isReadOnly()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Don't list the conditions under which `isReadOnly()` returns `true` —
the contract for handlers is simpler: if it's `true`, don't submit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
anthony-murphy and others added 5 commits May 13, 2026 17:12
Replace the hand-rolled `LocalDocumentServiceFactory` / `LocalResolver` /
`LocalCodeLoader` / `ContainerRuntimeFactoryWithDefaultDataStore` setup
with `createLoader` from `src/utils.ts`. Same coverage, fewer imports,
matches the convention in the rest of `local-server-tests`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the explicit `urlResolver` / `documentServiceFactory` pass-through on
the second loader. Sharing the `deltaConnectionServer` is sufficient for
URL resolution; each `createLoader` call can produce its own resolver and
driver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `jsdoc/check-indentation` ESLint rule disallows indented continuation
lines (and numbered lists) inside JSDoc bodies. Flatten the
isReadOnly / activeLocalOperationActivity bullet list into a single
paragraph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The submit-guard `throw` is outside the existing `submit()` try/catch, so
without an explicit `closeFn` call the container only closes when the
caller's promise chain wraps the rejection in `.catch(closeFn)`. The
inline comment and the kill-switch contract both state "throw + close",
but the code only delivered the throw — close happened transitively.

Add `this.closeFn(error)` immediately before `throw error;`. `closeFn`
is idempotent so a caller that also closes won't double-close.

Extend the existing throw test to assert `closeFn` was invoked exactly
once with the UsageError, so the contract is regression-protected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
Comment thread packages/runtime/container-runtime/src/pendingStateManager.ts
Drop `IdAllocation` from the stashed-op-apply guard allowlist. The
downstream assert 0x9a5 ("IdAllocation should be submitted directly to
outbox") already enforces that `IdAllocation` never reaches `submit()`
at all, so the allowlist entry was dead code. If that contract ever
shifts, the assert is the natural place a future author will see the
impact and decide whether to also allowlist for the apply window —
better than a forgotten reference in a comment block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anthony-murphy
Copy link
Copy Markdown
Contributor Author

Deep Review

Reviewed commit 9c50906 on 2026-05-14.

Readiness: 9/10 — ALMOST READY

All correctness concerns from the prior ten rounds are resolved at this HEAD; every one of the 14 posted inline threads is closed. Three Tier 3 polish items remain, all contained one-line edits — flagged inline.

Path to Ready

  • Resolve inline threads

Context for Reviewers

For human reviewer
  • @markfields sign-off — PSM-side changes (PendingStateManagerHooks interface, _applyLifecycle state machine, inlined try/catch rewrite of applyStashedOpsAt, empty grouped-batch handling). Lifecycle test coverage broadly meets the standing ask on shape asserts and empty-batch unit coverage.
  • @vladsud / @ChumpChief sign-offisReadOnly() now layers a third meaning ("transiently quiescing for stashed-op replay") onto the predicate PR Move op replay count telemetry to the runtime layer #16104 worked to disambiguate, and the new replayPendingStates hard assert inverts the soft-guard pattern they originally established on Fix issue #2724 and forceReadonly API while connected #4239. Author declined a separating refactor on thread #3231280593 and articulated the loader-await contract; explicit owner sign-off is the goal state.
  • @chentong7 / @alexvy86 / @Josmithr (telemetry) sign-offbaseLogger reshape stamps inStagingMode / isApplyingStashedOps / isReadOnly on every error event from downstream consumers (BlobManager, Summarizer, DataStoreContext). The subtractive narrowing of inStagingMode from properties.all to properties.error (thread #3231280638) should be audited against downstream Kusto queries and dashboards.
  • @alexvy86 / @Josmithr API-surface sign-off — the constructor parameter rename baseLoggerlogger (see inline thread on containerRuntime.ts) reverses an explicit refactor(container-runtime): Tighten up exposure of ContainerRuntime's logger #20839 review decision on the alpha public API. Owner judgment on whether this is acceptable surface motion.
  • Loader-contract assumption — the assert(!isApplyingStashedOps) in replayPendingStates relies on the loader awaiting applyStashedOpsAt before any canSendOps edge fires. Asserted in runtime code but not documented in the IRuntime/loader interface; a reviewer with loader-runtime sequencing expertise should confirm this holds for all loadContainer(pendingLocalState) paths.
  • Cross-fork / runtime determinism — cannot be assessed by the pipeline. The PR materially changes when DDSes see isReadOnly() === true during load, which can affect realize-time behavior for any DDS that branches on readonly at construction.
Review history (10 prior reviews)
  • 30e8fd3 2026-05-14 · 9/10 — both prior Tier 2 concerns resolved; two Tier 3 polish items remained inline
  • 5d8aad9 2026-05-14 · 5/10 — submit-guard closeFn Tier 2 still open; empty-batch unit-coverage gap flagged
  • 19a27ca 2026-05-13 · 5/10 — submit-guard closeFn Tier 2 still open at this HEAD
  • 80b7b9a 2026-05-13 · 4/10 — submit-guard throw path Tier 2 still open; baseLogger and isReadOnly() reshape concerns re-flagged
  • d6b5944 2026-05-13 · 5/10 — submit-guard throw path Tier 2 surfaced (closeFn not invoked)
  • e6bcc3a 2026-05-13 · 9/10 — call-order Tier 2 dissolved by revert; one Tier 3 polish item remained
  • 25ccf6d 2026-05-13 · 6/10 — call-order Tier 2 reintroduced after stranded-ops fix
  • d405321 2026-05-13 · 5/10 — sweep resolved three prior Tier 2s; new Tier 2 on stranded ops after the replayPendingStates early-return
  • bc58bea 2026-05-13 · 4/10 — three correctness concerns on the new submit() guard (BlobAttach route, Rejoin/GC reSubmit, missing kill-bit)
  • 442d94b 2026-05-13 · 5/10 — Tier 2 sharePendingBlobs → BlobAttach route into the new guard flagged; four Tier 3 polish threads still open

Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
@anthony-murphy anthony-murphy marked this pull request as ready for review May 14, 2026 21:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Introduces a guarded “stashed-op apply window” so DDSes can’t submit local ops during stashed-op replay, preventing pending-local-state mismatches and surfacing violations as attributable errors.

Changes:

  • Added a one-way apply lifecycle to PendingStateManager with a close hook when stashed ops fully drain.
  • Updated ContainerRuntime readonly aggregation and submit() to detect/handle submits during apply (with allowlist + kill switch).
  • Added unit + end-to-end tests covering lifecycle, submit behavior, and the mismatch scenario.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
packages/test/local-server-tests/src/test/submitDuringStashedOpApply.spec.ts New end-to-end repro asserting load rejects when a listener submits during stashed-op apply.
packages/runtime/container-runtime/src/test/pendingStateManager.spec.ts New lifecycle tests + adjusted existing test setup to avoid flush during apply window.
packages/runtime/container-runtime/src/test/containerRuntime.spec.ts New runtime tests for submit-during-apply behavior, allowlist, and kill switch.
packages/runtime/container-runtime/src/pendingStateManager.ts Adds apply lifecycle state + optional close hook and early return optimization.
packages/runtime/container-runtime/src/containerRuntime.ts Aggregates readonly with apply-window state, reworks baseLogger tagging, adds submit guard + replay assert, adjusts readonly notifications.

Comment on lines +152 to +153
* Synchronous: fires from a `finally` block where async behavior would
* complicate error propagation.
// recoverable state.
if (this._applyLifecycle === "applying" && this.initialMessages.isEmpty()) {
this._applyLifecycle = "ended";
this.hooks.onAfterStashedOpsApplied?.();
Comment on lines +2955 to +2959
// `channelCollection` may be undefined when invoked from the PSM's
// open hook, which fires from `new PendingStateManager(...)` earlier
// in this constructor than `channelCollection` is assigned. That's
// fine — DDSes created after this point will read the runtime's
// `isReadOnly()` aggregation and start out in the correct state.
Comment on lines +85 to +146

// 1. Create the container with two named SharedMaps, attach, write,
// disconnect, write offline, capture pending local state.
const goodFactory = new TestFluidObjectFactory(
[
["primary", SharedMap.getFactory()],
["secondary", SharedMap.getFactory()],
],
"default",
);
const {
codeDetails,
loaderProps: goodLoaderProps,
urlResolver,
} = createLoader({
deltaConnectionServer,
defaultDataStoreFactory: goodFactory,
});

const container = asLegacyAlpha(
await createDetachedContainer({ codeDetails, ...goodLoaderProps }),
);

const initialObject = (await container.getEntryPoint()) as ITestFluidObject;
const primary = await initialObject.getSharedObject<ISharedMap>("primary");
primary.set("pre-attach", "value");

await container.attach(urlResolver.createCreateNewRequest("submit-during-apply"));
primary.set("attached", "value");

const url = await container.getAbsoluteUrl("");
assert(url !== undefined, "container should have a URL after attach");

container.disconnect();
primary.set("offline", "value");

const pendingLocalState = await container.getPendingLocalState();
container.close();

// 2. Build a separate loader that, on existing=true loads, wires up a
// valueChanged listener on `primary` that performs a cascading set
// on `secondary`. Share the resolver and driver so the URL produced
// above resolves on the new loader.
const { loaderProps: reactingLoaderProps } = createLoader({
deltaConnectionServer,
defaultDataStoreFactory: new ReactingMapFactory(goodFactory),
});

// 3. The stashed `offline` op fires `valueChanged` on `primary` during
// apply; the listener's `secondary.set` reaches the runtime's
// submit guard (a different channel from the one in applyStashedOp,
// so the channel-level stashedOpMd capture doesn't swallow it), and
// the load rejects.
await assert.rejects(
loadExistingContainer({
...reactingLoaderProps,
request: { url },
pendingLocalState,
}),
(error: Error & { message?: string }) =>
error.message?.includes("Local op submitted during stashed-op apply window") === true,
);
Comment on lines +1047 to +1060
pendingStateManager.pendingMessages.push({
type: "message",
content: '{"type":"component"}',
referenceSequenceNumber: message.referenceSequenceNumber,
localOpMetadata: undefined,
opMetadata: undefined,
batchInfo: {
clientId: "CLIENT_ID",
batchStartCsn: 0,
length: 1,
staged: false,
},
runtimeOp: message.runtimeOp,
});
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants