chore(container-runtime): remove DisableBatchIdTracking kill-switch #27296

Open
dannimad wants to merge 2 commits into microsoft:main from dannimad:remove-disable-batch-id-tracking

@dannimad

Description

Removes the Fluid.ContainerRuntime.DisableBatchIdTracking config kill-switch and ties batchId tracking (which powers DuplicateBatchDetector and batchId stamping on resubmit) to the existing Fluid.Container.enableOfflineFull opt-in. The kill-switch had no known consumers, and enableOfflineFull is the natural off-ramp if a regression in batchId tracking appears.

Behavior change: containers that do not opt into Offline Load no longer run DuplicateBatchDetector. Forked-container duplicate detection now requires the Offline Load opt-in. The internal class member batchIdTrackingEnabled is renamed to offlineEnabled to reflect its new single source of truth.
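The gating change described above can be sketched roughly as follows. This is a simplified illustration, not the actual containerRuntime.ts code: `ConfigLike`, `makeConfig`, `trackingEnabledBefore`, and `trackingEnabledAfter` are hypothetical names, and the "before" condition is reconstructed from the wording in this PR (tracking on by default under turn-based flushing plus grouped batching, unless the kill-switch is set).

```typescript
// Hypothetical stand-in for the runtime's config provider.
interface ConfigLike {
	getBoolean(name: string): boolean | undefined;
}

const makeConfig = (values: Record<string, boolean | undefined>): ConfigLike => ({
	getBoolean: (name) => values[name],
});

// Before (assumed shape): default-on when turn-based flushing and grouped
// batching are both enabled, unless the kill-switch is explicitly set.
function trackingEnabledBefore(
	config: ConfigLike,
	turnBased: boolean,
	groupedBatching: boolean,
): boolean {
	const killSwitch =
		config.getBoolean("Fluid.ContainerRuntime.DisableBatchIdTracking") === true;
	return !killSwitch && turnBased && groupedBatching;
}

// After: the Offline Load opt-in is the single source of truth.
function trackingEnabledAfter(config: ConfigLike): boolean {
	return config.getBoolean("Fluid.Container.enableOfflineFull") === true;
}

const noFlags = makeConfig({});
const optedIn = makeConfig({ "Fluid.Container.enableOfflineFull": true });

const beforeDefault = trackingEnabledBefore(noFlags, true, true);
const afterDefault = trackingEnabledAfter(noFlags);
const afterOptedIn = trackingEnabledAfter(optedIn);
console.log({ beforeDefault, afterDefault, afterOptedIn });
```

Under these assumptions, a container with no flags set goes from tracking on to tracking off, which is the behavior change called out above.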

Reviewer Guidance

The review process is outlined on this wiki page.

  • The semantic narrowing (DuplicateBatchDetector now offline-only) is the load-bearing call. Removing only the kill-switch wouldn't gate that — the rename + simplification only make sense if the narrowing is acceptable.
  • Two existing tests (Process empty batch, Can roundtrip DuplicateBatchDetector state through summary/snapshot) now set enableOfflineFull: true because they exercise behavior that is now gated.
  • fewerBatches.spec.ts "Op reentry submits two batches" previously used the kill-switch to suppress the detector for an unrelated artificial sequence-number reuse; it now relies on the default-off behavior with no flag.

Removes the Fluid.ContainerRuntime.DisableBatchIdTracking config kill-switch
and gates batchId tracking on the existing Fluid.Container.enableOfflineFull
opt-in. The internal class member is renamed to offlineEnabled to reflect
its single source of truth. Containers that do not opt into Offline Load
no longer run DuplicateBatchDetector; the natural off-ramp for a batchId
tracking regression is now to disable enableOfflineFull.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 13, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (117 lines, 5 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

@dannimad dannimad marked this pull request as ready for review May 13, 2026 00:11
@dannimad dannimad requested a review from a team as a code owner May 13, 2026 00:11
Copilot AI review requested due to automatic review settings May 13, 2026 00:11

Copilot AI left a comment

Pull request overview

This PR removes the Fluid.ContainerRuntime.DisableBatchIdTracking kill-switch and makes batchId tracking (including DuplicateBatchDetector and batchId stamping on resubmit) conditional on the existing Offline Load opt-in (Fluid.Container.enableOfflineFull). This narrows duplicate batch detection to the “forked container via Offline Load” scenario and updates tests/documentation accordingly.

Changes:

  • Removed the DisableBatchIdTracking config kill-switch and gated batchId tracking on Fluid.Container.enableOfflineFull.
  • Updated container-runtime tests to explicitly opt into Offline Load when exercising batchId tracking / duplicate detection behavior.
  • Updated end-to-end test to rely on default-off detector behavior (no longer needs the kill-switch).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File: Description
  • packages/test/test-end-to-end-tests/src/test/fewerBatches.spec.ts: Removes use of the kill-switch in the out-of-order op test and updates commentary to reflect Offline Load gating.
  • packages/runtime/container-runtime/src/test/containerRuntime.spec.ts: Updates expectations and opts tests into Offline Load where batchId tracking / duplicate detection is required.
  • packages/runtime/container-runtime/src/containerRuntime.ts: Replaces kill-switch + default-on tracking with enableOfflineFull as the single source of truth; renames the internal flag accordingly.
  • .changeset/calm-batches-track.md: Adds a release note entry describing the removal and behavior change.

Comments suppressed due to low confidence (1)

packages/runtime/container-runtime/src/containerRuntime.ts:1908

  • Same as above: this UsageError message refers to "Offline mode" while the rest of the code/comments refer to the "Offline Load" feature. Updating the wording would make the error clearer and consistent.
		if (this.offlineEnabled && !this.groupedBatchingEnabled) {
			const error = new UsageError("Offline mode requires grouped batching to be enabled");
			this.closeFn(error);
			throw error;

Comment on lines +6 to +10
Remove the `Fluid.ContainerRuntime.DisableBatchIdTracking` config kill-switch and gate batchId tracking on the Offline Load opt-in

The internal `Fluid.ContainerRuntime.DisableBatchIdTracking` config flag has been removed. It was previously used as a kill-switch to suppress batchId stamping and `DuplicateBatchDetector` activation when both `FlushMode.TurnBased` and grouped batching were enabled. The flag is no longer needed: batchId tracking is now enabled iff the Offline Load feature is opted into via `Fluid.Container.enableOfflineFull`, which is also the natural off-ramp if a regression is observed.

Containers that do not opt into Offline Load no longer run `DuplicateBatchDetector`. Forked-container duplicate detection now requires the Offline Load opt-in.

Comment on lines +1900 to 1903
		if (this.offlineEnabled && this._flushMode !== FlushMode.TurnBased) {
			const error = new UsageError("Offline mode is only supported in turn-based mode");
			this.closeFn(error);
			throw error;

		// sent as a placeholder grouped batch to preserve their batchId (see
		// OpGroupingManager.createEmptyGroupedBatch / outbox.flushEmptyBatch).
		this.offlineEnabled =
			this.mc.config.getBoolean("Fluid.Container.enableOfflineFull") === true;

Deep Review: The new runtime gate is strictly narrower than the loader's offline opt-in, which silently regresses fork detection for Fluid.Container.enableOfflineLoad consumers.

The divergence. Runtime now derives offlineEnabled from a single config:

this.offlineEnabled = this.mc.config.getBoolean("Fluid.Container.enableOfflineFull") === true;

But the loader treats offline-load as enabled when any of three signals is set (packages/loader/container-loader/src/container.ts:964-968):

const offlineLoadEnabled =
    this.isInteractiveClient &&
    (this.mc.config.getBoolean("Fluid.Container.enableOfflineLoad") ??
     this.mc.config.getBoolean("Fluid.Container.enableOfflineFull") ??
     options.enableOfflineLoad !== false);

The first arm — Fluid.Container.enableOfflineLoad — is a live (non-deprecated) config flag. Only the third typed-options arm is @deprecated Do not use (packages/common/container-definitions/src/loader.ts:608-616).
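To make the divergence concrete, here is a minimal sketch that evaluates both gates against the same config. `ConfigLike`, `makeConfig`, `loaderOfflineLoadEnabled`, and `runtimeOfflineEnabled` are hypothetical helpers that only mirror the shape of the two expressions quoted above; the typed option is reduced to a plain optional boolean. Note that `??` returns the first value that is not `undefined`, so an explicit `false` on an earlier arm also wins.

```typescript
// Hypothetical stand-in for the config provider used by both layers.
interface ConfigLike {
	getBoolean(name: string): boolean | undefined;
}

const makeConfig = (values: Record<string, boolean | undefined>): ConfigLike => ({
	getBoolean: (name) => values[name],
});

// Mirrors the shape of the loader expression quoted above.
function loaderOfflineLoadEnabled(
	config: ConfigLike,
	isInteractiveClient: boolean,
	optionsEnableOfflineLoad?: boolean,
): boolean {
	return (
		isInteractiveClient &&
		(config.getBoolean("Fluid.Container.enableOfflineLoad") ??
			config.getBoolean("Fluid.Container.enableOfflineFull") ??
			(optionsEnableOfflineLoad !== false))
	);
}

// Mirrors the shape of the new runtime gate.
function runtimeOfflineEnabled(config: ConfigLike): boolean {
	return config.getBoolean("Fluid.Container.enableOfflineFull") === true;
}

const legacyOnly = makeConfig({ "Fluid.Container.enableOfflineLoad": true });
const noFlags = makeConfig({});
const legacyOffFullOn = makeConfig({
	"Fluid.Container.enableOfflineLoad": false,
	"Fluid.Container.enableOfflineFull": true,
});

for (const [label, config] of [
	["enableOfflineLoad only", legacyOnly],
	["no flags", noFlags],
	["enableOfflineLoad=false, enableOfflineFull=true", legacyOffFullOn],
] as const) {
	console.log(label, {
		loader: loaderOfflineLoadEnabled(config, true),
		runtime: runtimeOfflineEnabled(config),
	});
}
```

In this sketch the first two configurations leave the loader enabled while the runtime gate is off. The third shows the opposite mismatch, which follows from `??` semantics rather than from anything stated in this thread.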

Concrete consequence. A host opted into offline load via Fluid.Container.enableOfflineLoad continues to wire up SerializedStateManager / closeAndGetPendingLocalState at the loader, but the runtime no longer instantiates DuplicateBatchDetector (this file, the if (this.offlineEnabled) allocation) and no longer stamps batchId on resubmit (batchId: this.offlineEnabled ? batchId : undefined). The exact forked-container scenario DuplicateBatchDetector was built to catch (#21767, #22497) goes silently undetected.
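The consequence can be illustrated with a deliberately simplified model. `SimpleDuplicateDetector` and `processForkedResubmits` are hypothetical; the real DuplicateBatchDetector is not shown in this PR and certainly tracks more state. The batchId value is also made up. The sketch only shows why an unstamped batch cannot be recognized as a duplicate.

```typescript
// Hypothetical, simplified stand-in for a duplicate-batch detector.
class SimpleDuplicateDetector {
	private readonly seen = new Set<string>();

	/** Returns true if a batch with this id was already processed. */
	public isDuplicate(batchId: string | undefined): boolean {
		if (batchId === undefined) {
			return false; // unstamped batches cannot be matched
		}
		if (this.seen.has(batchId)) {
			return true;
		}
		this.seen.add(batchId);
		return false;
	}
}

// Two forks of the same container resubmit the same logical batch.
function processForkedResubmits(offlineEnabled: boolean): boolean[] {
	// Detector is only allocated when the gate is on.
	const detector = offlineEnabled ? new SimpleDuplicateDetector() : undefined;
	const batchId = "illustrative-batch-id";
	// batchId is only stamped on resubmit when the gate is on.
	const stamped = offlineEnabled ? batchId : undefined;
	return [stamped, stamped].map((id) => detector?.isDuplicate(id) ?? false);
}

const withGate = processForkedResubmits(true);
const withoutGate = processForkedResubmits(false);
console.log({ withGate, withoutGate });
```

With the gate on, the second resubmit is flagged. With it off, both pass through unflagged, which is the silent failure described above.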

Why pre-PR was safe. Pre-PR, runtime tracking was default-on (TurnBased + grouped batching ⇒ tracking on), accidentally papering over the divergence so enableOfflineLoad consumers still got the detector. This PR removes that safety net without unifying the gates or documenting the migration. The changeset only states "Forked-container duplicate detection now requires the Offline Load opt-in" — implying a single opt-in when there are three at the loader layer.

Historical. tyler-cai-microsoft flagged this exact multi-flag/loader-runtime back-compat concern in #21767 inline on containerRuntime.ts:4134 (2024-07-10); markfields replied "I'll follow up... since none of this has shipped yet." The follow-up never landed; this PR cements the divergence.

Precedent for the union. packages/loader/container-loader/src/snapshotRefresher.ts:98-101 already accepts either Fluid.Container.enableOfflineSnapshotRefresh or Fluid.Container.enableOfflineFull — the codebase has a pattern for honoring multiple offline opt-ins at downstream layers.

Pick one:

  • (a) Widen the runtime gate to enableOfflineLoad ?? enableOfflineFull, mirroring snapshotRefresher.ts:98-101. Lowest-risk — preserves pre-PR behavior for legacy-flag consumers without re-introducing the kill-switch.
  • (b) Propagate the loader's resolved offlineLoadEnabled (the full three-arm union from container.ts:964-968) into the runtime via IContainerContext so both layers always agree. Bigger surface change, eliminates the divergence permanently.
  • (c) If the narrowing is intentional, expand .changeset/calm-batches-track.md and the offlineEnabled doc-comment (containerRuntime.ts:1457-1464) to call out that hosts using Fluid.Container.enableOfflineLoad (or relying on IContainerLoadMode.enableOfflineLoad: true) must migrate to Fluid.Container.enableOfflineFull to retain DuplicateBatchDetector and resubmit batchId stamping. Add a UsageError or telemetry warning when the loader has serialized-pending-state plumbing engaged but the runtime gate is off. If you go this route, also clarify the outbox.ts:358-368 "we do it always" comment to reference enableOfflineFull, since flushAll's placeholder grouped-batch path is now transitively gated on the same flag.
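As a rough sketch of option (a), the widened gate could look like the following. `ConfigLike`, `makeConfig`, `runtimeGateCurrent`, and `runtimeGateWidened` are hypothetical names; the expression is taken literally from the option's wording (`enableOfflineLoad ?? enableOfflineFull`) and has not been checked against snapshotRefresher.ts.

```typescript
// Hypothetical stand-in for the runtime's config provider.
interface ConfigLike {
	getBoolean(name: string): boolean | undefined;
}

const makeConfig = (values: Record<string, boolean | undefined>): ConfigLike => ({
	getBoolean: (name) => values[name],
});

// Gate as written in this PR.
function runtimeGateCurrent(config: ConfigLike): boolean {
	return config.getBoolean("Fluid.Container.enableOfflineFull") === true;
}

// Option (a), read literally: honor the legacy config flag as well.
function runtimeGateWidened(config: ConfigLike): boolean {
	return (
		(config.getBoolean("Fluid.Container.enableOfflineLoad") ??
			config.getBoolean("Fluid.Container.enableOfflineFull")) === true
	);
}

const legacyOnly = makeConfig({ "Fluid.Container.enableOfflineLoad": true });
const fullOnly = makeConfig({ "Fluid.Container.enableOfflineFull": true });
const neither = makeConfig({});

console.log({
	legacyOnly: [runtimeGateCurrent(legacyOnly), runtimeGateWidened(legacyOnly)],
	fullOnly: [runtimeGateCurrent(fullOnly), runtimeGateWidened(fullOnly)],
	neither: [runtimeGateCurrent(neither), runtimeGateWidened(neither)],
});
```

This variant restores the detector for hosts that set only the legacy config flag. It does not cover the typed-options arm, which would need option (b), and because `??` takes the first defined value, an explicit `enableOfflineLoad: false` would switch tracking off even when `enableOfflineFull` is `true`.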

Question for you: for consumers currently enabling offline load via Fluid.Container.enableOfflineLoad (the config flag, not the deprecated typed option), is the intent that they migrate to Fluid.Container.enableOfflineFull, or should the runtime honor the union?

@anthony-murphy

Deep Review

Reviewed commit 2cf9384 on 2026-05-12.

Readiness: 5/10 — MAKING PROGRESS

Not ready for sign-off. The single-flag runtime gate (Fluid.Container.enableOfflineFull) is still strictly narrower than the loader's three-signal offline-load union, so consumers opted in via the live Fluid.Container.enableOfflineLoad config or the default-true IContainerOptions.enableOfflineLoad arm silently lose DuplicateBatchDetector and resubmit batchId stamping. Same finding as the prior review; the inline thread on containerRuntime.ts:1898 is unanswered and the new commit does not address it. Score slips one step to reflect that.

Path to Ready

  • Resolve inline threads — pick option (a) widen the runtime gate, (b) plumb the loader's resolved offlineLoadEnabled through IContainerContext, or (c) keep the narrowing and document the migration + add a telemetry/UsageError warning when loader-side serialization plumbing is engaged but offlineEnabled is false (and clarify the outbox.ts:358-368 "we do it always" comment to reference enableOfflineFull).
  • Add a regression test for the uncovered arm — enableOfflineLoad: true (or the default-true options.enableOfflineLoad path) without enableOfflineFull, asserting the chosen option's behavior (detector either present (a/b) or explicitly absent with telemetry (c)).
  • Expand .changeset/calm-batches-track.md to call out enableOfflineLoad (config) and IContainerOptions.enableOfflineLoad (public option) explicitly — the current text says "Forked-container duplicate detection now requires the Offline Load opt-in" as if there's one opt-in when the loader recognizes three.
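For option (c), the mismatch warning and a table-driven regression check might look something like this. Everything here is hypothetical: `GateState`, `checkGateAgreement`, and the warning text are illustrative, and how the runtime would actually learn the loader's resolved state is left open.

```typescript
// Hypothetical snapshot of the two resolved gates.
interface GateState {
	loaderOfflineLoadEnabled: boolean;
	runtimeOfflineEnabled: boolean;
}

// Returns a warning message when the loader has offline load engaged
// but the runtime is not tracking batchIds; otherwise undefined.
function checkGateAgreement(state: GateState): string | undefined {
	if (state.loaderOfflineLoadEnabled && !state.runtimeOfflineEnabled) {
		return (
			"Offline load is enabled at the loader but batchId tracking is off; " +
			"set Fluid.Container.enableOfflineFull to retain DuplicateBatchDetector"
		);
	}
	return undefined;
}

// Table-driven cases covering the arm that currently has no test.
const cases: { name: string; state: GateState; expectWarning: boolean }[] = [
	{
		name: "legacy opt-in without enableOfflineFull",
		state: { loaderOfflineLoadEnabled: true, runtimeOfflineEnabled: false },
		expectWarning: true,
	},
	{
		name: "enableOfflineFull set",
		state: { loaderOfflineLoadEnabled: true, runtimeOfflineEnabled: true },
		expectWarning: false,
	},
	{
		name: "offline load disabled everywhere",
		state: { loaderOfflineLoadEnabled: false, runtimeOfflineEnabled: false },
		expectWarning: false,
	},
];

const results = cases.map((c) => ({
	name: c.name,
	passed: (checkGateAgreement(c.state) !== undefined) === c.expectWarning,
}));
console.log(results);
```

A real test would live alongside the existing container-runtime specs and assert on telemetry or a UsageError rather than a returned string.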

Context for Reviewers

For human reviewer
  • Needs human judgment — Is the runtime/loader gate divergence intentional (callers should migrate to enableOfflineFull) or accidental (the runtime should honor the same union)? markfields and anthony-murphy own that call; tyler-cai-microsoft originally raised the concern in Offline: Add batchId to batch metadata on resubmit #21767 and should weigh in given the explicit "I'll follow up" deferral that this PR effectively closes.
  • Cannot be assessed by the pipeline — Confirm the offline e2e tests in stashedOps.spec.ts (Single-Threaded Fork 2205-2246, Parallel Forks 2249-2380, serial double-hydration 2383-2436) still pass under the new gate. Given the file's churn profile, a green-CI confirmation is worthwhile before merge.
Review history (1 prior review)
  • 9957a02 2026-05-12 · 6/10 — single-flag runtime gate strictly narrower than the loader's three-signal union; one inline thread

@anthony-murphy

@dannimad I don't think we should remove this yet. The feature hasn't even shipped yet. The kill-switch exists for us to turn this off if we see issues running the feature. We can't remove it until we are confident the feature is working well.