Skip to content

PagePool: materialize the reserved module/command tab; plumb priority through definition lookups#4863

Merged
habdelra merged 18 commits into
mainfrom
prerender-deadlock-pagepool-priority
May 19, 2026
Merged

PagePool: materialize the reserved module/command tab; plumb priority through definition lookups#4863
habdelra merged 18 commits into
mainfrom
prerender-deadlock-pagepool-priority

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

Why

Staging investigation of stuck indexing on ctse/ambitious-piranha (914 rows) and ctse/prudent-octopus (255 rows) found every errored row hitting the same fingerprint in timing_diagnostics:

  • renderStage = waiting-stability
  • queryLoadsInFlight > 0, all with source: search-resource:*
  • cardDocLoadsInFlight = [], inFlightModuleImports = []
  • affinitySnapshot.sameAffinityActivity carrying {kind: module, queue: module, state: queued} entries on the same affinity as the stuck card render

That is the documented self-referential prerender deadlock from the indexing-diagnostics skill rubric:

  1. A card render (e.g. PolicyTrackingDashboard.json) renders and the template fires <Search @query={{filter: {type: {module: '.../cohort.gts', name: 'Cohort'}}}}> against _federated-search.
  2. The search needs the Cohort card definition to plan SQL.
  3. CachingDefinitionLookup.lookupDefinition misses (the modules table doesn't have a row for cohort.gts).
  4. It spawns a prerenderModule call for cohort.gts — same affinity.
  5. The affinity already has its tabs held by the dashboard render that's awaiting the search. The sub-module-prerender queues on the per-tab queue.
  6. The dashboard render waits on the search → which waits on the definition → which waits on the queued module render → which waits on the dashboard's tab. Deadlock.
  7. After 90s, the render-timeout fires and the row gets written as has_error.

The cycle was reproduced in 6 distinct erroring rows in ambitious-piranha and 2 in prudent-octopus. Every recent indexing job for ambitious-piranha has been rejected for ~similar reasons — the deadlock can also kill the whole batch when it gets caught by the manager-side 150s /prerender-visit timeout (job 410276 today fits that shape).

What this PR changes

Two narrowly-scoped fixes that together close the deadlock. Conceptually independent enough to review separately, but they belong in one PR because they fix the same underlying bug from two angles.

1. PagePool.#selectEntryForAffinity: materialize the reserved module/command tab

The file-admission cap (`#fileAdmissionCap = max(1, affinityTabMax - 1)`) is supposed to keep one tab's worth of global-pool headroom per affinity for module/command work. But the cap only limits file callers — it doesn't produce the reserved tab. The existing spawn branch in #selectEntryForAffinity had only three ways to actually create a tab for a non-file caller:

  1. Claim an orphan context (only if the affinity had a previously-evicted shared BrowserContext lying around).
  2. Commandeer a dormant standby (only if the standby pool happened to have a spare).
  3. The canExpandPastAffinityCap high-priority tier expansion (dormant in default config — requires PRERENDER_PAGE_POOL_HIGH_PRIORITY_MAX and PRERENDER_HIGH_PRIORITY_THRESHOLD both set).

When (1) and (2) both miss and (3) is dormant, the call falls through to #selectLeastPendingTab(entryList) and queues on the existing busy tab. That is the deadlock state.

The fix adds a fourth path between the failed commandeer and the canExpandPastAffinityCap block: if the caller is non-file (queue !== 'file') and the affinity is under cap (`entryList.length < affinityTabMax`), synchronously await #ensureStandbyPool and retry commandeer. Same shape as the brand-new-affinity branch (`entryList.length === 0`) immediately below — generalized to existing affinities with at least one already-busy tab.

The fix is independent of priority. The reservation invariant says module/command work always gets a tab when one is available under the per-affinity cap; gating that on priority would mean the invariant only holds for some callers, which is the bug, not the contract.

Why this is safe at scale:

  • #ensureStandbyPool respects #maxPages via #prepareSlotForStandby (caps total contexts at #maxPages + 1, LRU-evicts if needed). It cannot oversubscribe the global pool.
  • #commandeerDormantTab({ standbyOnly: true }) only succeeds against a real standby — no cross-affinity steal.
  • File callers still acquire #fileAdmissionCap upstream of #selectEntryForAffinity, so cross-realm fairness for file workload is unchanged.

2. CachingDefinitionLookup: plumb priority through definition sub-renders

Today getModuleDefinitionsViaPrerenderer calls prerenderer.prerenderModule({ affinityType, affinityValue, realm, url, auth }) — no priority argument. That means user-initiated reindex work (priority 10) silently downgrades to priority 0 for its definition-lookup sub-prerenders. The high-priority tier expansion (the third tab-creation path above) is gated on caller priority, so this silent downgrade also prevents the existing escape hatch from firing when it could have.

Added a narrow public option type DefinitionLookupOptions = { priority?: number } and threaded it through:

  • DefinitionLookup.getModuleCacheEntry(url, opts?) (interface + both implementations)
  • loadModuleCacheEntry / loadModuleCacheEntryUncached / loadModuleCacheEntryCoordinated (private args)
  • getModuleDefinitionsViaPrerenderer(url, realmURL, userId, priority?)
  • prerenderer.prerenderModule({ ..., priority })

prerenderModule and downstream already accepted and honored priority end-to-end (render-runner.ts passes it to getPage which uses it for per-tab queue / admission semaphore / global render semaphore priority-aware dequeue). The chain was plumbing-only on the lookup side.

LookupContext (the existing internal type with { requestingRealm }) was the wrong public surface — it carries internal lookup state callers shouldn't see. The new DefinitionLookupOptions is the narrow public option type.

Callers updated:

  • IndexRunner.preWarmModulesTable now passes { priority: this.#jobPriority }. this.#jobPriority is already threaded from the worker job's JobInfo through to IndexRunner (existing infrastructure; we just stopped throwing it away on the last hop).
  • RealmIndexQueryEngine.searchCards HTTP path: no notion of request priority on the realm-server side; passes undefined (defaults to 0), preserving current behavior.

What this does NOT do

  • Does not address the broader "pre-warm walks the wrong modules" problem — that's coming in PR B. PR B uses reader.mtimes() to enumerate every .gts/.gjs in the realm and pre-warms all of them (not just the ones in per-row deps). PR B reduces how often the page-pool deadlock-prevention path even has to fire.
  • Does not parallelize pre-warm. That's PR C, which lands after PR B's measurement establishes a baseline.
  • Does not touch clearRealmCache's post-reindex realm-wide wipe (which re-empties the modules table after every full reindex). Out-of-scope follow-up.

Tests

tests/page-pool-standby-refill-test.ts

New test: 'module call on an existing affinity with all owned tabs busy materializes a fresh tab'.

Holds a file render on affinity A (one tab, not released), then fires a module call on A. Asserts the module call resolves promptly with a different pageId than the held tab — both that no deadlock occurred and that the module render landed on its own tab. Bounded by a 2s timer so a regression fails the test cleanly instead of hanging.

tests/definition-lookup-test.ts

Two new tests:

  • 'getModuleCacheEntry forwards priority to prerenderModule' — captures the priority arg the mock prerenderer receives. Asserts:
    • An explicit { priority: 10 } forwards as 10.
    • A subsequent call on a cached entry short-circuits at the cache (priority isn't even consulted).
    • An invalidation + omitted priority forwards as undefined (which the prerender server treats as 0).
  • 'getModuleCacheEntry caches non-card modules as no-card markers' — direct contract test for the no-card-marker behavior PR B's cost analysis depends on. Stubs the mock prerenderer to return { status: 'ready', definitions: {} }. Asserts:
    • The empty-definitions row persists to the modules table (verified by direct SQL).
    • A second getModuleCacheEntry short-circuits at the cache without re-invoking the prerenderer.

Both tests live in the existing definition-lookup-test.ts module structure and run in <100ms each.

Verification

Local (full realm-server suite takes too long for in-loop validation — relying on CI for full regression):

```
$ pnpm --filter @cardstack/realm-server run test --filter 'DefinitionLookup'
ok 1 .. 28 (all 28 DefinitionLookup tests pass, including 2 new)
$ pnpm --filter @cardstack/realm-server run test --filter 'module call on an existing'
ok 1 page-pool-standby-refill-test.ts > module call on an existing affinity with all owned tabs busy materializes a fresh tab
```

Both pnpm --filter @cardstack/runtime-common run lint and pnpm --filter @cardstack/realm-server run lint:js clean for the files this PR touches. (The pre-existing prettier baseline in generated realms/.../bxl-chunks/*.ts files is untouched by this PR.)

After merge + staging deploy, verification will be:

  • Trigger _full-reindex on ctse/prudent-octopus and ctse/ambitious-piranha.
  • Check boxel_index rows for those realms: `SELECT count(*) FILTER (WHERE has_error) FROM boxel_index WHERE realm_url = ?` — expect zero (vs. 8 today across the two realms).
  • Spot-check timing_diagnostics->'affinitySnapshot'->'sameAffinityActivity' on the heaviest rows (dashboards, cohorts): no {state: queued} entries.
  • Grep prerender-server logs for the dashboard's requestId=…: confirm any sub-module-prerender ran on a fresh pageId distinct from the dashboard's pageId.

Test plan

  • CI passes on the new branch
  • Staging deploy + _full-reindex of ctse/prudent-octopus lands with zero errored rows
  • Staging deploy + _full-reindex of ctse/ambitious-piranha lands with zero errored rows
  • Spot-check affinitySnapshot.sameAffinityActivity on the heaviest rows post-fix — no state: queued entries

🤖 Generated with Claude Code

… through definition lookups

Two narrowly-scoped fixes for the self-referential prerender deadlock that
surfaces when a card render fires `_federated-search`, the search needs a
card definition not yet in the modules table, `CachingDefinitionLookup`
spawns a `prerenderModule` for the missing module, and that sub-prerender
queues behind the very tab the file render is holding.

1. PagePool#selectEntryForAffinity: when a non-file caller enters the
   spawn branch and both orphan-claim + standby-commandeer miss, await
   `#ensureStandbyPool` synchronously and retry commandeer — regardless
   of priority and regardless of whether the high-priority tier is
   configured. The file-admission cap reserves global-pool headroom for
   one module/command slot per affinity; this change actually produces
   the tab that headroom was being held for. Mirrors the existing brand-
   new-affinity branch (entryList.length === 0) one branch further down.
   `#ensureStandbyPool` respects `#maxPages` via `#prepareSlotForStandby`,
   so this can't oversubscribe the global pool.

2. `CachingDefinitionLookup.getModuleDefinitionsViaPrerenderer`: thread
   priority through `prerenderModule(...)`. Today the call drops priority
   on the floor — user-initiated reindex work (priority 10) silently
   downgrades to priority 0 for its definition-lookup sub-prerenders.
   `LookupContext` was the wrong public surface (it carries
   `requestingRealm`, internal); added a narrower `DefinitionLookupOptions`
   type with `{ priority?: number }` instead and threaded it through
   `getModuleCacheEntry` → `loadModuleCacheEntry{,Uncached,Coordinated}` →
   `getModuleDefinitionsViaPrerenderer` → `prerenderModule`. IndexRunner's
   pre-warm call site passes `this.#jobPriority`.

Tests:
- `page-pool-standby-refill-test.ts`: held file render on affinity A + a
  module call on A produces a fresh tab synchronously, not queued on
  the busy tab. Bounded with a 2s timeout so a regression fails cleanly
  instead of hanging.
- `definition-lookup-test.ts`: priority forwards end-to-end; cached
  entry skips re-prerender even when priorities differ; default
  priority forwards as `undefined` (which the prerender server treats
  as 0). Non-card-module path: `definitions: {}` persists to the modules
  table and short-circuits subsequent lookups, confirming the no-card-
  marker behavior the pre-warm cost analysis depends on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from Copilot May 18, 2026 18:30
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1cf2626ea3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/realm-server/prerender/page-pool.ts Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a prerender/indexing deadlock where same-affinity module/command sub-renders can get queued behind a busy file render that is itself waiting on the sub-render (self-referential wait). It does this by ensuring a reserved module/command tab is actually materialized when available, and by propagating job priority through definition-lookup-triggered sub-prerenders so higher-priority work doesn’t silently downgrade.

Changes:

  • PagePool: add a synchronous standby-refill + retry-commandeer path for non-file callers under the per-affinity cap, to ensure the reserved module/command tab is actually created.
  • DefinitionLookup: introduce DefinitionLookupOptions { priority?: number } and thread it through to prerenderer.prerenderModule({ priority }).
  • Tests: add coverage for the standby-refill behavior and for priority plumbing + empty-definitions (“no-card marker”) caching.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
packages/runtime-common/index-runner.ts Passes the indexing job priority into module pre-warm definition lookups.
packages/runtime-common/definition-lookup.ts Adds DefinitionLookupOptions and forwards priority through module cache population/prerender.
packages/realm-server/prerender/page-pool.ts Adds a refill+retry path to materialize a standby for non-file callers under the affinity cap.
packages/realm-server/tests/page-pool-standby-refill-test.ts Adds a regression test intended to ensure a module call on a busy affinity gets a fresh tab.
packages/realm-server/tests/definition-lookup-test.ts Adds tests for priority forwarding and caching empty-definitions module rows.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/realm-server/prerender/page-pool.ts
Comment thread packages/realm-server/tests/page-pool-standby-refill-test.ts
The `current < desired` guard around the new synchronous refill in
#selectEntryForAffinity skipped the await when a refill was already in
flight. `#currentStandbyCount() = #standbys.size + #creatingStandbys`,
so when a file render just consumed the only standby and its post-
acquire kickStandbyRefill is mid-creation, creatingStandbys > 0
inflates current to match desired even though no real standby exists
yet. The follow-up commandeer then fails and the caller falls through
to the busy-tab branch — exactly the deadlock the change was meant to
prevent.

Drop the guard. #ensureStandbyPool deduplicates internally via
#ensuringStandbys, so an unconditional await is a no-op when there is
nothing to do and waits on the in-flight refill when there is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetry with the previous commit. `#selectEntryForAffinity`'s
brand-new-affinity branch (entryList.length === 0) had the same
`current < desired` guard around its `await #ensureStandbyPool()`.
Same race applies: `#currentStandbyCount` includes
`#creatingStandbys`, so a refill in flight on another affinity (or
this one's pre-acquire kick) can inflate `current` to match `desired`
before any real standby exists. The brand-new caller would then skip
the wait and fall through to cross-affinity steal rather than landing
on the freshly-warmed standby a peer just kicked off — defeating the
"prefer fresh standby over busy-tab queueing" intent documented in
the surrounding comment.

Drop the guard. Same reasoning as the non-file branch above: the
`#ensuringStandbys` dedup makes the unconditional await a no-op when
the pool is healthy and a wait when it isn't.

Also broadens the comment on the non-file branch to clarify the
no-op vs. wait behavior and note that the at-cap-without-expansion-
budget case still falls through to busy-tab — that residual deadlock
requires either operator capacity tuning or the high-priority tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

Preview deployments

Host Test Results

    1 files      1 suites   1h 48m 29s ⏱️
2 673 tests 2 658 ✅ 15 💤 0 ❌
2 692 runs  2 677 ✅ 15 💤 0 ❌

Results for commit de0548f.

Realm Server Test Results

    1 files  ±    0      1 suites  +1   8m 44s ⏱️ + 8m 44s
1 414 tests +1 414  1 414 ✅ +1 414  0 💤 ±0  0 ❌ ±0 
1 501 runs  +1 501  1 501 ✅ +1 501  0 💤 ±0  0 ❌ ±0 

Results for commit de0548f. ± Comparison against earlier commit 572f941.

habdelra and others added 5 commits May 18, 2026 15:41
…der); fix queues-same-realm regression

Two changes addressing review feedback on the earlier PR A commits.

1. Priority threading through the lookupDefinition path

The PR's previous priority plumbing covered `getModuleCacheEntry` (the
pre-warm path) but not `lookupDefinition`, which is the path that fires
sub-`prerenderModule` calls from `_federated-search` during a card
render — i.e. the actual deadlock site. This closes the gap.

Chain:
- `runtime-common/prerender-headers.ts`: new `X_BOXEL_JOB_PRIORITY_HEADER`
  constant + `sanitizeJobPriorityHeader` sanitizer. Same scale as
  worker-job priority (0/10 etc), non-negative safe integer.
- `realm-server/prerender/prerender-constants.ts`: re-exports as
  `PRERENDER_JOB_PRIORITY_HEADER` alongside `PRERENDER_JOB_ID_HEADER`.
- `remote-prerenderer.ts`: producer sends the header on every outbound
  prerender request (visit / module / command / screenshot) when the
  attributes include `priority`.
- `render-runner.ts`: prerenderVisitAttempt's CDP page.evaluate now
  injects `globalThis.__boxelJobPriority` alongside the existing
  `__boxelJobId`, scoped to the render.
- `host/app/services/store.ts`: `jobPriorityHeader()` reads
  `__boxelJobPriority` and stamps `X-Boxel-Job-Priority` on outbound
  `_federated-search` fetches alongside the existing
  `X-Boxel-Job-Id` / `X-Boxel-Consuming-Realm` plumbing.
- `realm-server/handlers/handle-search.ts`: extracts and sanitizes the
  header, threads `priority` into the search opts passed to
  `searchRealms`.
- `realm-index-query-engine.ts`: `Options` carries `priority`;
  `lookupDefinitionForOpts` and the `_addCardConditions` populate-
  query-fields paths thread it into `lookupDefinition(codeRef, opts)`.
- `definition-lookup.ts`: `DefinitionLookup.lookupDefinition` now
  accepts the existing `DefinitionLookupOptions` shape;
  `LookupContext` carries `priority`; `lookupDefinitionWithContext`
  threads it into the `loadModuleCacheEntry` args alongside the
  existing fields. The `RealmScopedDefinitionLookup` wrapper forwards
  opts to `lookupDefinitionForRealm`.

User / API callers don't stamp the header (they aren't inside a
prerender tab), so live traffic continues to fall back to priority 0
— same observable behavior as today. Indexer-driven card renders fire
their sub-prerenders at the originating job's priority, which is
useful for observability (timing_diagnostics.priority on errored rows)
and for the high-priority-tier expansion path when those env vars are
configured.

2. Revert brand-new-affinity branch's unconditional await

The earlier commit removed the `current < desired` gate on the
brand-new-affinity branch's `await #ensureStandbyPool()` in
`#selectEntryForAffinity`, mirroring the same fix applied to the new
non-file spawn-branch. Code-review feedback suggested it was symmetric
— but it isn't, and CI caught the regression:
`prerendering-test.ts > queues same-realm request when tab is
transitioning` now times out.

Root cause: the unconditional await yields one extra microtask even
when `current >= desired` already. In the brand-new-affinity case
that microtask shifts the ordering against a concurrent same-affinity
file caller arriving on an idle tab — the same caller bumps
`#queue.pendingCount` past 1 before the cross-affinity-steal scan,
the scan's `pendingCount > 1` filter rejects, and the brand-new
caller throws `'No standby page available for prerender'` instead of
queueing.

The gate is asymmetric on purpose:
- Non-file spawn-branch (entryList.length > 0, queue != 'file'):
  await unconditionally — the cost of skipping is unbounded (the
  caller queues on the very tab whose work blocks it).
- Brand-new-affinity branch (entryList.length === 0): keep the gate
  — the cost of skipping is bounded (a cross-affinity-steal hop,
  itself a designed fallback). Adding the extra microtask actively
  breaks an existing test that relies on the synchronous ordering.

Comment updated to explain the asymmetry; the test that caught this
is named in the comment for future reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`preWarmModulesTable` previously walked each invalidation's
`boxel_index.deps` and added any executable URLs it found. That misses
modules referenced by *string* in card templates — a typical pattern in
dashboard cards:

  <Search @query={{filter: {type: {module: '.../cohort.gts',
                                   name: 'Cohort'}}}}>

`cohort.gts` is referenced as a string parameter to the search filter,
never imported into the dashboard's runtime module graph. So it never
appears in the dashboard row's `deps`, never gets pre-warmed, and when
the dashboard renders during indexing the `_federated-search` call
fires a same-affinity `prerenderModule` for `cohort.gts` mid-card-
render. That sub-prerender queues behind the dashboard's tab — the
self-referential prerender deadlock.

Add a realm-wide sweep on top of the existing per-row deps walk. Source
is the filesystem-mtimes map: from-scratch already pays for the walk
via `discoverInvalidations`, so we just filter and reuse it; incremental
adds a fresh `reader.mtimes()` call before pre-warm (typical ~200 ms on
a 500-file realm, one call per job, completely dominated by the
indexing work that follows). Filter to `.gts` + `.gjs` only — cards can
only live in template-bearing modules, so `.ts`/`.js` helpers are
correctly excluded from the realm-wide layer. Helpers that ARE needed
by a card already get pre-warmed through the existing per-row deps
walk; the realm-wide layer is purely additive on top.

Adds `hasCardExtension` / `cardExtensions` next to
`hasExecutableExtension` / `executableExtensions` in
`runtime-common/index.ts`, same shape — `.gts` and `.gjs` are the
extensions a `CardDef` / `FieldDef` can live in because they're the
ones with a Glimmer template surface.

The modules cache treats `definitions: {}` (an empty result from a
non-card module — e.g. a rare `.gts` helper) as a valid cache hit at
`definition-lookup.ts:561-563`. So the worst case is one wasted
prerender per non-card `.gts` per `clearRealmCache` cycle; subsequent
lookups skip the prerender entirely. That contract is pinned by
`getModuleCacheEntry caches non-card modules as no-card markers` in
PR A.

Cost analysis for ambitious-piranha (the staging realm where the
deadlock fired): 9 `.gts` files + 9 `.ts` files = 18 executables, but
only 9 `.gts` files in the realm-wide sweep (the `.ts` files are
`bxl/*` helpers). Pre-warm is still serial (concurrency comes in a
follow-up PR after baseline measurement). Roughly ~200 ms per file
× 9 files = ~1.8 s additional pre-warm cost per from-scratch reindex.

Stacked on top of PR #4863 (PagePool tab-materialization + priority
plumbing). The two changes attack the same bug from different angles:
PR #4863 makes sure the deadlock cannot happen when a sub-prerender
DOES fire; this PR makes sure it almost never has to fire in the first
place. PR #4863 is the load-bearing fix; this is the preventive layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main renamed the modules-cache surface (clearRealmCache →
clearRealmDefinitions, ModuleCacheEntry → DefinitionCacheEntry,
getModuleCacheEntry → getCachedDefinitions, etc). Re-applied the
priority-threading additions (`DefinitionLookupOptions` arg,
`LookupContext.priority` field, plumb-through to
`loadDefinitionCacheEntry`) on top of the renamed surface.

Tests updated to call `getCachedDefinitions` rather than
`getModuleCacheEntry`. No behavioral changes from the rename — the
priority plumbing now uses the new method/type names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion change to PR #4863. That PR plumbed `x-boxel-job-priority`
through every layer of the prerender / search stack, but only the
producer-side prerender tab was stamping the header — user-facing
fetches (the host SPA in a real user's browser, external API callers)
sent no priority, so any sub-`prerenderModule` fired by
`_federated-search` for a `lookupDefinition` cache miss queued behind
concurrent background indexing at priority 0. User clicks that drive
search are user-initiated work and should outrank background fan-out.

Policy on the outbound `jobPriorityHeader()`:

1. Inside a prerender tab (`__boxelDuringPrerender === true`): forward
   `__boxelJobPriority` as-is, including 0. The render-runner sets the
   global from the worker job's priority; a 0 is meaningful (the
   originating job is system-initiated indexing) and must not be
   upgraded. Falls back to 0 if the global is absent.

2. Outside a prerender tab: stamp `userInitiatedPriority` (10) unless
   the caller has explicitly set `globalThis.__boxelJobPriority` to
   something else.

Why gate on `__boxelDuringPrerender` rather than the presence of
`__boxelJobPriority`: a prerender tab whose render-runner build
predates the priority global, or a test fixture that didn't set it,
should still be treated as prerender-context (default 0), not as
live user traffic (default 10).

Stacked on top of #4863 — that PR adds `X_BOXEL_JOB_PRIORITY_HEADER`
and the `userInitiatedPriority` constant import is already wired
through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
habdelra and others added 3 commits May 18, 2026 15:59
Drops the `hasCardExtension` helper added in the previous commit.
Copilot's review caught a real case I'd missed:
`packages/catalog-realm/commands/collect-submission-files.ts` defines
`class CollectSubmissionFilesInput extends CardDef` — a card-defining
`.ts` file, no Glimmer template. The `.gts`/`.gjs` extension filter
would have skipped it on every realm-wide pre-warm sweep, and any
`_federated-search` filter referencing it as a `type:` would have
fired a same-affinity `prerenderModule` mid-card-render — the exact
deadlock pattern this PR aims to prevent.

Fix: realm-wide sweep now uses the existing `hasExecutableExtension`
(`.gts` / `.gjs` / `.ts` / `.js`, excluding `.d.ts`). Non-card modules
pre-warmed by the sweep persist to the modules table as empty-
definitions rows (no-card markers — `definition-lookup.ts:561-563`),
so subsequent `getCachedDefinitions` / `lookupDefinition` calls
short-circuit at the cache without re-firing the prerenderer. That
contract is pinned by the `getCachedDefinitions caches non-card
modules as no-card markers` test in PR #4863.

Variable renamed `allRealmCardModules` → `allRealmExecutables` and
comments updated. The `hasCardExtension` helper + `cardExtensions`
constant are removed entirely.

Cost analysis update: ambitious-piranha goes from 9 candidates (just
the `.gts` files) to 18 (9 `.gts` + 9 `.ts` helpers in `bxl/`). At
~200 ms per prerender × 18 modules with serial pre-warm = ~3.6 s
additional cost on top of the from-scratch reindex. After the first
sweep, the modules table caches the helper rows; subsequent reindexes
only pay for modules that actually have card defs. Bounded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues from the Copilot review + Hassan's comment on PR #4866.

1. (Copilot) Reword doc comment to not imply external HTTP callers
   can affect this global. The helper only runs inside the host SPA's
   JS runtime. External HTTP callers — anything not in the host SPA —
   bypass the helper entirely and set `X-Boxel-Job-Priority` directly
   on their request if they care.

2. (Copilot) Tighten gating from truthy `if (duringPrerender)` to
   `if (duringPrerender === true)`. The global is typed as a boolean;
   a stray truthy string from a future code path shouldn't silently
   flip the policy from "user priority" to "preserve 0" — a real
   user-facing fetch would then queue behind background indexing.
   Strict-equality is the safer default.

3. (Hassan) Pin the policy with unit tests. Refactored the inline
   policy logic into an exported pure helper
   `resolveOutboundJobPriority({duringPrerender, jobPriority})` and
   wrote `tests/unit/job-priority-header-test.ts` covering:

   Outside prerender (user / API caller):
     - No globals → returns `userInitiatedPriority`
     - `__boxelDuringPrerender = false` → `userInitiatedPriority`
     - Explicit `__boxelJobPriority` override is honored (3 → 3,
       0 → 0; 0 is NOT coerced to user priority)
     - Truthy but non-`true` `__boxelDuringPrerender` (`'yes'`, `1`)
       does NOT trigger forward-as-0 — guards against the issue
       above

   Inside prerender:
     - Explicit 10 → 10
     - Explicit 0 → 0 (must not upgrade; preserves system-initiated)
     - Missing `__boxelJobPriority` → 0 (safe default for prerender-
       context work)
     - Malformed values (-1, 1.5, '10') fall through to the active
       branch's default

`jobPriorityHeader` (the wrapper that reads the globals) is now a
thin shim that calls `resolveOutboundJobPriority` with the actual
globals. No behavioral change at the call sites.

Note: I considered tightening `duringPrerenderHeaders()` similarly
per Copilot's "and similarly in related helpers" suggestion, but
left it alone — changing the existing `X-Boxel-During-Prerender`
gating is broader scope than this PR and would change observable
behavior at the realm-server's cache-only-definitions gate. Worth
a separate cleanup if we want strict-boolean across all the
related helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the previous "broaden to all executables" commit. The case
that broadening fixed — `.ts` files that host CardDef (rare; the only
example in the repo today is command-input cards) — is handled
correctly by the on-demand `lookupDefinition` read-through during the
visit. The PagePool's new tab-materialization for module/command
callers makes that on-demand path safe (the sub-prerender gets its
own tab instead of queueing behind the render that triggered it), so
the cost trade favors the narrow sweep:

- Realm-wide pre-warm of `.gts` / `.gjs` only: ~9 prerenders × ~200 ms
  on a typical card-heavy realm.
- Broader sweep including `.ts` / `.js` helpers: ~18+ prerenders, most
  of them producing no-card-marker rows for files that almost never
  define a card.

This is an optimization on every reindex, not a correctness change.
Comments updated to call out the trade-off honestly: `.ts` / `.js`
CAN host card defs, the narrow filter exists because the on-demand
path is now safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team May 18, 2026 20:51
habdelra and others added 3 commits May 18, 2026 17:33
The new x-boxel-job-priority request header needs to be listed in
Access-Control-Allow-Headers so the browser's CORS preflight passes
when prerender contexts on origin :4200 fetch :4201's _federated-search
endpoint. Without this, the OPTIONS preflight succeeds (no header
restriction) but the follow-up QUERY is blocked by Chrome with
"Request header field x-boxel-job-priority is not allowed by
Access-Control-Allow-Headers in preflight response", so every search
fired by host code in a prerendering context silently fails. The
priority routing feature this PR introduces depends on those searches
landing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rity' into prerender-deadlock-user-api-priority
@backspace
Copy link
Copy Markdown
Contributor

The test plan suggests this:

  • Staging deploy + _full-reindex of ctse/prudent-octopus lands with zero errored rows
  • Staging deploy + _full-reindex of ctse/ambitious-piranha lands with zero errored rows

but it doesn’t look like that happened?

@habdelra
Copy link
Copy Markdown
Contributor Author

habdelra commented May 19, 2026

The test plan suggests this:

  • Staging deploy + _full-reindex of ctse/prudent-octopus lands with zero errored rows
  • Staging deploy + _full-reindex of ctse/ambitious-piranha lands with zero errored rows

but it doesn’t look like that happened?

with all the merges that happen (especially the day before sprint demo), i'm unsure I could get thru a full reindex without an independent merge reverting my deploy. i think its easier to merge, test, and then adjust. our workers are setup such that if indexing gets interrupted by a deploy, it should pick up where it left off after the deploy is finished

habdelra added 2 commits May 19, 2026 09:06
…gepool-priority

# Conflicts:
#	packages/host/app/services/store.ts
#	packages/realm-server/server.ts
…rity' into prerender-deadlock-user-api-priority

# Conflicts:
#	packages/host/app/services/store.ts
…riority

Host: stamp user priority on outbound _federated-search by default
…ealm-modules

Indexer: pre-warm every realm .gts/.gjs module, not just per-row deps
@habdelra habdelra merged commit 4bd2d4f into main May 19, 2026
74 of 75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants