Flaky: Windows E2E (Turbopack stale resolver cache) #1910

@pranaygp

Summary

The test packages/core/e2e/dev.test.ts > dev e2e > should rebuild on imported step dependency change fails intermittently on Windows runners with Turbopack. The other E2E suites (Local Dev, Local Postgres, Local Prod, Vercel Prod) and the Linux unit suite all pass; only the Windows shard hits this.

Symptom

The test rewrites a workflow's imported step file, then polls the manifest for ~50s waiting for the new step name to appear. Every poll attempt triggers a workflow run; on the failing runs that trigger gets HTTP 500:

Failed to trigger workflow "importedStepOnlyWorkflow": 500
Error: Could not parse module '[project]/packages/core/dist/runtime/start.js', file not found

After the timeout vitest reports:

× should rebuild on imported step dependency change  ~50s
FAIL  packages/core/e2e/dev.test.ts > dev e2e > should rebuild on imported step dependency change
Error: Timed out after 50000ms waiting for manifest.json to include imported step hot-reload marker.
Last error: Failed to trigger workflow "importedStepOnlyWorkflow": 500

Root cause (already documented in the test)

The test source at packages/core/e2e/dev.test.ts:270-279 already calls this out:

Turbopack on Windows occasionally caches a stale resolver failure (e.g. Could not parse module '@workflow/core/dist/runtime/start.js') after an HMR cascade and returns 500 to every request until something invalidates its cache. Rewriting the api file is enough to force a fresh resolve on the next request, so we treat the 500 as transient and keep polling instead of bailing out.

That workaround was added in #1895 (2f52d14f3). It does help, and most runs eventually recover, but on the failing runs the cache is not invalidated within the 50s polling window even with the api-file rewrite.
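For readers who have not opened the test, the polling behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in dev.test.ts: the names pollUntilRebuilt, trigger, isDone and onTransient are hypothetical, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```typescript
// Hypothetical sketch of "treat 500 as transient and keep polling".
// None of these names come from the repository.
type TriggerResult = { status: number };

interface PollOptions {
  timeoutMs: number;
  intervalMs: number;
  trigger: () => Promise<TriggerResult>; // e.g. request that starts the workflow
  isDone: () => Promise<boolean>;        // e.g. manifest.json contains the marker
  onTransient?: () => Promise<void>;     // e.g. rewrite the api file to bust the cache
  now?: () => number;                    // injectable clock (defaults to Date.now)
  sleep?: (ms: number) => Promise<void>; // injectable delay
}

async function pollUntilRebuilt(opts: PollOptions): Promise<void> {
  const now = opts.now ?? Date.now;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const deadline = now() + opts.timeoutMs;
  let lastError = "none";

  while (now() < deadline) {
    const res = await opts.trigger();
    if (res.status === 500) {
      // Assumed transient: record it, try to invalidate the cache, keep polling.
      lastError = "trigger returned 500";
      await opts.onTransient?.();
    } else if (res.status >= 400) {
      // Any other error status is treated as a real failure.
      throw new Error(`trigger failed with ${res.status}`);
    } else if (await opts.isDone()) {
      return;
    }
    await sleep(opts.intervalMs);
  }
  throw new Error(`Timed out after ${opts.timeoutMs}ms. Last error: ${lastError}`);
}
```

The failing runs correspond to the case where every iteration takes the 500 branch until the deadline passes, so the loop never reaches the manifest check.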

Pre-existing, not caused by any specific PR

Run          Branch                      Commit    Windows  Same failure?
25302018883  main                        1203dae7  fail
25300558041  main                        059821cb  fail
25302467764  pgp/serialize-abort-signal  6ec2b957  fail

So this flake blocks PR check status for any PR whose Windows shard happens to hit it, and it also fails on main itself.

Suggestions

  1. Bump the cache-busting heuristic. The test currently rewrites the api file once per failed poll; could try also touching the workflows file, hitting /api/_next/clear-cache if available, or restarting the dev server after N consecutive 500s.
  2. Increase the timeout window specifically on Windows runners (e.g. if (process.platform === 'win32') timeoutMs = 120_000), at the cost of slower CI feedback when the failure is real.
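  3. Skip the test on Windows with test.skipIf(process.platform === 'win32') or the equivalent test.runIf(process.platform !== 'win32') so it stops blocking PR check status, and track a real fix separately. Note that test.fails is not a good fit for an intermittent failure, because it would report a failure on every run where the test passes.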
  4. Investigate Turbopack's resolver-cache invalidation. The underlying issue is that Turbopack on Windows holds a stale "module not found" entry past the point where the file is actually present. This may have a fix upstream.

In the short term option 3 unblocks PRs; longer term option 4 is the right fix.
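If options 1 and 2 are tried before skipping the test, they could be combined along these lines. This is only a sketch under stated assumptions: the function names, the action labels, and the threshold of 5 consecutive 500s are all hypothetical, and the 50s and 120s values are simply the current window and the example figure from option 2.

```typescript
// Hypothetical helpers; names and thresholds are illustrative, not from the repo.
type Recovery = "rewrite-api-file" | "touch-workflows-file" | "restart-dev-server";

// Option 1: escalate the cache-busting steps as consecutive 500s accumulate.
function recoveryActions(consecutive500s: number, restartAfter = 5): Recovery[] {
  if (consecutive500s <= 0) return [];
  if (consecutive500s >= restartAfter) return ["restart-dev-server"];
  return ["rewrite-api-file", "touch-workflows-file"];
}

// Option 2: widen the polling window on Windows only.
// Callers would pass process.platform; it is a parameter here so the
// helper stays independent of the Node.js typings.
function pollTimeoutMs(platform: string): number {
  return platform === "win32" ? 120_000 : 50_000;
}
```

The poll loop would call recoveryActions with its running count of consecutive 500s and reset that count on any non-500 response. Restarting the dev server is the most expensive step, which is why the sketch reserves it for the case where lighter invalidation has already failed several times.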

How to reproduce

# In a Windows runner / VM
cd packages/core
pnpm vitest run e2e/dev.test.ts -t "should rebuild on imported step dependency change"

Re-run the suite a few times; expect intermittent failures with the 500 / stale-resolver error.
