fix: abort waitFor server_interactive tests promptly on file worker fatal error#13710
Open
nomeata wants to merge 3 commits into
Open
fix: abort waitFor server_interactive tests promptly on file worker fatal error#13710nomeata wants to merge 3 commits into
waitFor server_interactive tests promptly on file worker fatal error#13710nomeata wants to merge 3 commits into
Conversation
This PR makes `Lean.Lsp.Ipc.readMessage` throw when the watchdog reports its file worker has crashed (`$/lean/fileProgress` with `fatalError` kind), so that test-runner read loops with no pending request -- in particular `waitForMessage` -- don't sit forever waiting for a message a dead worker will never produce. `collectDiagnostics` and `waitForILeans` already exit on crashes because the watchdog also sends a `responseError` to their pending `waitForDiagnostics` / `waitForILeans` requests. `waitForMessage`, which doesn't issue a request, was silently discarding the `fatalError` notification and blocking until the test framework's outer timeout killed the whole process. With this change, a forced-crash test (`IO.Process.forceExit 9` from inside an `elab` command, then a `--^ waitFor: "..."` directive) returns in ~1 second with `uncaught exception: Lean file worker reported a fatal error (likely crashed)` instead of timing out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR follows up on the previous IPC fatalError-abort fix to address a false-positive trigger and adds a regression test. The previous version checked `$/lean/fileProgress` with `fatalError` kind inside `Ipc.readMessage`, which fires for *every* IPC consumer. That breaks tests like `importCompletion.lean`, which intentionally elaborates a file with an incomplete `import` line; the worker stays alive but reports `fileProgress fatalError` because header processing failed, and the test then uses the still-alive worker to obtain completion items. The blanket abort prevented the completions from ever being read. Move the check into `waitForMessage` itself, where it actually matters: that loop has no pending request, so the watchdog's `responseError` (which `collectDiagnostics` / `waitForILeans` already handle) doesn't help, and without this check it would block until the outer test timeout. All other IPC paths are unaffected. Also wire up `TEST_EXIT` in `tests/server_interactive/run_test.sh` so individual tests can declare expected exit codes via `<file>.init.sh`, matching the convention in `tests/compile/`. The new `worker_crash` test exercises the abort path: `IO.Process.forceExit` from inside an `elab` command, a `waitFor` directive on the line after, `TEST_EXIT=1`, and an `.out.expected` matching the `uncaught exception: waitForMessage: ...` output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Mathlib CI status (docs):
|
Collaborator
|
Reference manual CI status:
|
stepbrobd
pushed a commit
to stepbrobd/lean4
that referenced
this pull request
May 12, 2026
This PR deletes two tests that sometimes timeout (or crash, unclear without leanprover#13710) and I was not able to fix it by EOD.
This PR moves `waitForMessage` out of `Lean.Lsp.Ipc` and into `Lean.Server.Test.Runner`, its only caller. The function exists to implement the `waitFor` test directive and is not a general-purpose IPC primitive (it discards all received messages other than the one it's waiting for, which is an actively unsafe pattern outside of a test driver). Per Sebastian's review of the prior commits, keeping it inside `Ipc` was the wrong layer. No behavior change: the fatalError-abort check added in the previous commit moves along with the function. The new home is right next to `processWaitFor`, which is the sole caller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nomeata
added a commit
that referenced
this pull request
May 12, 2026
This PR restores the `cancellation_empty_by.lean` and `cancellation_par.lean` server-interactive tests that were deleted in #13711 ("delete flaky tests for now"). With #13710 in place, a hung worker now surfaces as a prompt `waitForMessage` abort rather than a 1500s CTest timeout, so the tests are tractable to keep enabled. Trim diagnostic tracing in `cancellation_empty_by.lean` to the minimum that's actually causally ordered against the test's synchronisation points: * `test: imports done` -- synchronous `#eval` at the top of the file, fires once before any async task is spawned. * `tracerSuggestion ready` -- gated to the first `tracerSuggestion` invocation via `mkTestTask`, fires exactly once. * `cancelTokenSet` -- inside the `cancelTk.onSet` callback, fires exactly once when `cancelRec` reaches the snapshot. * `sync received` (x2) -- in `t1`'s body after `wait_for_sync` returns, once per elaboration. Removed traces that were either every-invocation (and therefore raced non-deterministically with the snapshot task's other output) or that fired at command-elab boundaries where their position relative to async stderr buffers was unstable: `tracerSuggestion: entered`, `tracerSuggestion: returning candidate`, `t1: body entered`, `test: before/after empty-by example`, `test: file end`. Also drop the `LEAN_DEBUG_MAX_WORKERS` runtime hack added during investigation: it served its purpose (confirming the wait-pressure-driven worker pool ratchet in `task_manager::wait_for`) but is not appropriate as a long-lived debug knob in the runtime. Stress: 24/24 parallel runs pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
waitFor server_interactive tests promptly on file worker fatal error
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR makes the test-only
waitForMessagehelper abort promptlywhen the Lean language server reports a fatalError, instead of
blocking until the outer test framework's timeout kills the process.
waitForMessageis used to implement thewaitFordirective intests/server_interactivetests: it reads server messages in a loopand returns when it sees a
textDocument/publishDiagnosticsnotification containing a given message. It has no pending request,
so the watchdog's
responseError(which already terminatescollectDiagnosticsandwaitForILeanson a fatalError) doesn'thelp here, and it would silently discard
$/lean/fileProgressnotifications. When the file worker crashes or its header processing
fails fatally (e.g. an unresolved import), the awaited diagnostic
will never be emitted -- so
waitForMessagewould block indefinitelyand report nothing useful, which made several CI failures of
server_interactive tests look like generic 1500s timeouts. With this
change,
waitForMessagealso reacts to$/lean/fileProgresswithfatalErrorkind and throws a descriptive error.The detection is scoped to
waitForMessageonly, not the underlyingIpc.readMessage, so that other IPC paths (notablyimportCompletion.lean, which intentionally elaborates a file withan incomplete
importline and observes the resulting fatalErrorfileProgress while still using the alive worker for completion
requests) keep working unchanged. As part of this PR,
waitForMessagehas been moved out of
Lean.Lsp.Ipcand intoLean.Server.Test.Runnernext to its sole callerprocessWaitFor,since it is a test-driver helper rather than a general-purpose IPC
primitive (it discards all unrelated messages, which would be unsafe
outside of a test driver).
To exercise the new behavior,
tests/server_interactive/run_test.shnow sources
<file>.init.shand usescheck_exit_is "${TEST_EXIT:-0}",matching the convention in
tests/compile/. The newworker_crash.leantest forces the file worker to die viaIO.Process.forceExit 9from inside anelabcommand and thenissues a
waitFordirective that can never be satisfied;TEST_EXIT=1and.out.expectedcapture the abort path. Withoutthis PR the test would sit at the framework timeout; with this PR it
returns in well under a second.