Skip to content

fix: abort waitFor server_interactive tests promptly on file worker fatal error#13710

Open
nomeata wants to merge 3 commits into
masterfrom
joachim/ipc-fatal-error-abort
Open

fix: abort waitFor server_interactive tests promptly on file worker fatal error#13710
nomeata wants to merge 3 commits into
masterfrom
joachim/ipc-fatal-error-abort

Conversation

@nomeata
Copy link
Copy Markdown
Collaborator

@nomeata nomeata commented May 11, 2026

This PR makes the test-only waitForMessage helper abort promptly
when the Lean language server reports a fatalError, instead of
blocking until the outer test framework's timeout kills the process.

waitForMessage is used to implement the waitFor directive in
tests/server_interactive tests: it reads server messages in a loop
and returns when it sees a textDocument/publishDiagnostics
notification containing a given message. It has no pending request,
so the watchdog's responseError (which already terminates
collectDiagnostics and waitForILeans on a fatalError) doesn't
help here, and it would silently discard $/lean/fileProgress
notifications. When the file worker crashes or its header processing
fails fatally (e.g. an unresolved import), the awaited diagnostic
will never be emitted -- so waitForMessage would block indefinitely
and report nothing useful, which made several CI failures of
server_interactive tests look like generic 1500s timeouts. With this
change, waitForMessage also reacts to $/lean/fileProgress with
fatalError kind and throws a descriptive error.

The detection is scoped to waitForMessage only, not the underlying
Ipc.readMessage, so that other IPC paths (notably
importCompletion.lean, which intentionally elaborates a file with
an incomplete import line and observes the resulting fatalError
fileProgress while still using the alive worker for completion
requests) keep working unchanged. As part of this PR, waitForMessage
has been moved out of Lean.Lsp.Ipc and into
Lean.Server.Test.Runner next to its sole caller processWaitFor,
since it is a test-driver helper rather than a general-purpose IPC
primitive (it discards all unrelated messages, which would be unsafe
outside of a test driver).

To exercise the new behavior, tests/server_interactive/run_test.sh
now sources <file>.init.sh and uses check_exit_is "${TEST_EXIT:-0}",
matching the convention in tests/compile/. The new
worker_crash.lean test forces the file worker to die via
IO.Process.forceExit 9 from inside an elab command and then
issues a waitFor directive that can never be satisfied;
TEST_EXIT=1 and .out.expected capture the abort path. Without
this PR the test would sit at the framework timeout; with this PR it
returns in well under a second.

This PR makes `Lean.Lsp.Ipc.readMessage` throw when the watchdog
reports its file worker has crashed (`$/lean/fileProgress` with
`fatalError` kind), so that test-runner read loops with no pending
request -- in particular `waitForMessage` -- don't sit forever waiting
for a message a dead worker will never produce.

`collectDiagnostics` and `waitForILeans` already exit on crashes
because the watchdog also sends a `responseError` to their pending
`waitForDiagnostics` / `waitForILeans` requests. `waitForMessage`,
which doesn't issue a request, was silently discarding the `fatalError`
notification and blocking until the test framework's outer timeout
killed the whole process.

With this change, a forced-crash test (`IO.Process.forceExit 9` from
inside an `elab` command, then a `--^ waitFor: "..."` directive)
returns in ~1 second with `uncaught exception: Lean file worker
reported a fatal error (likely crashed)` instead of timing out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nomeata nomeata requested a review from mhuisi as a code owner May 11, 2026 19:14
@nomeata nomeata added the changelog-language Language features and metaprograms label May 11, 2026
@nomeata nomeata marked this pull request as draft May 11, 2026 19:44
This PR follows up on the previous IPC fatalError-abort fix to address
a false-positive trigger and adds a regression test.

The previous version checked `$/lean/fileProgress` with `fatalError`
kind inside `Ipc.readMessage`, which fires for *every* IPC consumer.
That breaks tests like `importCompletion.lean`, which intentionally
elaborates a file with an incomplete `import` line; the worker stays
alive but reports `fileProgress fatalError` because header processing
failed, and the test then uses the still-alive worker to obtain
completion items. The blanket abort prevented the completions from
ever being read.

Move the check into `waitForMessage` itself, where it actually
matters: that loop has no pending request, so the watchdog's
`responseError` (which `collectDiagnostics` / `waitForILeans` already
handle) doesn't help, and without this check it would block until the
outer test timeout. All other IPC paths are unaffected.

Also wire up `TEST_EXIT` in `tests/server_interactive/run_test.sh` so
individual tests can declare expected exit codes via
`<file>.init.sh`, matching the convention in `tests/compile/`. The
new `worker_crash` test exercises the abort path:
`IO.Process.forceExit` from inside an `elab` command, a `waitFor`
directive on the line after, `TEST_EXIT=1`, and an `.out.expected`
matching the `uncaught exception: waitForMessage: ...` output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nomeata nomeata marked this pull request as ready for review May 11, 2026 19:52
@github-actions github-actions Bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label May 11, 2026
@mathlib-lean-pr-testing
Copy link
Copy Markdown

mathlib-lean-pr-testing Bot commented May 11, 2026

Mathlib CI status (docs):

  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase f52a18a9472f40a864a78d3c8f8cf202751f576b --onto 2229b077d6ebd2ff42d665fe1c32501df8915dff. You can force Mathlib CI using the force-mathlib-ci label. (2026-05-11 20:12:10)
  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase f52a18a9472f40a864a78d3c8f8cf202751f576b --onto c04a83a2e5d5718032819f2d124f2c949fab292f. You can force Mathlib CI using the force-mathlib-ci label. (2026-05-12 14:00:58)

@leanprover-bot
Copy link
Copy Markdown
Collaborator

leanprover-bot commented May 11, 2026

Reference manual CI status:

  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase f52a18a9472f40a864a78d3c8f8cf202751f576b --onto a71f158f7bd96ff9ea846f7ff4cc658de3c8b0f9. You can force reference manual CI using the force-manual-ci label. (2026-05-11 20:12:12)
  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase f52a18a9472f40a864a78d3c8f8cf202751f576b --onto c04a83a2e5d5718032819f2d124f2c949fab292f. You can force reference manual CI using the force-manual-ci label. (2026-05-12 14:01:00)

stepbrobd pushed a commit to stepbrobd/lean4 that referenced this pull request May 12, 2026
This PR deletes two tests that sometimes timeout (or crash, unclear
without leanprover#13710) and I was not able to fix it by EOD.
@nomeata nomeata closed this May 12, 2026
@nomeata nomeata reopened this May 12, 2026
@nomeata nomeata marked this pull request as draft May 12, 2026 13:33
This PR moves `waitForMessage` out of `Lean.Lsp.Ipc` and into
`Lean.Server.Test.Runner`, its only caller. The function exists to
implement the `waitFor` test directive and is not a general-purpose
IPC primitive (it discards all received messages other than the one
it's waiting for, which is an actively unsafe pattern outside of a
test driver). Per Sebastian's review of the prior commits, keeping it
inside `Ipc` was the wrong layer.

No behavior change: the fatalError-abort check added in the previous
commit moves along with the function. The new home is right next to
`processWaitFor`, which is the sole caller.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nomeata nomeata marked this pull request as ready for review May 12, 2026 14:52
nomeata added a commit that referenced this pull request May 12, 2026
This PR restores the `cancellation_empty_by.lean` and
`cancellation_par.lean` server-interactive tests that were deleted in
#13711 ("delete flaky tests for now"). With #13710 in place, a hung
worker now surfaces as a prompt `waitForMessage` abort rather than a
1500s CTest timeout, so the tests are tractable to keep enabled.

Trim diagnostic tracing in `cancellation_empty_by.lean` to the
minimum that's actually causally ordered against the test's
synchronisation points:

* `test: imports done` -- synchronous `#eval` at the top of the file,
  fires once before any async task is spawned.
* `tracerSuggestion ready` -- gated to the first `tracerSuggestion`
  invocation via `mkTestTask`, fires exactly once.
* `cancelTokenSet` -- inside the `cancelTk.onSet` callback, fires
  exactly once when `cancelRec` reaches the snapshot.
* `sync received` (x2) -- in `t1`'s body after `wait_for_sync`
  returns, once per elaboration.

Removed traces that were either every-invocation (and therefore raced
non-deterministically with the snapshot task's other output) or that
fired at command-elab boundaries where their position relative to
async stderr buffers was unstable: `tracerSuggestion: entered`,
`tracerSuggestion: returning candidate`, `t1: body entered`,
`test: before/after empty-by example`, `test: file end`.

Also drop the `LEAN_DEBUG_MAX_WORKERS` runtime hack added during
investigation: it served its purpose (confirming the
wait-pressure-driven worker pool ratchet in `task_manager::wait_for`)
but is not appropriate as a long-lived debug knob in the runtime.

Stress: 24/24 parallel runs pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nomeata nomeata changed the title fix: abort IPC test runner promptly on file worker fatal error fix: abort waitFor server_interactive tests promptly on file worker fatal error May 12, 2026
@nomeata nomeata requested a review from Kha May 12, 2026 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog-language Language features and metaprograms toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants