Skip to content

test: avoid deadlock in cancellation_empty_by#13707

Draft
nomeata wants to merge 15 commits into
masterfrom
joachim/empty-by-test2
Draft

test: avoid deadlock in cancellation_empty_by#13707
nomeata wants to merge 15 commits into
masterfrom
joachim/empty-by-test2

Conversation

@nomeata
Copy link
Copy Markdown
Collaborator

@nomeata nomeata commented May 11, 2026

This PR further attempts to avoid deadlocks in the cancellation_empty_by test.

nomeata and others added 3 commits May 11, 2026 13:34
Drop the `dbg_trace "sync received"` marker in `t1`'s body (and the
two corresponding lines from `.out.expected`).

In the first elaboration, `tracerSuggestion ready` and `sync received`
are causally ordered: `tracerSuggestion` must resolve the sync promise
before t1's `wait_for_sync` can return. But in the second elaboration
(after the runner's insert), the sync promise is already resolved, so
t1's `wait_for_sync` returns immediately -- and t1's body (running on
its async theorem-body task) races against the empty-by snapshot
task's tracerSuggestion (running on its own snapshot task). Both
write to stderr with no causal dependency, so the order of
"tracerSuggestion ready" vs. "sync received" from the second pass is
not deterministic under CPU load.

Stress test (6 parallel instances on a 96-core box, sustained CPU
saturation) reproduced this ordering swap at ~3% rate on the prior
version; passes 300/300 with the dropped marker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move `dbg_trace "tracerSuggestion ready"` inside the `if let some prom`
block so it only fires on the first invocation. On re-elaboration
`mkTestTask` returns `none` (the label is already registered) and
there is no point printing the "ready" marker again -- only the
synchronization side effect (`resolveSyncPromise`, idempotent) is
relevant.

This eliminates the racing pair from the second elaboration that
the previous "harden against scheduling races" commit worked
around by deleting `dbg_trace "sync received"`. With the gate in
place, `sync received` is now stable to restore as a marker that
`t1`'s body actually ran.

Also add an `else` branch on the cancel-token check that prints a
diagnostic message: if the test is ever wired up so that
`tracerSuggestion` sees no cancel token, the promise would be
silently dropped and `wait_for_test_task` would report "task
dropped without resolution" with no breadcrumb to the actual cause.

Note: the second elaboration's `sync received` now competes with the
`cancelTokenSet` print from the first elaboration's `onSet` callback
(fired during `cancelRec`). Their order in `.out.expected` reflects
the empirical ordering caused by snapshot-task stderr capture: the
snapshot task buffers `tracerSuggestion`'s `dbg_trace` via
`withIsolatedStreams` and only flushes after the snapshot completes,
while the `onSet` callback runs outside that capture and reaches
stderr immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI has reported `Timeout 1500s` failures for this test with no captured
stderr output at all (not even the `dbg_trace` markers). The failure
has not been reproduced locally under heavy contention (LEAN_NUM_THREADS=1,
CPU-pinned, 8-way parallel, hundreds of runs), so the root cause is
still unknown.

Two changes to help triage future failures:

1. Add `source_init "$1"` to `tests/server_interactive/run_test.sh` so
   individual tests can carry an `.init.sh` (matches the convention in
   `tests/elab/run_test.sh` and friends). This is infrastructure used by
   the second change.

2. Add `cancellation_empty_by.lean.init.sh` setting `LEAN_NUM_THREADS=2`.
   Per the runtime in `src/runtime/object.cpp:wait_for`, `IO.wait` will
   spawn additional workers when called from a pool thread, so thread
   starvation isn't strictly possible -- but lower thread count shrinks
   the surface for unobserved scheduling weirdness. If CI starts passing,
   the cause is in the parallel scheduling. If CI keeps timing out, the
   cause is elsewhere.

3. Add two `#eval (IO.eprintln ...)` checkpoint markers in the test
   file: `test: imports done` after the imports, and `test: empty-by
   command returned` after the empty-by example. These are synchronous
   command-level prints, independent of the snapshot task and t1's async
   body, so they tell us "did the file worker process this command at
   all?" even if the dynamic parts hang.

Document the observed-but-non-causal output ordering in the test
docstring (`cancelTokenSet` appears before `tracerSuggestion ready`
in the captured log; likely an interaction between buffered stderr
writes from the snapshot worker thread and direct writes from the
thread calling `cancelTk.set`). The order is stable across 300+ local
runs and we encode it as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nomeata nomeata added the merge-ci Enable merge queue CI checks for PR. In particular, produce artifacts for all major platforms. label May 11, 2026
@github-actions github-actions Bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label May 11, 2026
@mathlib-lean-pr-testing
Copy link
Copy Markdown

mathlib-lean-pr-testing Bot commented May 11, 2026

Mathlib CI status (docs):

  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase 48ad8401cd5624c944890893a38057ae4920e8ef --onto a71f158f7bd96ff9ea846f7ff4cc658de3c8b0f9. You can force Mathlib CI using the force-mathlib-ci label. (2026-05-11 15:54:02)
  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase f52a18a9472f40a864a78d3c8f8cf202751f576b --onto 2229b077d6ebd2ff42d665fe1c32501df8915dff. You can force Mathlib CI using the force-mathlib-ci label. (2026-05-11 20:34:12)

@leanprover-bot
Copy link
Copy Markdown
Collaborator

leanprover-bot commented May 11, 2026

Reference manual CI status:

  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase 48ad8401cd5624c944890893a38057ae4920e8ef --onto a71f158f7bd96ff9ea846f7ff4cc658de3c8b0f9. You can force reference manual CI using the force-manual-ci label. (2026-05-11 15:54:04)
  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase f52a18a9472f40a864a78d3c8f8cf202751f576b --onto a71f158f7bd96ff9ea846f7ff4cc658de3c8b0f9. You can force reference manual CI using the force-manual-ci label. (2026-05-11 20:34:14)

nomeata and others added 11 commits May 11, 2026 19:13
This PR adds an option `tactic.try.onlyUserSuggestions` (default `false`).
When set, `try?` skips all its built-in suggestion branches (`simple`,
`simp`, `grind`, `simp_all`, `fun_induction`/`induction`, and `exact?`)
and only runs the tactics produced by user-registered
`@[try_suggestion]` generators.

The motivation is testing: the cancellation/empty-by snapshot tests
only want to exercise the `try?` machinery on a specific tracer
suggestion, and don't want to pay the library-search cost. With the
option set, the test path is dramatically faster (`3.83s → 1.50s`
locally) and uses much less memory, making it practical to stress.

The user-tactic dispatch in the new path uses `first | ...`, matching
the convention for trying tactics in order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR makes `Lean.Lsp.Ipc.readMessage` throw when the watchdog
reports its file worker has crashed (`$/lean/fileProgress` with
`fatalError` kind), so that test-runner read loops with no pending
request -- in particular `waitForMessage` -- don't sit forever waiting
for a message a dead worker will never produce.

`collectDiagnostics` and `waitForILeans` already exit on crashes
because the watchdog also sends a `responseError` to their pending
`waitForDiagnostics` / `waitForILeans` requests. `waitForMessage`,
which doesn't issue a request, was silently discarding the `fatalError`
notification and blocking until the test framework's outer timeout
killed the whole process.

With this change, a forced-crash test (`IO.Process.forceExit 9` from
inside an `elab` command, then a `--^ waitFor: "..."` directive)
returns in ~1 second with `uncaught exception: Lean file worker
reported a fatal error (likely crashed)` instead of timing out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nostic)

Adds a debug-only env-driven guard in `task_manager::wait_for`: if
`LEAN_DEBUG_MAX_WORKERS=N` is set and `m_std_workers.size()` exceeds
`N`, the process writes a one-line diagnostic to fd 2 (pid, pool
counts) and exits via `_exit(123)`.

This was useful for tracking down stress-test memory blowups in the
cancellation_empty_by test: the lean task pool ratchets under wait
pressure (`m_max_std_workers++` on each in-pool `wait_for`, with
`spawn_worker()` if no idle worker is available), each new worker
holds its own mimalloc arena, and process RSS climbs accordingly --
which `-M` then trips. The guard makes the symptom reproducible at a
specific threshold.

Uses raw `write(2, ...)` and `_exit` rather than `std::cerr` /
`std::abort` / `print_backtrace`, because the check is inside
`m_mutex`; the usual print/abort paths can deadlock or hang flushing
buffered IO while other task workers are blocked on the same mutex.

Not wired up by default and a no-op without the env var; intentionally
not a long-term feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds diagnostic anchors to make it possible to tell, from a failed
`.out.produced`, which stage of the test the file worker reached:

- `#eval` of `IO.eprintln "test: imports done"` after imports (sync,
  fires once, deterministically first).
- `#eval` markers before and after the empty-`by` example.
- `dbg_trace` at the entry of `tracerSuggestion` and right before its
  return.
- `dbg_trace` at the entry of `t1`'s tactic body.

Also flips `set_option tactic.try.onlyUserSuggestions true` so the
snapshot task's `try?` skips its library-search branches (paired
with the corresponding feature commit). Without it the test path
spends most of its time in `exact?`/`grind`/`simp_all`, none of
which the test exercises.

The expected output is updated to match the new traces. The order
between async-task and command-level prints is empirical; this is a
test of the snapshot/cancellation plumbing, not of trace ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR makes `Lean.Lsp.Ipc.readMessage` throw when the watchdog
reports its file worker has crashed (`$/lean/fileProgress` with
`fatalError` kind), so that test-runner read loops with no pending
request -- in particular `waitForMessage` -- don't sit forever waiting
for a message a dead worker will never produce.

`collectDiagnostics` and `waitForILeans` already exit on crashes
because the watchdog also sends a `responseError` to their pending
`waitForDiagnostics` / `waitForILeans` requests. `waitForMessage`,
which doesn't issue a request, was silently discarding the `fatalError`
notification and blocking until the test framework's outer timeout
killed the whole process.

With this change, a forced-crash test (`IO.Process.forceExit 9` from
inside an `elab` command, then a `--^ waitFor: "..."` directive)
returns in ~1 second with `uncaught exception: Lean file worker
reported a fatal error (likely crashed)` instead of timing out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR follows up on the previous IPC fatalError-abort fix to address
a false-positive trigger and adds a regression test.

The previous version checked `$/lean/fileProgress` with `fatalError`
kind inside `Ipc.readMessage`, which fires for *every* IPC consumer.
That breaks tests like `importCompletion.lean`, which intentionally
elaborates a file with an incomplete `import` line; the worker stays
alive but reports `fileProgress fatalError` because header processing
failed, and the test then uses the still-alive worker to obtain
completion items. The blanket abort prevented the completions from
ever being read.

Move the check into `waitForMessage` itself, where it actually
matters: that loop has no pending request, so the watchdog's
`responseError` (which `collectDiagnostics` / `waitForILeans` already
handle) doesn't help, and without this check it would block until the
outer test timeout. All other IPC paths are unaffected.

Also wire up `TEST_EXIT` in `tests/server_interactive/run_test.sh` so
individual tests can declare expected exit codes via
`<file>.init.sh`, matching the convention in `tests/compile/`. The
new `worker_crash` test exercises the abort path:
`IO.Process.forceExit` from inside an `elab` command, a `waitFor`
directive on the line after, `TEST_EXIT=1`, and an `.out.expected`
matching the `uncaught exception: waitForMessage: ...` output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR moves `waitForMessage` out of `Lean.Lsp.Ipc` and into
`Lean.Server.Test.Runner`, its only caller. The function exists to
implement the `waitFor` test directive and is not a general-purpose
IPC primitive (it discards all received messages other than the one
it's waiting for, which is an actively unsafe pattern outside of a
test driver). Per Sebastian's review of the prior commits, keeping it
inside `Ipc` was the wrong layer.

No behavior change: the fatalError-abort check added in the previous
commit moves along with the function. The new home is right next to
`processWaitFor`, which is the sole caller.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR restores the `cancellation_empty_by.lean` and
`cancellation_par.lean` server-interactive tests that were deleted in
#13711 ("delete flaky tests for now"). With #13710 in place, a hung
worker now surfaces as a prompt `waitForMessage` abort rather than a
1500s CTest timeout, so the tests are tractable to keep enabled.

Trim diagnostic tracing in `cancellation_empty_by.lean` to the
minimum that's actually causally ordered against the test's
synchronisation points:

* `test: imports done` -- synchronous `#eval` at the top of the file,
  fires once before any async task is spawned.
* `tracerSuggestion ready` -- gated to the first `tracerSuggestion`
  invocation via `mkTestTask`, fires exactly once.
* `cancelTokenSet` -- inside the `cancelTk.onSet` callback, fires
  exactly once when `cancelRec` reaches the snapshot.
* `sync received` (x2) -- in `t1`'s body after `wait_for_sync`
  returns, once per elaboration.

Removed traces that were either every-invocation (and therefore raced
non-deterministically with the snapshot task's other output) or that
fired at command-elab boundaries where their position relative to
async stderr buffers was unstable: `tracerSuggestion: entered`,
`tracerSuggestion: returning candidate`, `t1: body entered`,
`test: before/after empty-by example`, `test: file end`.

Also drop the `LEAN_DEBUG_MAX_WORKERS` runtime hack added during
investigation: it served its purpose (confirming the
wait-pressure-driven worker pool ratchet in `task_manager::wait_for`)
but is not appropriate as a long-lived debug knob in the runtime.

Stress: 24/24 parallel runs pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nomeata nomeata added release-ci Enable all CI checks for a PR, like is done for releases and removed merge-ci Enable merge queue CI checks for PR. In particular, produce artifacts for all major platforms. labels May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-ci Enable all CI checks for a PR, like is done for releases toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants