fix: stop forkserver workers from orphaning when serve+index race (closes #66)#69
Conversation
Five fixes for the WSL OOM AZagatti reported, all in one PR because they form a connected failure mode: a sibling `cce index` triggers read-only events that `cce serve` mis-routes into its reindex worker, which spawns a fastembed forkserver pool that orphans on shutdown. 1. watcher: switch _DebouncedHandler from on_any_event to typed on_modified/on_created/on_deleted/on_moved handlers. Read-only `opened`/`closed_no_write` events from the sibling indexer no longer queue reindex work. This is the actual trigger. 2. embedder._resolve_parallel: CCE_EMBED_PARALLEL=0 (and the strings "none"/"off"/"false"/"no", case-insensitive) now map to None (single-process), matching the darwin/win32 default. Previously max(1, int(v)) floored to 1, but parallel=1 still takes the multiprocessing path. Also caps positive overrides at cpu_count so `CCE_EMBED_PARALLEL=64` on a 12-CPU host can't over-spawn. 3. embedder: _resolve_parallel is now evaluated per embed call instead of at import. Required for `cce serve` to set the env var inside the function and have it take effect. 4. cli._run_serve: defaults CCE_EMBED_PARALLEL=0 via os.environ.setdefault so any reindex worker triggered inside `cce serve` runs single-process. User override (CCE_EMBED_PARALLEL=4 in .mcp.json) is respected. Bulk `cce index` from a separate shell still gets the multiprocess path. 5. cli._run_serve: SIGINT, SIGTERM, and SIGHUP all route through one asyncio signal handler that cancels the MCP task, so Ctrl-C now triggers the existing finally cleanup instead of being swallowed by stdio reads. Previously only SIGTERM worked, leaving SIGKILL as the common path — which is exactly what produced the orphan workers. 6. hook_server: register an on_cleanup callback that unlinks both serve.port files (storage_base + ~/.cce/projects/<name>/serve.port). Stale port files used to survive across exits and silently break memory capture in the next session. Tests added: - Six new test_embedder.py cases: CCE_EMBED_PARALLEL=0, the string tokens, cpu_count cap, negative values, lazy re-evaluation - Three new test_watcher.py cases: read-only event suppression, directory event skip, idempotent moves with src==dest - New tests/memory/test_hook_server_cleanup.py: port files created and removed, tolerates missing files, default-layout edge case Suite: 891 passed, 1 skipped, 0 failed. Ruff clean.
There was a problem hiding this comment.
Pull request overview
This PR addresses issue #66 by breaking the serve+index failure chain that could cause cce serve to spawn fastembed forkserver workers and then orphan/leak them on abnormal shutdown. It does so by tightening watcher event filtering, making embedding parallelism configurable (including a true single-process mode), ensuring cce serve defaults to single-process embedding, improving signal-driven shutdown, and cleaning up stale serve.port files.
Changes:
- Restrict watcher-triggered reindexing to content-changing filesystem events only.
- Make
CCE_EMBED_PARALLELsupport true disable (0and string tokens), cap values tocpu_count, and resolve lazily per embed call; defaultcce serveto single-process embedding. - Add orderly shutdown handling for
cce serveand ensure hook server port files are removed on cleanup; expand regression test coverage.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/context_engine/cli.py | Defaults serve to single-process embedding and adds signal handlers to drive orderly shutdown. |
| src/context_engine/indexer/embedder.py | Updates parallel resolution semantics (disable tokens, cpu cap) and evaluates per embed call. |
| src/context_engine/indexer/watcher.py | Switches from on_any_event to typed handlers to ignore read-only events. |
| src/context_engine/memory/hook_server.py | Tracks/wipes both authoritative and rendezvous serve.port files during cleanup. |
| tests/indexer/test_embedder.py | Adds regression tests for new _resolve_parallel behavior. |
| tests/indexer/test_watcher.py | Adds regression tests for event-type filtering behavior. |
| tests/memory/test_hook_server_cleanup.py | New tests validating hook server port-file creation and cleanup behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Five review findings, all resolved: 1. Windows crash: `signal.SIGHUP` is not defined on Windows, so the literal tuple `(signal.SIGINT, signal.SIGTERM, signal.SIGHUP)` raised AttributeError *before* the try/except could swallow it, making `cce serve` unrunnable on Windows. Build the candidate list with `getattr(signal, 'SIGHUP', None)` and skip None entries. 2. Test event loop leak: `_make_handler` created an `asyncio.new_event_loop()` that was never closed, emitting ResourceWarnings on GC and risking flakes under -W error. Tracked the loops in a per-test `synthetic_loops` fixture that closes them on teardown. 3. hook_server cleanup comment used to claim it covered SIGKILL paths, but `app.on_cleanup` doesn't fire on SIGKILL. Reworded to describe what it actually covers (graceful exits) and pointed at the socket-liveness probe for the ungraceful case. 4. watcher.on_moved comment said the handler queued "whichever side is inside the watch dir" but the implementation didn't actually filter — _enqueue/_should_ignore does. Reworded the comment to match the real flow. 5. embedder._resolve_parallel docstring used a misleading `"<int>=N"` placeholder. Replaced with `"N"` to match the actual parser. 891 passed, 1 skipped, 0 failed. Ruff clean.
rajkumarsakthivel
left a comment
There was a problem hiding this comment.
LGTM. All 5 root causes from #66 addressed cleanly.
Watcher event filtering breaks the leak chain. Lazy _resolve_parallel, SIGINT routing, and serve.port cleanup are solid. Windows-safe signal handling with getattr for SIGHUP. 12 new test cases with good coverage including the synthetic_loops fixture for cleanup.
Closes #66. Lands all five fixes from the issue thread in a single PR
because they form one connected failure mode rather than independent
bugs:
Cutting any one link breaks the leak. Fixing all five removes the
failure mode and tightens adjacent rough edges (env-var foot-guns,
unkillable serve, stale port files).
Fixes
Watcher event filtering (
indexer/watcher.py) —_DebouncedHandlerswitches from
on_any_eventto typedon_modified/on_created/on_deleted/on_moved. Read-onlyopened/closed_no_writeeventsfrom the sibling indexer no longer queue reindex work. This is the
actual trigger.
CCE_EMBED_PARALLEL=0disables, plus string tokens(
none/off/false/no, case-insensitive) (indexer/embedder.py).Old behaviour
max(1, int(v))floored to 1 butparallel=1stilltook the multiprocessing path. Positive values are now capped at
cpu_countsoCCE_EMBED_PARALLEL=64on a 12-CPU host can'tover-spawn.
_resolve_parallelis lazy — evaluated per embed call instead ofat import. Required so
cce servecan set the env var inside thefunction body and have it take effect.
cce servedefaults to single-process embed (cli._run_serve)via
os.environ.setdefault(\"CCE_EMBED_PARALLEL\", \"0\"). Useroverrides (e.g.
\"env\": {\"CCE_EMBED_PARALLEL\": \"4\"}in.mcp.json) are respected. Bulkcce indexfrom a separate shellstill gets the multiprocess path.
SIGINT routes through orderly shutdown (
cli._run_serve) — SIGINT,SIGTERM, and SIGHUP share one asyncio signal handler that cancels
the MCP task so the existing
finallycleanup runs. Previously onlySIGTERM worked, leaving SIGKILL as the common path — exactly what
produced the orphan workers in the first place.
serve.portcleanup (memory/hook_server.py) —on_cleanupcallback unlinks both port files (storage_base + rendezvous). Stale
files no longer survive across exits and silently break memory
capture in the next session.
Tests
test_embedder.py+6 cases — zero, string tokens, cpu_count cap,negative values, lazy re-evaluation
test_watcher.py+3 cases — read-only events ignored, directoryevents skipped, src==dest move is idempotent
tests/memory/test_hook_server_cleanup.py(new, 3 cases) — portfiles created and removed, tolerates missing files, default-layout
edge case
Verification
pytest -n 4→ 891 passed, 1 skipped, 0 failedruff checkon every changed file → cleanTest plan
cce serverunning, then siblingcce index --path README.md— no forkserver children shouldappear under serve's PID
kill -2(SIGINT) thecce serveprocess — clean exit, nostale
serve.portleft behindCCE_EMBED_PARALLEL=0 cce indexon a real project — finishes,single-process, no orphans
CCE_EMBED_PARALLEL=4in.mcp.json— stillgets 4 workers when running
cce indexoutside serve