Skip to content

fix: stop forkserver workers from orphaning when serve+index race (closes #66)#69

Merged
rajkumarsakthivel merged 3 commits into
mainfrom
fix/66-worker-leak-and-watcher-events
May 12, 2026
Merged

fix: stop forkserver workers from orphaning when serve+index race (closes #66)#69
rajkumarsakthivel merged 3 commits into
mainfrom
fix/66-worker-leak-and-watcher-events

Conversation

@fazleelahhee
Copy link
Copy Markdown
Contributor

Closes #66. Lands all five fixes from the issue thread in a single PR
because they form one connected failure mode rather than independent
bugs:

A sibling cce index emits read-only inotify events → cce serve
mis-routes them into its reindex worker via on_any_event → the
reindex spawns a fastembed forkserver pool → the pool can't be
cleanly torn down because SIGINT is swallowed → workers orphan to
init and leak ~5 GB per cce index, compounding across a session.

Cutting any one link breaks the leak. Fixing all five removes the
failure mode and tightens adjacent rough edges (env-var foot-guns,
unkillable serve, stale port files).

Fixes

  1. Watcher event filtering (indexer/watcher.py) — _DebouncedHandler
    switches from on_any_event to typed on_modified/on_created/
    on_deleted/on_moved. Read-only opened/closed_no_write events
    from the sibling indexer no longer queue reindex work. This is the
    actual trigger.

  2. CCE_EMBED_PARALLEL=0 disables, plus string tokens
    (none/off/false/no, case-insensitive) (indexer/embedder.py).
    Old behaviour max(1, int(v)) floored to 1 but parallel=1 still
    took the multiprocessing path. Positive values are now capped at
    cpu_count so CCE_EMBED_PARALLEL=64 on a 12-CPU host can't
    over-spawn.

  3. _resolve_parallel is lazy — evaluated per embed call instead of
    at import. Required so cce serve can set the env var inside the
    function body and have it take effect.

  4. cce serve defaults to single-process embed (cli._run_serve)
    via os.environ.setdefault(\"CCE_EMBED_PARALLEL\", \"0\"). User
    overrides (e.g. \"env\": {\"CCE_EMBED_PARALLEL\": \"4\"} in
    .mcp.json) are respected. Bulk cce index from a separate shell
    still gets the multiprocess path.

  5. SIGINT routes through orderly shutdown (cli._run_serve) — SIGINT,
    SIGTERM, and SIGHUP share one asyncio signal handler that cancels
    the MCP task so the existing finally cleanup runs. Previously only
    SIGTERM worked, leaving SIGKILL as the common path — exactly what
    produced the orphan workers in the first place.

  6. serve.port cleanup (memory/hook_server.py) — on_cleanup
    callback unlinks both port files (storage_base + rendezvous). Stale
    files no longer survive across exits and silently break memory
    capture in the next session.

Tests

  • test_embedder.py +6 cases — zero, string tokens, cpu_count cap,
    negative values, lazy re-evaluation
  • test_watcher.py +3 cases — read-only events ignored, directory
    events skipped, src==dest move is idempotent
  • tests/memory/test_hook_server_cleanup.py (new, 3 cases) — port
    files created and removed, tolerates missing files, default-layout
    edge case

Verification

  • pytest -n 4891 passed, 1 skipped, 0 failed
  • ruff check on every changed file → clean

Test plan

  • On WSL 12 GB cap: cce serve running, then sibling
    cce index --path README.md — no forkserver children should
    appear under serve's PID
  • kill -2 (SIGINT) the cce serve process — clean exit, no
    stale serve.port left behind
  • CCE_EMBED_PARALLEL=0 cce index on a real project — finishes,
    single-process, no orphans
  • Existing user with CCE_EMBED_PARALLEL=4 in .mcp.json — still
    gets 4 workers when running cce index outside serve

Five fixes for the WSL OOM AZagatti reported, all in one PR because they
form a connected failure mode: a sibling `cce index` triggers read-only
events that `cce serve` mis-routes into its reindex worker, which spawns
a fastembed forkserver pool that orphans on shutdown.

1. watcher: switch _DebouncedHandler from on_any_event to typed
   on_modified/on_created/on_deleted/on_moved handlers. Read-only
   `opened`/`closed_no_write` events from the sibling indexer no longer
   queue reindex work. This is the actual trigger.

2. embedder._resolve_parallel: CCE_EMBED_PARALLEL=0 (and the strings
   "none"/"off"/"false"/"no", case-insensitive) now map to None
   (single-process), matching the darwin/win32 default. Previously
   max(1, int(v)) floored to 1, but parallel=1 still takes the
   multiprocessing path. Also caps positive overrides at cpu_count so
   `CCE_EMBED_PARALLEL=64` on a 12-CPU host can't over-spawn.

3. embedder: _resolve_parallel is now evaluated per embed call instead of
   at import. Required for `cce serve` to set the env var inside the
   function and have it take effect.

4. cli._run_serve: defaults CCE_EMBED_PARALLEL=0 via os.environ.setdefault
   so any reindex worker triggered inside `cce serve` runs single-process.
   User override (CCE_EMBED_PARALLEL=4 in .mcp.json) is respected. Bulk
   `cce index` from a separate shell still gets the multiprocess path.

5. cli._run_serve: SIGINT, SIGTERM, and SIGHUP all route through one
   asyncio signal handler that cancels the MCP task, so Ctrl-C now
   triggers the existing finally cleanup instead of being swallowed by
   stdio reads. Previously only SIGTERM worked, leaving SIGKILL as the
   common path — which is exactly what produced the orphan workers.

6. hook_server: register an on_cleanup callback that unlinks both
   serve.port files (storage_base + ~/.cce/projects/<name>/serve.port).
   Stale port files used to survive across exits and silently break
   memory capture in the next session.

Tests added:
- Six new test_embedder.py cases: CCE_EMBED_PARALLEL=0, the string
  tokens, cpu_count cap, negative values, lazy re-evaluation
- Three new test_watcher.py cases: read-only event suppression,
  directory event skip, idempotent moves with src==dest
- New tests/memory/test_hook_server_cleanup.py: port files created and
  removed, tolerates missing files, default-layout edge case

Suite: 891 passed, 1 skipped, 0 failed. Ruff clean.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses issue #66 by breaking the serve+index failure chain that could cause cce serve to spawn fastembed forkserver workers and then orphan/leak them on abnormal shutdown. It does so by tightening watcher event filtering, making embedding parallelism configurable (including a true single-process mode), ensuring cce serve defaults to single-process embedding, improving signal-driven shutdown, and cleaning up stale serve.port files.

Changes:

  • Restrict watcher-triggered reindexing to content-changing filesystem events only.
  • Make CCE_EMBED_PARALLEL support true disable (0 and string tokens), cap values to cpu_count, and resolve lazily per embed call; default cce serve to single-process embedding.
  • Add orderly shutdown handling for cce serve and ensure hook server port files are removed on cleanup; expand regression test coverage.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/context_engine/cli.py Defaults serve to single-process embedding and adds signal handlers to drive orderly shutdown.
src/context_engine/indexer/embedder.py Updates parallel resolution semantics (disable tokens, cpu cap) and evaluates per embed call.
src/context_engine/indexer/watcher.py Switches from on_any_event to typed handlers to ignore read-only events.
src/context_engine/memory/hook_server.py Tracks/wipes both authoritative and rendezvous serve.port files during cleanup.
tests/indexer/test_embedder.py Adds regression tests for new _resolve_parallel behavior.
tests/indexer/test_watcher.py Adds regression tests for event-type filtering behavior.
tests/memory/test_hook_server_cleanup.py New tests validating hook server port-file creation and cleanup behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/context_engine/cli.py
Comment thread tests/indexer/test_watcher.py Outdated
Comment thread src/context_engine/memory/hook_server.py Outdated
Comment thread src/context_engine/indexer/watcher.py
Comment thread src/context_engine/indexer/embedder.py
Five review findings, all resolved:

1. Windows crash: `signal.SIGHUP` is not defined on Windows, so the
   literal tuple `(signal.SIGINT, signal.SIGTERM, signal.SIGHUP)`
   raised AttributeError *before* the try/except could swallow it,
   making `cce serve` unrunnable on Windows. Build the candidate list
   with `getattr(signal, 'SIGHUP', None)` and skip None entries.

2. Test event loop leak: `_make_handler` created an `asyncio.new_event_loop()`
   that was never closed, emitting ResourceWarnings on GC and risking
   flakes under -W error. Tracked the loops in a per-test
   `synthetic_loops` fixture that closes them on teardown.

3. hook_server cleanup comment used to claim it covered SIGKILL paths,
   but `app.on_cleanup` doesn't fire on SIGKILL. Reworded to describe
   what it actually covers (graceful exits) and pointed at the
   socket-liveness probe for the ungraceful case.

4. watcher.on_moved comment said the handler queued "whichever side
   is inside the watch dir" but the implementation didn't actually
   filter — _enqueue/_should_ignore does. Reworded the comment to
   match the real flow.

5. embedder._resolve_parallel docstring used a misleading `"<int>=N"`
   placeholder. Replaced with `"N"` to match the actual parser.

891 passed, 1 skipped, 0 failed. Ruff clean.
Copy link
Copy Markdown
Member

@rajkumarsakthivel rajkumarsakthivel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. All 5 root causes from #66 addressed cleanly.

Watcher event filtering breaks the leak chain. Lazy _resolve_parallel, SIGINT routing, and serve.port cleanup are solid. Windows-safe signal handling with getattr for SIGHUP. 12 new test cases with good coverage including the synthetic_loops fixture for cleanup.

@rajkumarsakthivel rajkumarsakthivel merged commit d4e5e94 into main May 12, 2026
10 checks passed
@rajkumarsakthivel rajkumarsakthivel deleted the fix/66-worker-leak-and-watcher-events branch May 12, 2026 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Forkserver workers leak as orphan processes (~5 GB per cce index) when cce serve is also running

3 participants