
Forkserver workers leak as orphan processes (~5 GB per cce index) when cce serve is also running #66

@AZagatti

Description

What happened?

I was hitting WSL out-of-memory crashes during long Claude Code sessions with CCE active. Twice the whole VM went down and I had to pkill cce after restart to keep it from happening again immediately. After the last one I dug in with Claude Code as a debug pair to figure out what was going on, and it turned out to be reproducible with a pretty small setup.

The headline: every cce index invocation while cce serve is also running for the same project leaves behind ~5 GB of worker processes that never exit. They become orphans (reparented to init or whatever's left of my shell) and just sit there holding memory until I manually kill -9 them.

One clean repro I captured on my slop-clicker project (SvelteKit + Supabase, ~25k chunks indexed):

1. cce serve --project-dir /path/to/slop-clicker   (PID 213019, ~313 MB idle)
2. another shell: cce index --path README.md
   → "Indexed 0 chunks from 0 files", exits in <2s
3. cce serve's process tree now has:
     213019 cce serve
       ├── 214862  multiprocessing.resource_tracker  (15 MB)
       └── 214863  multiprocessing.forkserver        (16 MB)
             ├── 214864  worker  (1570 MB, state=R)
             ├── 214865  worker  (843 MB,  state=R)
             ├── 214866  worker  (1570 MB, state=R)
             └── 214867  worker  (1570 MB, state=R)
4. kill -KILL 213019
5. 3 seconds later — workers still alive:
     resource_tracker  PPID=1556 (reparented)  RSS=15 MB   state=S
     forkserver        PPID=1556 (reparented)  RSS=17 MB   state=S
     worker  214864    still ~1.6 GB RSS, transitioned R → S
     worker  214866    still ~1.6 GB RSS, transitioned R → S
     worker  214867    still ~1.6 GB RSS, transitioned R → S
     (one worker exited normally, three didn't)
6. 10s later: still alive, still holding RSS, doing nothing

That's 5.2 GB leaked from one tiny cce index --path invocation. Across a long planning session with multiple commits / indexes / file edits, this compounds. I'm on a 12 GB WSL cap so it didn't take much before things got tight.

Important detail: cce index alone (no concurrent cce serve) doesn't leak. I ran cce index --path README.md three times in a row in a no-serve environment and ended with 0 leftover forkserver processes. The leak only happens when there's a sibling cce serve whose _reindex_worker got triggered by cce index's file-open events. The serve-spawned workers continue running for the queued reindex backlog, then idle, then never exit.

A few related things I noticed while digging:

Watcher reacts to read-only inotify events. During the cce index --path README.md above, watchdog fires 618 opened events plus 618 closed_no_write events for files cce index opens to hash. Zero modified / created / deleted. CCE's watcher (indexer/watcher.py:44) uses on_any_event without filtering by type, so all of them flow into _reindex_pending and the reindex worker tries to process each one.

For unchanged files run_indexing skips embedding via the hash check (I verified this — 100 plain cat ... > /dev/null reads only grew cce serve RSS by 3 MB and spawned zero workers). But when cce index is the source of the events, somewhere in run_indexing's path the embedder.embed call does fire even with 0 changed chunks, and that's what spawns the pool above. I didn't trace which exact line — just empirically saw it happen.
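For illustration, here's a dependency-free sketch of the kind of event-type filter I mean. The names FsEvent, queue_for_reindex, and MUTATING_EVENTS are mine, not CCE's; in the real code this would live in the watchdog handler in indexer/watcher.py, where event_type is the same plain string watchdog puts on its event objects:

```python
from dataclasses import dataclass

# Only these watchdog event types can change file content.
MUTATING_EVENTS = {"modified", "created", "deleted", "moved"}

@dataclass
class FsEvent:
    """Stand-in for watchdog.events.FileSystemEvent."""
    event_type: str
    src_path: str
    is_directory: bool = False

def queue_for_reindex(event, pending):
    """Queue a path for reindex only for mutating, non-directory events.

    opened / closed_no_write events (the ones cce index floods the watcher
    with) never reach the queue, so they can't wake the reindex worker.
    """
    if event.is_directory or event.event_type not in MUTATING_EVENTS:
        return False
    pending.add(event.src_path)
    return True
```

With this filter, the 1236 read-only events from a sibling cce index run would all be dropped before touching _reindex_pending.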

No way to make embedding single-process on Linux. _resolve_parallel in indexer/embedder.py:33 returns min(cpu_count, 4) and CCE_EMBED_PARALLEL has a max(1, int(v)) floor:

CCE_EMBED_PARALLEL=unset    → 4
CCE_EMBED_PARALLEL=0        → 1   (still multiprocess)
CCE_EMBED_PARALLEL=1        → 1
CCE_EMBED_PARALLEL=4        → 4
CCE_EMBED_PARALLEL=8        → 8   (no upper cap)
CCE_EMBED_PARALLEL=none     → 4   (string silently ignored)
CCE_EMBED_PARALLEL=off      → 4
CCE_EMBED_PARALLEL=false    → 4

So on a 12-CPU host every cce serve that ends up embedding spawns 4 workers at ~1.6 GB each. On darwin/win32 the default is None (single-process, no fanout) — no equivalent path for Linux/WSL even when you'd want it.
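A sketch of the resolution logic I'd expect instead. The function name and defaults are guesses from the observed behavior above, not the actual source of indexer/embedder.py:

```python
import os

def resolve_parallel(env=None):
    """Hypothetical _resolve_parallel replacement.

    0 / none / off / false  -> None (single-process, like darwin/win32)
    explicit integers       -> capped at cpu_count
    unset or unparseable    -> min(cpu_count, 4), the current default
    """
    env = os.environ if env is None else env
    cpus = os.cpu_count() or 1
    raw = env.get("CCE_EMBED_PARALLEL", "").strip().lower()
    if raw in {"0", "none", "off", "false"}:
        return None  # caller embeds in-process, no pool at all
    try:
        n = int(raw)
    except ValueError:
        return min(cpus, 4)  # unset/garbage falls back to today's default
    return max(1, min(n, cpus))  # explicit value, capped at cpu_count
```

Returning None (rather than 1) matters because even a 1-worker pool still pays the forkserver + resource_tracker overhead and still leaks.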

SIGINT and SIGQUIT are ignored by cce serve. Only SIGTERM, SIGHUP, SIGUSR1, SIGUSR2, and stdin EOF cause an exit. My guess is the asyncio loop is swallowing SIGINT without re-raising it. Not a memory bug, but it bit me when I was trying to clean up orphans with kill -2.
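A minimal sketch of what I mean, assuming _run_serve owns the event loop (the serve() stand-in below is mine, not CCE's code). asyncio's loop.add_signal_handler installs the handler on the running loop, so SIGINT can't be swallowed:

```python
import asyncio
import signal

async def serve():
    """Stand-in for _run_serve's loop: exits cleanly on any listed signal."""
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGQUIT):
        loop.add_signal_handler(sig, stop.set)  # replaces the default handler
    await stop.wait()  # the real serve loop would run until a signal arrives
    return "clean shutdown"
```

add_signal_handler is Unix-only and must run in the main thread, which should match where cce serve's loop lives.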

What did you expect?

When cce serve or its forkserver pool shuts down, the multiprocessing children should be cleaned up with it. I'd expect a try ... finally around the embed call that calls pool.terminate() (or pool.close() for a graceful drain) followed by pool.join(), or a shutdown handler in _run_serve that propagates to the worker pool before exiting.
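Concretely, something shaped like this. _embed_one and embed_chunks are hypothetical stand-ins, and I'm using the fork start method so the snippet stays self-contained; the same with/join pattern applies to CCE's forkserver pool:

```python
import multiprocessing as mp

def _embed_one(chunk):
    # Stand-in for the real per-chunk embedding call.
    return len(chunk)

def embed_chunks(chunks, parallel=2):
    ctx = mp.get_context("fork")
    # Pool's context manager calls terminate() on exit, even if map() raises,
    # so no worker can outlive this call; join() then reaps them.
    with ctx.Pool(processes=parallel) as pool:
        results = pool.map(_embed_one, chunks)
    pool.join()
    return results
```

Either pattern would have prevented the orphans in the repro: once the embed call returns (or throws), the workers are terminated and reaped before cce serve goes back to idling.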

For the related items:

  • Watcher should filter event types — on_modified/on_created/on_deleted/on_moved only, not on_any_event.
  • CCE_EMBED_PARALLEL=0 (or none/off) should map to the same single-process path darwin/win32 get for free. And probably cap the upper bound at cpu_count so users can't accidentally over-spawn.
  • SIGINT could be wired up to the same shutdown handler SIGTERM uses.

Steps to reproduce

Pre-stage the fastembed model so the test isn't subject to download issues (separate problem, filing separately):

export FASTEMBED_CACHE_PATH=$HOME/.cache/fastembed
SNAP="$FASTEMBED_CACHE_PATH/models--qdrant--bge-small-en-v1.5-onnx-q/snapshots/52398278842ec682c6f32300af41344b1c0b0bb2"
mkdir -p "$SNAP" && cd "$SNAP"
for f in config.json tokenizer.json tokenizer_config.json special_tokens_map.json model_optimized.onnx; do
  curl -sL -o "$f" "https://huggingface.co/qdrant/bge-small-en-v1.5-onnx-q/resolve/main/$f"
done

Then on any indexed project (I tested on slop-clicker, 25k chunks — anything non-trivial should repro):

# terminal 1
FASTEMBED_CACHE_PATH=$HOME/.cache/fastembed cce serve --project-dir /path/to/proj &
# wait for "CCE ready ..." in stderr

# terminal 2
FASTEMBED_CACHE_PATH=$HOME/.cache/fastembed cce index --path README.md
# completes in ~2s, "Indexed 0 chunks from 0 files"

# terminal 3 — inspect cce serve's process tree
SERVE_PID=$(pgrep -f 'python.*cce serve' | head -1)
pstree -p $SERVE_PID
# you'll see the forkserver supervisor + 4 workers, ~1.6 GB each

# kill cce serve and check
kill -KILL $SERVE_PID
sleep 3
ps aux | grep -E 'multiprocessing\.(forkserver|resource_tracker)' | grep -v grep
# the workers are still there, now orphans. SIGKILL to clean up.

Relevant logs or error output

Process snapshots and timelines from the investigation:
https://gist.github.com/AZagatti/7393f669a0fd785d7153e07a52a11127

Most relevant for this bug:

  • 04-index-process-poll.txt — 60 ticks of process state during a full cce index, 12 distinct PIDs, ~8.7 GB combined RSS at peak
  • 06-idle-vs-index-poll.txt — cce serve idle for 90s (stable, no children), then sibling cce index ran and serve spawned its own forkserver pool
  • 07-watcher-event-types.log — the 1236 inotify events from cce index --path README.md, 99% read-only

Python version

3.13.5

OS

Ubuntu 24.04 LTS on WSL2 (kernel 6.6.87.2-microsoft-standard-WSL2)

CCE version

0.4.19
