
perf(watchdog): batch registrations, exclude noise directories, and 9p mount exclude #1621

Open

linkliti wants to merge 2 commits into agent0ai:ready from linkliti:watchdog-improv

Conversation

@linkliti (Contributor) commented May 9, 2026

A0 startup was painfully slow on my machine. On a Win 11 setup with A0 and Docker both on an HDD, registering watchdogs took over 11 minutes because the observer was scheduling watches across 60k files and 8k folders (mostly because of .git and .venv).

Three changes fix this:

  1. Batch registrations. Each add_watchdog call was triggering a full observer reschedule (O(n^2) overall). Added a batch_watchdogs() context manager so all registrations in a group defer the refresh until the end. This alone dropped 708s to 226s.
  2. Exclude noise directories from scheduling. Instead of recursive=True on root paths, switch to recursive=False per directory and enumerate watchable dirs with os.walk, pruning __pycache__, .venv, node_modules, .git, and ~80 other noise folders from get_noise_folders() (see the sketch after this list). These directories are never scheduled at all, so the observer never watches them. Dropped to 97s on its own, 31s combined with batching. A further optimization for .a0proj/plugins could skip watching the projects folder entirely when the _time_travel plugin is disabled.
  3. Skip 9p mounts entirely. On WSL2 Docker, /a0/usr is often a 9p bind mount from the Windows host. inotify (used by watchdog) produces zero events on 9p (confirmed with standalone tests: InotifyObserver fires nothing, while PollingObserver catches everything). Detecting 9p via /proc/mounts and skipping those roots in add_watchdog drops the total to under 1 second for the WSL mount, since no events are produced there anyway (related: [WSL2] File changes made by Windows apps on Windows filesystem don't trigger notifications for Linux apps microsoft/WSL#4739).
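A minimal sketch of the point-2 enumeration (get_noise_folders() matches the PR; the walk itself is illustrative, not the exact patch):

  import os

  def get_noise_folders() -> set[str]:
      # Abridged; the PR's helpers/exclusion.py covers ~80 noise folders.
      return {"__pycache__", ".git", ".venv", "node_modules", ".mypy_cache"}

  def iter_watchable_dirs(root: str):
      """Yield each directory to schedule individually with recursive=False."""
      noise = get_noise_folders()
      for dirpath, dirnames, _files in os.walk(root):
          # Pruning dirnames in place stops os.walk from descending into
          # noise folders, so the observer never schedules them at all.
          dirnames[:] = [d for d in dirnames if d not in noise]
          yield dirpath

Because every watch is non-recursive, new subdirectories are only picked up when the observer is refreshed on directory create/move events (see the commit message below).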

Benchmarks of _10_register_watchdogs.py (the _time_travel plugin was disabled, but the improvements also affect its watchdog):

  • Default - RegisterWatchDogs executed in 708.511422 seconds
  • Batch registrations - RegisterWatchDogs executed in 226.029624 seconds
  • Noise-dir exclude only - RegisterWatchDogs executed in 97.740067 seconds
  • Both - RegisterWatchDogs executed in 31.360905 seconds (mostly affected by a custom-named Windows venv in one of my projects)
  • Both + 9p exclude - RegisterWatchDogs executed in 0.568248 seconds (Windows mounts only; a proper fix needs polling, which was not implemented in favor of existing API calls and extensions)

…kip 9p mounts

Add batch_watchdogs() context manager to defer observer refreshes until all watchdog registrations complete, eliminating redundant reschedule cycles during initialization. Switch from recursive=True to per-directory recursive=False scheduling with os.walk pruning so noise folders like __pycache__, .venv, and node_modules are never observed in the first place. Refresh the observer on directory create and move events to pick up new subdirectories under non-recursive watches.
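A sketch of the deferred-refresh mechanism, assuming the registry gates refreshes on a _batching flag (simplified; the real helpers/watchdog.py must also handle thread safety):

  from contextlib import contextmanager

  @contextmanager
  def batch_watchdogs(registry):
      """While active, add() skips _refresh_observer();
      a single refresh runs once on exit."""
      registry._batching = True
      try:
          yield
      finally:
          registry._batching = False
          registry._refresh_observer()  # one reschedule instead of one per registration

This turns N registrations from N full reschedules into one.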

Extract a reusable get_noise_folders() into helpers/exclusion.py covering version control, build output, caches, and language-specific directories, replacing the hardcoded __pycache__ ignore patterns in watchdog defaults. Detect 9p remote mounts via /proc/mounts and skip them in add_watchdog since inotify produces zero events on 9p filesystems.
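The 9p check can be a simple scan of /proc/mounts, whose third field is the filesystem type (the function name and longest-prefix matching here are illustrative assumptions, not necessarily the PR's exact logic):

  def is_9p_mount(path: str) -> bool:
      """True if the longest-prefix mount containing path is a 9p filesystem."""
      best_mp, best_fs = "", ""
      with open("/proc/mounts") as f:
          for line in f:
              fields = line.split()
              if len(fields) < 3:
                  continue
              mountpoint, fstype = fields[1], fields[2]
              if path == mountpoint or path.startswith(mountpoint.rstrip("/") + "/"):
                  if len(mountpoint) > len(best_mp):
                      best_mp, best_fs = mountpoint, fstype
      return best_fs == "9p"

On WSL2, a /a0/usr backed by the Windows host shows up with fstype 9p, so add_watchdog can skip such roots before scheduling anything.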
@revworxai commented

Hi @linkliti — thanks for this PR, the noise-folder exclusion and 9p mount handling are nice quality-of-life wins. I want to flag a related correctness issue I just root-caused independently in helpers/watchdog.py that I think this PR is one line away from also fixing permanently.

Symptom: full startup deadlock (not just "slow")

On a stock build of agent0ai/agent-zero:latest (regression introduced in e4f974b — "refactor: add file system watchdog support for API handlers, extensions, and plugins"), run_ui reliably hangs at boot before it ever binds its port. The container is healthy and supervisor reports run_ui RUNNING, but ss -ltnp shows no listener on :80 and the WebUI displays "Backend Disconnected, cannot restart". The hang persists indefinitely — it's not just slow; the process is permanently wedged.

Root cause: AB-BA lock-ordering bug

_WatchRegistry.add() (and remove(), clear()) call self._refresh_observer() while still holding self._lock. _refresh_observer() calls observer.unschedule_all() on the underlying watchdog.observers.Observer, which needs the Observer's internal dispatch thread to be quiescent. But that dispatch thread routes events through _WatchRegistry.dispatch(), which itself acquires self._lock to read self._watches. Result: classic AB-BA circular wait once any plugin gets registered and a single event has fired.

The author of the original code already knew the right pattern — _stop_observer() captures state under self._lock, releases the lock, then calls observer.unschedule_all()/stop()/join(). The bug is that add()/remove()/clear() don't follow the same pattern.
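In miniature, with hypothetical locks standing in for _WatchRegistry._lock (A) and the observer's internal lock (B):

  import threading

  lock_a = threading.Lock()  # stands in for _WatchRegistry._lock
  lock_b = threading.Lock()  # stands in for BaseObserver's internal lock

  def add():       # main thread: holds A, then needs B
      with lock_a:
          with lock_b:   # unschedule_all() must wait for the dispatch thread
              pass

  def dispatch():  # observer thread: holds B, then needs A
      with lock_b:
          with lock_a:   # dispatch() reads self._watches
              pass

  # Run concurrently, each thread can acquire its first lock and then wait
  # forever on the other's: a circular wait, i.e. deadlock.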

py-spy evidence (from a wedged pid)

Process <pid>: /opt/venv-a0/bin/python /a0/run_ui.py --dockerized=true --port=80 --host=0.0.0.0

Thread <pid> (idle): "MainThread"
    unschedule_all (watchdog/observers/api.py:370)
    _refresh_observer (helpers/watchdog.py:235)
    add (helpers/watchdog.py:120)
    add_watchdog (helpers/watchdog.py:384)
    register_watchdogs (helpers/plugins.py:157)
    execute (extensions/.../_10_register_watchdogs.py:10)
    call_extensions_sync (helpers/extension.py:246)
    _run_sync (helpers/extension.py:201)
    run (run_ui.py:20)
    <module> (run_ui.py:88)

Thread <thread-1> (idle): "Thread-1"
    dispatch (helpers/watchdog.py:172)            # ← also wants self._lock
    dispatch (helpers/watchdog.py:17)
    dispatch_events (watchdog/observers/api.py:391)
    run (watchdog/observers/api.py:213)

MainThread holds self._lock, waits for unschedule_all() to return, which needs Thread-1 to be idle. Thread-1 is sitting in dispatch() waiting for self._lock. Deadlocked. wchan = futex_wait_queue on both. bind() on port 80 never reached.

How this PR interacts with the bug

This PR introduces the elegant batch_watchdogs() context manager which gates the refresh:

if not self._batching:
    self._refresh_observer()

During startup (which is the only place anyone reliably hit the wedge today), that branch is skipped — so this PR eliminates the symptom at boot, and that's why it likely already "works" in your testing.

However, the underlying lock-ordering bug is still latent — the gated call (and the equivalent ungated paths after batch_watchdogs() is left, and the runtime add_watchdog() call from plugin hot-reload / MCP service registration / dynamic project mounts) all still call self._refresh_observer() while holding self._lock. Anyone hitting those code paths in production can wedge the same way.

Proposed amendment to this PR

Three tiny dedents apply the _stop_observer() pattern uniformly. Combined with your batching changes, the result is both fast and deadlock-free in all paths:

  def add(...):
      with self._lock:
          ...
          self._watches.update(watches)
          self._watch_ids_by_group[id] = set(watches)
-         if not self._batching:
-             self._refresh_observer()
+     if not self._batching:
+         self._refresh_observer()

  def remove(self, id: str) -> bool:
      with self._lock:
          ...
-         if removed and not self._batching:
-             self._refresh_observer()
-         return removed
+     if removed and not self._batching:
+         self._refresh_observer()
+     return removed

  def clear(self) -> None:
      with self._lock:
          ...
          pending_batches = list(self._pending_batches.values())
          self._pending_batches.clear()
-         if not self._batching:
-             self._refresh_observer()
+     if not self._batching:
+         self._refresh_observer()
      for pending in pending_batches:
          ...

Net diff: 3 lines moved up one indent level. No semantic change to your perf work; just closes the AB-BA window for the non-batch path.

The new batch.__exit__ path you added (self._refresh_observer() after setting self._batching = False) is already correctly outside any with self._lock: block.

Also worth noting: in this PR, dispatch() now kicks off refreshes via threading.Thread(target=self._refresh_observer, daemon=True).start(). That's safe because dispatch() isn't holding self._lock at that point, but it means refreshes can race with add() more frequently in non-batch paths, which makes the dedent-out-of-lock fix more important, not less.

Reproducer

  1. docker run --rm -it agent0ai/agent-zero:latest (vanilla, no mods)
  2. Wait ~30s for init
  3. docker exec <id> ss -ltn | grep :80 → empty
  4. docker exec <id> curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1/000 (instantly refused, not a timeout)
  5. docker exec <id> /opt/venv-a0/bin/pip install py-spy && docker exec <id> /opt/venv-a0/bin/py-spy dump --pid $(docker exec <id> pgrep -f run_ui.py) → stack as shown above

If you'd prefer to keep this PR scope-locked to perf

Happy to open a small follow-up PR (1 file, 6 lines) immediately after this one merges, referencing this discussion. Just let me know your preference — I have the patch and reproducer ready to go either way.

Thanks again for tackling the watchdog perf work — once the lock fix is in, this whole subsystem will be substantially more robust.

…ove/clear

Dedent _refresh_observer() calls out of with self._lock: blocks in _WatchRegistry.add(), remove(), and clear() to break an AB-BA deadlock between _WatchRegistry._lock (L1) and BaseObserver._lock (L2) in `/opt/venv-a0/lib/python3.12/site-packages/watchdog/observers/api.py`. The main thread held L1 while calling unschedule_all() which needs L2; the observer dispatch thread held L2 while calling dispatch() which needs L1.