fix: Prevent premature EventManager shutdown when multiple crawlers share it #1810

Mantisus wants to merge 8 commits into apify:master

Conversation
Pijukatel left a comment
Nice, I have just some small details
```python
contexts_to_enter: list[Any] = (
    [global_event_manager, local_event_manager] if local_event_manager else [global_event_manager]
)
```
I'm not so sure about this. Why should the crawler initialize a component that it doesn't use? And then tear it down? I think we should try to see a bigger picture first.
The linked issues are caused by Snapshotter and RecoverableState subscribing to an event manager that doesn't emit anything - how does that happen?
I would prefer to remove event_manager from the BasicCrawler constructor entirely and keep only the global one, configurable via service_locator.set_event_manager. However, this requires discussion and should probably be part of v2.
> I'm not so sure about this. Why should the crawler initialize a component that it doesn't use? And then tear it down? I think we should try to see a bigger picture first.
The global `event_manager` being started inside the crawler, while only being used by the crawler's internal components, is an early architectural decision. I assume it was done for a better user experience.
> The linked issues are caused by Snapshotter and RecoverableState subscribing to an event manager that doesn't emit anything - how does that happen?
I believe the root cause is bugs introduced during our work on supporting multiple parallel crawlers:
- An edge case we didn't account for, demonstrated by `test_multiple_crawlers_with_global_event_manager`. The first crawler to activate the `event_manager` is fully responsible for its lifecycle. When that crawler finishes, it tears down the `event_manager`, which clears all subscriptions, leaving any still-running crawlers without a functioning `event_manager`.
- The introduction of an internal `_service_locator` in `BasicCrawler`, which caused the user-provided `event_manager` to be used only for `Event.CRAWLER_STATUS` in `_crawler_state_task`, while internal components (`Snapshotter`, `RecoverableState`) continued subscribing to the global `service_locator`, which may never be started.
All that made the behavior of `service_locator.get_event_manager()` unstable and unpredictable.
This PR is an attempt to fix these issues within the current architecture without introducing breaking changes.
I think the fixes that solve two crawlers using the same `EventManager` are good. Regarding the global event manager, I see this as a quick fix.
> I would prefer to remove event_manager from the BasicCrawler constructor entirely...
I think this would lead to counterintuitive behavior, because `Configuration` has `EventManager`-specific fields that would be ignored in cases like this:

```python
# Global event manager already exists
BasicCrawler(configuration=Configuration(persist_state_interval=...))
```
Maybe each subscriber to the event manager could take at least an optional init argument that allows passing an explicit `event_manager` (if it creates storages, then a `service_locator` argument might be even better) instead of relying on the global one.
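As a sketch of that suggestion: a component could accept an explicit `event_manager` and fall back to the global locator only when none is given. All names below (`Snapshotter`, `ServiceLocator`, `get_event_manager`) are illustrative stand-ins, not Crawlee's actual classes:

```python
from typing import Optional


class EventManager:
    """Illustrative stand-in for an event manager."""

    def __init__(self, name: str) -> None:
        self.name = name


class ServiceLocator:
    """Illustrative global registry holding a default event manager."""

    def __init__(self) -> None:
        self._event_manager = EventManager('global')

    def get_event_manager(self) -> EventManager:
        return self._event_manager


service_locator = ServiceLocator()


class Snapshotter:
    """Accepts an explicit event_manager; falls back to the global
    locator only when none is provided."""

    def __init__(self, event_manager: Optional[EventManager] = None) -> None:
        self.event_manager = event_manager or service_locator.get_event_manager()


print(Snapshotter().event_manager.name)                         # global
print(Snapshotter(EventManager('crawler')).event_manager.name)  # crawler
```

This keeps the global locator as a convenience default while letting a crawler wire its own `event_manager` through to its internal components.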
When porting the service locator to crawlee-js (apify/crawlee#3325), I actually made the global serviceLocator object into a proxy that resolves to the crawler-scoped ServiceLocator when called from a crawler instance and to the global one otherwise.
If we adopted that in Python as well, it would solve the inconsistency caused by Snapshotter and friends still using the global locator.
Thanks, I'll study that solution
Let me know if you get stuck on something, it's kinda magic. But that's the cost of making the library usable without factories, builders and explicit dependency containers.
| """Initialize the event manager upon entering the async context.""" | ||
| self._active_ref_count += 1 | ||
| if self._active_ref_count > 1: | ||
| return self |
Potential task leak for multi-crawler scenarios; from Claude:
Bug: LocalEventManager not updated for ref-counting — causes task leak
The base class now correctly guards `_emit_persist_state_event_rec_task.start()` behind `ref_count == 1` (line 110). However, `LocalEventManager.__aenter__` (in `_local_event_manager.py:72-79`) unconditionally calls `self._emit_system_info_event_rec_task.start()` on every entry:
```python
async def __aenter__(self) -> LocalEventManager:
    await super().__aenter__()
    self._emit_system_info_event_rec_task.start()  # Called EVERY time!
    return self
```

Since `RecurringTask.start()` creates a new `asyncio.Task` and overwrites `self.task` without cancelling the previous one (`_utils/recurring_task.py:60-66`), the old task is leaked (still running, but with no reference to cancel it).
In the multi-crawler scenario with the default shared global LocalEventManager:
- Crawler 1 enters → `ref_count=1`, system_info task A created
- Crawler 2 enters → `ref_count=2`, system_info task B created, task A orphaned
- Crawler 1 exits → `stop()` cancels task B, `ref_count=1`, but task A is still running with no reference
- Crawler 2 exits → `stop()` tries to cancel the already-cancelled task B; task A is never stopped
LocalEventManager needs the same ref-count-aware guards:
```python
async def __aenter__(self) -> LocalEventManager:
    await super().__aenter__()
    if self._active_ref_count == 1:  # Only start on first entry
        self._emit_system_info_event_rec_task.start()
    return self

async def __aexit__(self, ...) -> None:
    if self._active_ref_count == 1:  # Only stop on last exit
        await self._emit_system_info_event_rec_task.stop()
    await super().__aexit__(exc_type, exc_value, exc_traceback)
```

Note: checking `== 1` in `__aexit__` must happen before `super().__aexit__()` decrements the count.
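The ref-count pattern discussed in this thread can be exercised end-to-end in a minimal, runnable sketch. `SharedEventManager` and `crawler` are illustrative names, and the start/stop counters stand in for the real recurring tasks:

```python
import asyncio


class SharedEventManager:
    """Minimal sketch of a ref-counted async context manager: the resource
    starts on the first __aenter__ and is torn down only on the last __aexit__."""

    def __init__(self) -> None:
        self._active_ref_count = 0
        self.started = 0   # stands in for starting recurring tasks
        self.stopped = 0   # stands in for tearing them down

    async def __aenter__(self) -> 'SharedEventManager':
        self._active_ref_count += 1
        if self._active_ref_count == 1:
            self.started += 1
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        # Check BEFORE decrementing, mirroring the fix above.
        if self._active_ref_count == 1:
            self.stopped += 1
        self._active_ref_count -= 1


async def crawler(manager: SharedEventManager, delay: float) -> None:
    async with manager:
        await asyncio.sleep(delay)


async def main() -> None:
    manager = SharedEventManager()
    # Two "crawlers" with overlapping lifetimes share one manager.
    await asyncio.gather(crawler(manager, 0.01), crawler(manager, 0.03))
    print(manager.started, manager.stopped)  # 1 1


asyncio.run(main())
```

The manager starts exactly once and stops exactly once, even though two overlapping contexts entered and exited it, which is the invariant the PR is after.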
```python
# Emit persist state event to ensure the latest state is saved before closing the context.
await self._emit_persist_state_event()
self._active_ref_count -= 1
return
```
We don't call `wait_for_all_listeners_to_complete` in this branch.
From Claude:
Consideration: Intermediate exit skips listener cleanup
When `ref_count > 1`, the exit emits a persist-state event but does not call `wait_for_all_listeners_to_complete()` or clean up listeners registered by the exiting crawler.
In practice this is likely fine because individual components (`Snapshotter`, `RecoverableState`) clean up their own listeners via `off()` in their `__aexit__`/`teardown()`. But any component that relies on the event manager's `remove_all_listeners()` as the cleanup mechanism would have stale listeners firing after its context is torn down.
Worth adding a comment here explaining this design choice, e.g. that individual components are responsible for their own listener cleanup, and `remove_all_listeners()` on final exit is just a safety net.
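The component-owned cleanup described above can be sketched with a toy pub/sub manager. `EventManager` and `Snapshot` here are illustrative, not Crawlee's actual classes:

```python
import asyncio
from typing import Any, Callable


class EventManager:
    """Minimal pub/sub sketch with on()/off()/emit() (illustrative names)."""

    def __init__(self) -> None:
        self._listeners: dict = {}

    def on(self, event: str, listener: Callable[[Any], None]) -> None:
        self._listeners.setdefault(event, []).append(listener)

    def off(self, event: str, listener: Callable[[Any], None]) -> None:
        self._listeners.get(event, []).remove(listener)

    def emit(self, event: str, payload: Any) -> None:
        for listener in list(self._listeners.get(event, [])):
            listener(payload)


class Snapshot:
    """A component that owns its listener lifecycle: it registers in
    __aenter__ and removes exactly its own listener in __aexit__, instead
    of relying on the manager's remove_all_listeners() safety net."""

    def __init__(self, events: EventManager) -> None:
        self._events = events
        self.seen: list = []

    async def __aenter__(self) -> 'Snapshot':
        self._listener = self.seen.append
        self._events.on('persist_state', self._listener)
        return self

    async def __aexit__(self, *exc: object) -> None:
        self._events.off('persist_state', self._listener)


async def main() -> None:
    events = EventManager()
    async with Snapshot(events) as snap:
        events.emit('persist_state', 'during')
    events.emit('persist_state', 'after')  # snap's listener is already gone
    print(snap.seen)  # ['during']


asyncio.run(main())
```

Because each component removes its own listener on exit, an intermediate crawler exit that skips `remove_all_listeners()` leaves no stale subscriptions behind.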
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Description
- `EventManager` was shut down prematurely when the first of multiple concurrent crawlers finished, leaving remaining crawlers with a broken event system.
- Internal components subscribed to the global `event_manager`, even if the `event_manager` argument was passed in the crawler's constructor.

Issues

- When using a custom `event_manager` to initialize the crawler, the global `event_manager` doesn't work. #1808

Testing

- The `EventManager` remains active as long as at least one crawler is running, and is shut down only after the last crawler finishes.
- `event_manager` instances are active while the crawler is running.