fix: Prevent premature EventManager shutdown when multiple crawlers share it (#1810)
base: master
Changes from all commits: 0ca0895, c2a2290, 6e15799, 4bf288f, ff4c976, 53d31e5, 88c35d5, 838853e
```diff
@@ -93,24 +93,20 @@ def __init__(
             delay=self._persist_state_interval,
         )

-        # Flag to indicate the context state.
-        self._active = False
+        # Reference count for active contexts.
+        self._active_ref_count = 0

     @property
     def active(self) -> bool:
         """Indicate whether the context is active."""
-        return self._active
+        return self._active_ref_count > 0

     async def __aenter__(self) -> EventManager:
-        """Initialize the event manager upon entering the async context.
+        """Initialize the event manager upon entering the async context."""
+        self._active_ref_count += 1
+        if self._active_ref_count > 1:
+            return self
```
---

Collaborator: Potential task leak for multi-crawler scenarios; flagged by Claude.
```diff
-
-        Raises:
-            RuntimeError: If the context manager is already active.
-        """
-        if self._active:
-            raise RuntimeError(f'The {self.__class__.__name__} is already active.')
-
-        self._active = True
         self._emit_persist_state_event_rec_task.start()
         return self
```
```diff
@@ -127,17 +123,23 @@ async def __aexit__(
         Raises:
             RuntimeError: If the context manager is not active.
         """
-        if not self._active:
+        if not self.active:
            raise RuntimeError(f'The {self.__class__.__name__} is not active.')

+        if self._active_ref_count > 1:
+            # Emit persist state event to ensure the latest state is saved before closing the context.
+            await self._emit_persist_state_event()
+            self._active_ref_count -= 1
+            return
```
---

Collaborator: We don't call the listener cleanup here. From Claude: Consideration: an intermediate exit skips listener cleanup. When a non-final context exits through this early-return branch, the listeners registered during it stay subscribed. In practice this is likely fine because individual components remove their own listeners. Worth adding a comment here explaining this design choice, e.g. that individual components are responsible for their own listener cleanup.
```diff
         # Stop persist state event periodic emission and manually emit last one to ensure latest state is saved.
         await self._emit_persist_state_event_rec_task.stop()
```
Mantisus marked this conversation as resolved.
```diff
         await self._emit_persist_state_event()
         await self.wait_for_all_listeners_to_complete(timeout=self._close_timeout)
         self._event_emitter.remove_all_listeners()
         self._listener_tasks.clear()
         self._listeners_to_wrappers.clear()
-        self._active = False
+        self._active_ref_count -= 1

     @overload
     def on(self, *, event: Literal[Event.PERSIST_STATE], listener: EventListener[EventPersistStateData]) -> None: ...
```
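The reference counting in the diff above can be distilled into a short, self-contained sketch. Everything here is a simplified stand-in (a hypothetical `RefCountedManager`), not the actual crawlee `EventManager`; it only demonstrates the invariant the PR aims for: teardown runs exactly once, on the last exit.

```python
import asyncio


class RefCountedManager:
    """Toy illustration of the reference-counting pattern; not the crawlee EventManager."""

    def __init__(self) -> None:
        self._active_ref_count = 0
        self.cleanup_calls = 0

    @property
    def active(self) -> bool:
        # "Active" is derived from the count, so there is no separate flag to keep in sync.
        return self._active_ref_count > 0

    async def __aenter__(self) -> 'RefCountedManager':
        self._active_ref_count += 1
        # On the first entry, background tasks would be started here;
        # later entries just track the additional reference.
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        if not self.active:
            raise RuntimeError('Not active.')
        if self._active_ref_count > 1:
            # Other contexts still depend on this manager: skip teardown.
            self._active_ref_count -= 1
            return
        # Last exit: full teardown runs exactly once.
        self.cleanup_calls += 1
        self._active_ref_count -= 1


async def main() -> None:
    manager = RefCountedManager()
    async with manager:      # first "crawler"
        async with manager:  # second "crawler" shares the same manager
            assert manager.active
        # The inner exit must NOT tear the manager down.
        assert manager.active and manager.cleanup_calls == 0
    # The outer (last) exit performs the teardown exactly once.
    assert not manager.active and manager.cleanup_calls == 1


asyncio.run(main())
```

Deriving `active` from the count (rather than keeping a boolean flag alongside it) mirrors the change to the `active` property in the diff and removes one piece of state that could drift.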
---

I'm not so sure about this. Why should the crawler initialize a component that it doesn't use? And then tear it down? I think we should try to see a bigger picture first.

The linked issues are caused by `Snapshotter` and `RecoverableState` subscribing to an event manager that doesn't emit anything - how does that happen?
---

I would prefer to remove `event_manager` from the `BasicCrawler` constructor entirely and keep only the global one, configurable via `service_locator.set_event_manager`. However, this requires discussion and should probably be part of v2.

The global `event_manager` being started inside the crawler, while only being used by the crawler's internal components, is an early architectural decision. I assume that it was done for a better user experience.

I believe the root cause is bugs introduced during our work on supporting multiple parallel crawlers:

- See `test_multiple_crawlers_with_global_event_manager`: the first crawler to activate the `event_manager` is fully responsible for its lifecycle. When that crawler finishes, it tears down the `event_manager`, which clears all subscriptions, leaving any still-running crawlers without a functioning `event_manager`.
- `_service_locator` in `BasicCrawler` caused the user-provided `event_manager` to be used only for `Event.CRAWLER_STATUS` in `_crawler_state_task`, while internal components (`Snapshotter`, `RecoverableState`) continued subscribing to the global `service_locator`, which may never be started.

All that made the behavior of `service_locator.get_event_manager()` unstable and unpredictable. This PR is an attempt to fix these issues within the current architecture without introducing breaking changes.
---

I think that the fixes that solve two crawlers using the same `EventManager` are good. Regarding the global event manager, I see this as a quick fix. I think it would lead to counterintuitive behavior, because `Configuration` has `EventManager`-specific fields and they would be ignored in such cases.

Maybe it would be possible for each subscriber to the event manager to have at least an optional init argument that would allow passing an explicit `event_manager` (if it is creating storages, then maybe a `service_locator` argument would be even better) instead of relying on the global one.
event_manager(if it is creating storages, then maybeservice_locatorargument would be even better) instead of relying on the global one.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When porting the service locator to crawlee-js (apify/crawlee#3325), I actually made the global
serviceLocatorobject into a proxy that resolves to the crawler-scoped ServiceLocator when called from a crawler instance and to the global one otherwise.If we adopted that in Python as well, it would solve the inconsistency caused by
Snapshotterand friends still using the global locator.There was a problem hiding this comment.
---

Thanks, I'll study that solution.
---
Let me know if you get stuck on something, it's kinda magic. But that's the cost of making the library usable without factories, builders and explicit dependency containers.