fix: add a health manager for restarting unhealthy mqtt connections #605
`roborock/mqtt/health_manager.py` (new file):

```python
"""A health manager for monitoring MQTT connections to Roborock devices.

We observe a problem where sometimes the MQTT connection appears to be alive but
no messages are being received. To mitigate this, we track consecutive timeouts
and restart the connection if too many timeouts occur in succession.
"""

import datetime
from collections.abc import Awaitable, Callable

# Number of consecutive timeouts before considering the connection unhealthy.
TIMEOUT_THRESHOLD = 3

# We won't restart the session more often than this interval.
RESTART_COOLDOWN = datetime.timedelta(minutes=30)


class HealthManager:
    """Manager for monitoring the health of MQTT connections.

    This tracks communication timeouts and can trigger restarts of the MQTT
    session if too many timeouts occur in succession.
    """

    def __init__(self, restart: Callable[[], Awaitable[None]]) -> None:
        """Initialize the health manager.

        Args:
            restart: A callable to restart the MQTT session.
        """
        self._consecutive_timeouts = 0
        self._restart = restart
        self._last_restart: datetime.datetime | None = None

    async def on_success(self) -> None:
        """Record a successful communication event."""
        self._consecutive_timeouts = 0

    async def on_timeout(self) -> None:
        """Record a timeout event.

        This may trigger a restart of the MQTT session if too many timeouts
        have occurred in succession.
        """
        self._consecutive_timeouts += 1
        if self._consecutive_timeouts >= TIMEOUT_THRESHOLD:
            now = datetime.datetime.now(datetime.UTC)
            if self._last_restart is None or now - self._last_restart >= RESTART_COOLDOWN:
                await self._restart()
                self._last_restart = now
                self._consecutive_timeouts = 0
```
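For context, here is a minimal sketch of how the manager could be wired into a command loop. `FakeSession` and its methods are stand-ins for illustration, not names from this PR:

```python
import asyncio

from roborock.mqtt.health_manager import HealthManager


class FakeSession:
    """Hypothetical stand-in for the real MQTT session."""

    async def restart(self) -> None:
        print("restarting MQTT session")

    async def send_command(self, method: str) -> dict:
        await asyncio.sleep(30)  # simulate a device that never answers
        return {}


async def main() -> None:
    session = FakeSession()
    health = HealthManager(restart=session.restart)

    for _ in range(4):
        try:
            await asyncio.wait_for(session.send_command("get_status"), timeout=0.1)
        except asyncio.TimeoutError:
            # After TIMEOUT_THRESHOLD consecutive timeouts, the manager calls
            # restart(), rate-limited by RESTART_COOLDOWN.
            await health.on_timeout()
        else:
            # Any successful round-trip resets the consecutive-timeout counter.
            await health.on_success()


asyncio.run(main())
```

The callback design keeps the manager decoupled from any particular session implementation: it only needs an awaitable zero-argument restart function.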
Tests for the health manager (new file):

```python
"""Tests for the health manager."""

import datetime
from unittest.mock import AsyncMock, patch

from roborock.mqtt.health_manager import HealthManager


async def test_health_manager_restart_called_after_timeouts() -> None:
    """Test that the health manager calls restart after consecutive timeouts."""
    restart = AsyncMock()
    health_manager = HealthManager(restart=restart)

    await health_manager.on_timeout()
    await health_manager.on_timeout()
    restart.assert_not_called()

    await health_manager.on_timeout()
    restart.assert_called_once()


async def test_health_manager_success_resets_counter() -> None:
    """Test that a successful message resets the timeout counter."""
    restart = AsyncMock()
    health_manager = HealthManager(restart=restart)

    await health_manager.on_timeout()
    await health_manager.on_timeout()
    restart.assert_not_called()

    await health_manager.on_success()

    await health_manager.on_timeout()
    await health_manager.on_timeout()
    restart.assert_not_called()

    await health_manager.on_timeout()
    restart.assert_called_once()


async def test_cooldown() -> None:
    """Test that the health manager respects the restart cooldown."""
    restart = AsyncMock()
    health_manager = HealthManager(restart=restart)

    with patch("roborock.mqtt.health_manager.datetime") as mock_datetime:
        now = datetime.datetime(2023, 1, 1, 12, 0, 0)
        mock_datetime.datetime.now.return_value = now

        # Trigger first restart
        await health_manager.on_timeout()
        await health_manager.on_timeout()
        await health_manager.on_timeout()
        restart.assert_called_once()
        restart.reset_mock()

        # Advance time but stay within cooldown (30 mins)
        mock_datetime.datetime.now.return_value = now + datetime.timedelta(minutes=10)

        # Trigger timeouts again
        await health_manager.on_timeout()
        await health_manager.on_timeout()
        await health_manager.on_timeout()
        restart.assert_not_called()

        # Advance time past cooldown
        mock_datetime.datetime.now.return_value = now + datetime.timedelta(minutes=31)

        # Trigger timeouts again
        await health_manager.on_timeout()
        await health_manager.on_timeout()
        await health_manager.on_timeout()
        restart.assert_called_once()
```
A review comment on lines +66 to +72 suggests tightening the end of `test_cooldown`:

```python
# The consecutive timeout counter is now at 3
assert health_manager._consecutive_timeouts == 3
# Advance time past cooldown
mock_datetime.datetime.now.return_value = now + datetime.timedelta(minutes=31)
# Even a single timeout now triggers restart
await health_manager.on_timeout()
```
This would technically trigger immediately with gathered function calls. But anything more than this is probably too much complexity.
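For instance (a hypothetical illustration, not code from this PR), timeouts reported concurrently cross the threshold in a single batch:

```python
import asyncio

from roborock.mqtt.health_manager import HealthManager


async def fake_restart() -> None:
    print("restart triggered")


async def demo() -> None:
    health = HealthManager(restart=fake_restart)
    # Three concurrent commands failing together cross TIMEOUT_THRESHOLD in
    # one batch, so the restart fires immediately rather than after three
    # separate communication attempts.
    await asyncio.gather(health.on_timeout(), health.on_timeout(), health.on_timeout())


asyncio.run(demo())  # prints "restart triggered"
```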
A potential follow-up could be to keep track of the last timeout: only increment the counter if the previous timeout was more than, say, 15 seconds ago. That could be a follow-up PR, though; I don't want to slow this one down as we are on a time crunch.
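One reading of that idea, sketched under the assumption that a timeout only counts toward the threshold when the previous one was more than ~15 seconds earlier (`MIN_TIMEOUT_SPACING` and the class name are made up for illustration):

```python
import datetime

# Hypothetical: timeouts closer together than this count as a single event.
MIN_TIMEOUT_SPACING = datetime.timedelta(seconds=15)


class SpacedHealthManager:
    """Variant sketch that ignores rapid bursts of timeouts."""

    def __init__(self) -> None:
        self._consecutive_timeouts = 0
        self._last_timeout: datetime.datetime | None = None

    async def on_timeout(self) -> None:
        now = datetime.datetime.now(datetime.UTC)
        # Only increment when the previous timeout was long enough ago, so
        # e.g. several gathered commands failing at once count as one.
        if self._last_timeout is None or now - self._last_timeout >= MIN_TIMEOUT_SPACING:
            self._consecutive_timeouts += 1
        self._last_timeout = now
```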
I think we're currently sending most commands serially, but yeah, this kind of heuristic is hard. I was also considering whether we could do it entirely in the MQTT session, but that's hard when you can't correlate incoming and outgoing messages to know whether something really did time out.

Not sure I get the timeout point you're making, but I'm interested in following up.