-
Notifications
You must be signed in to change notification settings - Fork 57
fix: Revert CLIENT_KEEPALIVE back to 60 #641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This reverts back to the old default since there is no specific reason to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR reverts the CLIENT_KEEPALIVE timeout value from 120 seconds back to 60 seconds, returning to the previous default setting as there was no specific reason for the increase (issue #639).
- Aligns CLIENT_KEEPALIVE with TOPIC_KEEPALIVE (both now 60 seconds)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| CLIENT_KEEPALIVE = datetime.timedelta(seconds=120) | ||
| CLIENT_KEEPALIVE = datetime.timedelta(seconds=60) | ||
| TOPIC_KEEPALIVE = datetime.timedelta(seconds=60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good with the change as is. But do we think the topic keepalive logic is okay? How often are we sending to a mqtt topic? Although, I guess otherthings should come in naturally that we didn't send, so maybe it's fine? I'll try to test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we want is to establish the topic connection and not have it constantly refresh, but mostly stay around. Before it was doing it on every RPC, which is obviously bad sow we want to find the right interval.
I assume what we want is something like this:
- have some common polling interval
- set the keepalive to be 1.5-2x higher than the polling interval.
So maybe we set this based on common polling interval for map data (alt: make configurable, pass in via callers so home assistant can set it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So i think the rational is:
- we don't normally poll cloud at all
- for map content, which relies on cloud, we use the V1_CLOUD_IN_CLEANING_INTERVAL only during cleaning. That is 30 seconds, then this keeping topics around for 60 seconds is 2x that and within that rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes in an ideal setup, we should just be using cloud for map. 30 seconds seems correct for the interval at which we update.
What classifies as a success for rhe topic keepalive? A message we requested or any message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any time a message is requested it will reset the timer.
So, a caveat is: We do /not/ have a notion of topic health. That said, i don't think opening and closing a topic subscription helps with health and the main connection issues we see are at the session level. So, our MQTT health manager should manage looking for lots of failed RPCs when things are wedged and should reconnect the client.
Just to reset, i think we have two things we care about:
(1) Priority 0: Do not get "stuck" at 3am
(2) Priority 1: Do not cause lots of noise / unavailability at 3am and paper over any reconnects etc. (If someoen tries to start cleaning, too bad, but at least it's not causing disconnect noise if its idle).
I don't think we're done with #2 yet and may need client (home assistant) solutions to ignore refresh errors. e.g. (a) fail 3 times before marking unavailable always (b) ignore errors for ~n times between 3am and 3:10am (c) do not even try to send RPCs at 3am to 3:10am)
This reverts back to the old default since there is no specific reason to change it. This reverts part of #632
Issue #639