fix: Resolve monitor refresh loop #702
Extracts the skip condition in the background monitor refresh loop into shouldSkipMonitorRefresh() and adds a failing test that demonstrates the bug: after the first successful config update, the loop never updates ceph.conf again even when new monitors join the cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
We have a bit of an interesting situation here. There was a bug that I've already fixed in the commit that has been pushed. Unfortunately it introduced a race condition that was resolved before due to some incorrectness that actually fixed the race condition and stopped
Context: We have
We then have bootstrap that can end up running concurrently with
Bootstrap should be the source of truth because it can end up inserting a slightly different value if the user submits
An easy solution here would be to insert it with
I was thinking this could be tackled with a condition variable gated on bootstrap having been executed, but I think the issue is that there are scenarios where bootstrap will never run, such as restarts. We may need some other condition that tells us bootstrapping has finished, possibly an arbitrary marker such as something being written to the database. I will continue with this next week.
6fc4791 to
4fb0e87
The skip condition used || instead of &&, causing the loop to always skip after the first successful config update regardless of whether the monitor list had changed.

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
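The short circuit described in this commit can be illustrated with a minimal Go sketch. The signature of shouldSkipMonitorRefresh, the second operand (an "unchanged monitor list" comparison), and the buggySkip helper are assumptions made for illustration; the real function in the PR may take different arguments.

```go
package main

import (
	"fmt"
	"reflect"
)

// buggySkip mirrors the reported bug (hypothetical reconstruction):
// once first is false, !first is true and || short-circuits,
// so the monitor lists are never compared.
func buggySkip(first bool, oldMons, newMons []string) bool {
	return !first || reflect.DeepEqual(oldMons, newMons)
}

// shouldSkipMonitorRefresh is a sketch of the corrected condition
// (assumed signature): skip only when this is not the first
// iteration AND the monitor list is unchanged.
func shouldSkipMonitorRefresh(first bool, oldMons, newMons []string) bool {
	return !first && reflect.DeepEqual(oldMons, newMons)
}

func main() {
	oldMons := []string{"10.0.0.1"}
	newMons := []string{"10.0.0.1", "10.0.0.2"} // a new monitor joined

	// After the first update, the buggy form skips even though the list changed.
	fmt.Println("buggy skip:", buggySkip(false, oldMons, newMons))
	// The corrected form does not skip, so ceph.conf would be refreshed.
	fmt.Println("fixed skip:", shouldSkipMonitorRefresh(false, oldMons, newMons))
}
```

With these inputs the buggy form reports true and the corrected form reports false, which matches the behaviour the failing test is described as demonstrating.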
Before this fix, the background monitor refresh loop called UpdateConfig on its first iteration unconditionally (first=true), which called backwardCompatPubnet. With no monitors in the database yet, the compat shim detected the public network from the node hostname and wrote it via database.CreateConfigItem. When bootstrap subsequently called PopulateBootstrapDatabase it attempted to CreateConfigItem the same key, hitting a unique constraint violation and failing entirely.

The fix gates backwardCompatPubnet on at least one monitor service being registered in the database. Monitors are only present after bootstrap, so the shim can no longer race with the bootstrap path.

Also fixes a secondary bug where the error from the CreateConfigItem transaction was silently dropped (return nil instead of return err).

Three function variables (getMonitorCount, fetchConfigDb, insertPubnetRecord) are extracted to allow unit testing without a real database, following the existing pattern in join.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
4fb0e87 to
dd1775b
…on test

The comment in test_sequential_join_mon_hosts claimed a 10-minute timeout (120 attempts × 5 s) but max_attempts was 24 (2 minutes). Fix the comment to match the actual value.

Also simplify the ceph.conf polling conditions from `grep -c ... | grep -q "^[1-9]"` to plain `grep -q`, which is clearer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
I thought the original problem only required changing the skip conditions: It should be
That is correct @utkarsh. There was also a race condition caused by this change, in a section that was previously dead code. The modification you mentioned has been made, and the race condition has been fixed as well.
Description
Extracts the skip condition in the background monitor refresh loop into shouldSkipMonitorRefresh() and adds a failing test that demonstrates the bug: after the first successful config update, the loop never updates ceph.conf again even when new monitors join the cluster.
The fix involves correcting the `!first || ..` short circuit.
After fixing the short circuit, a different issue appeared that was failing: #702 (comment)
This ended up being more complicated than the original problem to fix.
Fixes #556