fix: Resolve monitor refresh loop by johnramsden · Pull Request #702 · canonical/microceph

johnramsden · 2026-03-26T02:05:05Z

Description

Extracts the skip condition in the background monitor refresh loop into shouldSkipMonitorRefresh() and adds a failing test that demonstrates the bug: after the first successful config update, the loop never updates ceph.conf again even when new monitors join the cluster.

Fix involves correcting !first || .. short circuit

After fixing the short circuit a different issue appeared that was failing #702 (comment)

This ended up being more complicated than the original problem to fix.

Fixes #556

Type of change

Delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)

How has this been tested?

Unit test exercises failure and watches for regression
Integration test verifies full workflow

Contributor checklist

Please check that you have:

self-reviewed the code in this PR
added code comments, particularly in less straightforward areas
checked and added or updated relevant documentation
checked and added or updated relevant release notes
added tests to verify effectiveness of this change

Extracts the skip condition in the background monitor refresh loop into shouldSkipMonitorRefresh() and adds a failing test that demonstrates the bug: after the first successful config update, the loop never updates ceph.conf again even when new monitors join the cluster. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: John Ramsden <john.ramsden@canonical.com>

johnramsden · 2026-03-27T22:03:19Z

We have a bit of an interesting situation here.

There was a bug I've already fixed in the commit that has been pushed. Unfortunately it introduced a race condition that was resolved before due to some incorrectness that actually fixed the race condition and stopped backwardCompatPubnet from executing.

Context:

We have backwardCompatPubnet whos responsibility is to make sure public_network is present in the database.

We then have bootstrap that can end up running concurrently with backwardCompatPubnet because backwardCompatPubnet is occurring in a go routine in ceph.Start.

Bootstrap should be the source of truth because it can end up inserting a slightly different value if the user submits --public-network

backwardCompatPubnet will try to set the missing public_network if it is not present and this can be mismatched to what was passed to bootstrap.

An easy solution here would be to insert it with backwardCompatPubnet if none present, but then upsert it in bootstrap since that should be the source of truth (right now it is failing in bootstrap because of the race condition between it and backwardCompatPubnet). If it's already present backwardCompatPubnet will not write it so we don't need to worry about that condition. The concern I have with this is we could end up with an incorrect public network for a very short amount of time and I would be concerned you could end up with services on the wrong network as a result.

I was thinking this could be tackled with a condition variable gated on bootstrap having been executed but I think the issue here is there are scenarios where it will never run such as restarts.

May need to know some sort of other condition that lets us know bootstrapping has finished. Might need to gate on some sort of arbitrary marker that we have such as something being written to the database.

I will continue with this next week

The skip condition used || instead of &&, causing the loop to always skip after the first successful config update regardless of whether the monitor list had changed. Signed-off-by: John Ramsden <john.ramsden@canonical.com>

Before this fix, the background monitor refresh loop called UpdateConfig on its first iteration unconditionally (first=true), which called backwardCompatPubnet. With no monitors in the database yet, the compat shim detected the public network from the node hostname and wrote it via database.CreateConfigItem. When bootstrap subsequently called PopulateBootstrapDatabase it attempted to CreateConfigItem the same key, hitting a unique constraint violation and failing entirely. The fix gates backwardCompatPubnet on at least one monitor service being registered in the database. Monitors are only present after bootstrap, so the shim can no longer race with the bootstrap path. Also fixes a secondary bug where the error from the CreateConfigItem transaction was silently dropped (return nil instead of return err). Three function variables (getMonitorCount, fetchConfigDb, insertPubnetRecord) are extracted to allow unit testing without a real database, following the existing pattern in join.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: John Ramsden <john.ramsden@canonical.com>

…on test The comment in test_sequential_join_mon_hosts claimed a 10-minute timeout (120 attempts × 5 s) but max_attempts was 24 (2 minutes). Fix the comment to match the actual value. Also simplify the ceph.conf polling conditions from `grep -c ... | grep -q "^[1-9]"` to plain `grep -q`, which is clearer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: John Ramsden <john.ramsden@canonical.com>

UtkarshBhatthere · 2026-04-03T11:17:22Z

I thought the original problem only required changing the skip conditions:

// Check if we need to 
if !first || reflect.DeepEqual(oldMonitors, monitors) {
	logger.Debugf("start: monitors unchanged, sleeping: %v", monitors)
	time.Sleep(time.Minute)
	continue
}

It should be if !first && reflect.DeepEqual(oldMon, monitors)
Since we are checking with A OR B if any of them is true, updation will be skipped. !first is true in all subsequent runs.

johnramsden · 2026-04-03T17:36:07Z

That is correct @utkarsh, there was also a race condition that was caused by this change in the section that was previously dead code. The modification you mentioned has been made and the race condition fix was also made

johnramsden changed the title ~~John/mon refresh loop~~ fix: Resolve refresh loop Mar 26, 2026

johnramsden changed the title ~~fix: Resolve refresh loop~~ fix: Resolve monitor refresh loop Mar 26, 2026

johnramsden force-pushed the john/mon-refresh-loop branch 2 times, most recently from 6fc4791 to 4fb0e87 Compare March 31, 2026 22:39

johnramsden and others added 2 commits March 31, 2026 15:51

fix: Update monitor refresh loop to run when monitors change

3e4a8b4

The skip condition used || instead of &&, causing the loop to always skip after the first successful config update regardless of whether the monitor list had changed. Signed-off-by: John Ramsden <john.ramsden@canonical.com>

johnramsden force-pushed the john/mon-refresh-loop branch from 4fb0e87 to dd1775b Compare March 31, 2026 22:53

johnramsden marked this pull request as ready for review March 31, 2026 23:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Resolve monitor refresh loop#702

fix: Resolve monitor refresh loop#702
johnramsden wants to merge 4 commits intocanonical:mainfrom
johnramsden:john/mon-refresh-loop

johnramsden commented Mar 26, 2026 •

edited

Loading

Uh oh!

johnramsden commented Mar 27, 2026

Uh oh!

UtkarshBhatthere commented Apr 3, 2026 •

edited

Loading

Uh oh!

johnramsden commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

johnramsden commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

How has this been tested?

Contributor checklist

Uh oh!

johnramsden commented Mar 27, 2026

Uh oh!

UtkarshBhatthere commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

johnramsden commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

johnramsden commented Mar 26, 2026 •

edited

Loading

UtkarshBhatthere commented Apr 3, 2026 •

edited

Loading