Skip to content

fix: Resolve monitor refresh loop#702

Open
johnramsden wants to merge 4 commits intocanonical:mainfrom
johnramsden:john/mon-refresh-loop
Open

fix: Resolve monitor refresh loop#702
johnramsden wants to merge 4 commits intocanonical:mainfrom
johnramsden:john/mon-refresh-loop

Conversation

@johnramsden
Copy link
Copy Markdown
Member

@johnramsden johnramsden commented Mar 26, 2026

Description

Extracts the skip condition in the background monitor refresh loop into shouldSkipMonitorRefresh() and adds a failing test that demonstrates the bug: after the first successful config update, the loop never updates ceph.conf again even when new monitors join the cluster.

Fix involves correcting !first || .. short circuit

After fixing the short circuit a different issue appeared that was failing #702 (comment)

This ended up being more complicated than the original problem to fix.

Fixes #556

Type of change

Delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How has this been tested?

  • Unit test exercises failure and watches for regression
  • Integration test verifies full workflow

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

Extracts the skip condition in the background monitor refresh loop into
shouldSkipMonitorRefresh() and adds a failing test that demonstrates the
bug: after the first successful config update, the loop never updates
ceph.conf again even when new monitors join the cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden johnramsden changed the title John/mon refresh loop fix: Resolve refresh loop Mar 26, 2026
@johnramsden johnramsden changed the title fix: Resolve refresh loop fix: Resolve monitor refresh loop Mar 26, 2026
@johnramsden
Copy link
Copy Markdown
Member Author

We have a bit of an interesting situation here.

There was a bug I've already fixed in the commit that has been pushed. Unfortunately it introduced a race condition that was resolved before due to some incorrectness that actually fixed the race condition and stopped backwardCompatPubnet from executing.

Context:

We have backwardCompatPubnet whos responsibility is to make sure public_network is present in the database.

We then have bootstrap that can end up running concurrently with backwardCompatPubnet because backwardCompatPubnet is occurring in a go routine in ceph.Start.

Bootstrap should be the source of truth because it can end up inserting a slightly different value if the user submits --public-network

backwardCompatPubnet will try to set the missing public_network if it is not present and this can be mismatched to what was passed to bootstrap.

An easy solution here would be to insert it with backwardCompatPubnet if none present, but then upsert it in bootstrap since that should be the source of truth (right now it is failing in bootstrap because of the race condition between it and backwardCompatPubnet). If it's already present backwardCompatPubnet will not write it so we don't need to worry about that condition. The concern I have with this is we could end up with an incorrect public network for a very short amount of time and I would be concerned you could end up with services on the wrong network as a result.

I was thinking this could be tackled with a condition variable gated on bootstrap having been executed but I think the issue here is there are scenarios where it will never run such as restarts.

May need to know some sort of other condition that lets us know bootstrapping has finished. Might need to gate on some sort of arbitrary marker that we have such as something being written to the database.

I will continue with this next week

@johnramsden johnramsden force-pushed the john/mon-refresh-loop branch 2 times, most recently from 6fc4791 to 4fb0e87 Compare March 31, 2026 22:39
johnramsden and others added 2 commits March 31, 2026 15:51
The skip condition used || instead of &&, causing the loop to always
skip after the first successful config update regardless of whether
the monitor list had changed.

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Before this fix, the background monitor refresh loop called UpdateConfig
on its first iteration unconditionally (first=true), which called
backwardCompatPubnet. With no monitors in the database yet, the compat
shim detected the public network from the node hostname and wrote it via
database.CreateConfigItem. When bootstrap subsequently called
PopulateBootstrapDatabase it attempted to CreateConfigItem the same key,
hitting a unique constraint violation and failing entirely.

The fix gates backwardCompatPubnet on at least one monitor service being
registered in the database. Monitors are only present after bootstrap, so
the shim can no longer race with the bootstrap path.

Also fixes a secondary bug where the error from the CreateConfigItem
transaction was silently dropped (return nil instead of return err).

Three function variables (getMonitorCount, fetchConfigDb,
insertPubnetRecord) are extracted to allow unit testing without a real
database, following the existing pattern in join.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden johnramsden force-pushed the john/mon-refresh-loop branch from 4fb0e87 to dd1775b Compare March 31, 2026 22:53
@johnramsden johnramsden marked this pull request as ready for review March 31, 2026 23:17
…on test

The comment in test_sequential_join_mon_hosts claimed a 10-minute timeout
(120 attempts × 5 s) but max_attempts was 24 (2 minutes). Fix the comment
to match the actual value.

Also simplify the ceph.conf polling conditions from
`grep -c ... | grep -q "^[1-9]"` to plain `grep -q`, which is clearer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@UtkarshBhatthere
Copy link
Copy Markdown
Contributor

UtkarshBhatthere commented Apr 3, 2026

I thought the original problem only required changing the skip conditions:

// Check if we need to 
if !first || reflect.DeepEqual(oldMonitors, monitors) {
	logger.Debugf("start: monitors unchanged, sleeping: %v", monitors)
	time.Sleep(time.Minute)
	continue
}

It should be if !first && reflect.DeepEqual(oldMon, monitors)
Since we are checking with A OR B if any of them is true, updation will be skipped. !first is true in all subsequent runs.

@johnramsden
Copy link
Copy Markdown
Member Author

That is correct @utkarsh, there was also a race condition that was caused by this change in the section that was previously dead code. The modification you mentioned has been made and the race condition fix was also made

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Only one mon host in ceph.conf on multi node cluster

2 participants