sync_switch_configuration: Don't allow dueling Nexuses based on a stale blueprint#10336

Open
jgallagher wants to merge 8 commits into main from john/bootstore-external-networking-gen

Conversation

@jgallagher
Contributor

This commit has two primary changes:

  • Change the bootstore's SystemNetworkingConfig type to carry an extra generation number attached to the service NAT entries. This is a copy of the external_networking_generation of the blueprint from which the NAT entries were extracted. (This requires bumping the sled-agent API and all the extra machinery for changing the bootstore type; it is a little noisy but very mechanical.)
  • Change the sync_switch_configuration bg task to inspect this generation number and skip updating the bootstore if its currently-loaded blueprint has an older external_networking_generation (i.e., is stale).
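
The staleness check described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Generation` and `blueprint_is_stale` are hypothetical names, and treating a bootstore config with no recorded generation as "not stale" is an assumption made for the sketch.

```rust
/// Hypothetical stand-in for the blueprint's `external_networking_generation`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Generation(pub u64);

/// Returns `true` if the Nexus holding a blueprint at generation `ours`
/// should skip the bootstore update because the bootstore already records
/// a newer generation.
///
/// `bootstore` is `None` when the stored config predates the generation
/// field. Assumption for this sketch: such a config never makes us stale.
pub fn blueprint_is_stale(ours: Generation, bootstore: Option<Generation>) -> bool {
    match bootstore {
        // Strictly older than what is stored: our blueprint is stale.
        Some(current) => ours < current,
        // No generation recorded: nothing to compare against.
        None => false,
    }
}
```

With this shape, two Nexuses with different blueprints cannot overwrite each other indefinitely: the one with the older generation backs off as soon as the newer generation lands in the bootstore.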

Fixes #10320. Staged on top of #10331.

@rcgoodfellow added the networking label Apr 30, 2026
Base automatically changed from john/bp-external-networking-gen to main May 4, 2026 14:24
@jgallagher force-pushed the john/bootstore-external-networking-gen branch from 9a4f1a3 to 11d2e05 May 4, 2026 14:48
@sunshowers
Contributor

Starting to review this:

> Fixes #10320.

As discussed in the update watercooler in the morning, this isn't quite right I think, since it is a partial fix. You may want to remove this from the PR description so 10320 doesn't get auto-closed.

@jgallagher
Contributor Author

> Starting to review this:
>
> > Fixes #10320.
>
> As discussed in the update watercooler in the morning, this isn't quite right I think, since it is a partial fix. You may want to remove this from the PR description so 10320 doesn't get auto-closed.

Ehh, #10320 notes that it's specifically about the blueprint issue. I should open a separate issue for the network config side.

Contributor

@sunshowers left a comment

Looks great, just a few questions.

Comment on lines +2463 to +2468
    warn!(
        log,
        "skipping bootstore update due to stale blueprint";
        "our-blueprint-gen" => desired_gen,
        "bootstore-blueprint-gen" => current_gen,
    );
Contributor

Worth logging the values of rnc_differs and nat_differs here?

Comment thread nexus/src/app/background/tasks/sync_switch_configuration.rs
Comment thread nexus/src/app/background/tasks/sync_switch_configuration.rs
Comment thread nexus/src/app/background/tasks/sync_switch_configuration.rs
Comment on lines +2429 to +2435
    // * "Definitely not" if our generation is older than the generation
    //   currently in the bootstore; this indicates we've produced
    //   `desired_blueprint_networking_config` based on a stale blueprint.
    // * "Definitely yes" if our generation is not older than the current
    //   bootstore generation and there have been changes to the config.
    // * "No information" if our NAT entries haven't changed, we fall through to
    //   checking the rack network config below.
Contributor

could you rewrite this as a decision table? If I understand correctly "No information" is "desired generation >= current gen and NAT entries equal" -- a decision table would make this clearer and would help visually inspect that all cases are handled.
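
For illustration, one reading of the requested decision table, written as an exhaustive Rust `match`. This is a sketch only: `NatDecision` and `decide_from_nat` are hypothetical names, the inputs reuse `desired_gen`, `current_gen`, and `nat_differs` from the thread, and the mapping follows the interpretation in the comment above rather than the actual diff.

```rust
/// Hypothetical outcome of the NAT-generation check.
#[derive(Debug, PartialEq, Eq)]
pub enum NatDecision {
    /// "Definitely not": our blueprint is stale, skip the bootstore update.
    Skip,
    /// "Definitely yes": not stale and the NAT entries changed.
    Update,
    /// "No information": fall through to the rack network config check.
    CheckRackNetworkConfig,
}

/// Decision table (assumed reading of the review comment):
///
/// | desired_gen vs current_gen | nat_differs | decision               |
/// |----------------------------|-------------|------------------------|
/// | older                      | any         | Skip                   |
/// | same or newer              | true        | Update                 |
/// | same or newer              | false       | CheckRackNetworkConfig |
pub fn decide_from_nat(desired_gen: u64, current_gen: u64, nat_differs: bool) -> NatDecision {
    match (desired_gen < current_gen, nat_differs) {
        (true, _) => NatDecision::Skip,
        (false, true) => NatDecision::Update,
        (false, false) => NatDecision::CheckRackNetworkConfig,
    }
}
```

Matching on the tuple lets the compiler confirm that every combination is handled, which is the visual check the decision table is meant to provide.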

Comment on lines +52 to +54
    // 2. Backwards compatibility: prior versions of this type did not store
    //    this information at all, and we must be able to cleanly handle that at
    //    runtime.
Contributor

"At runtime" here includes within the bootstore, right?

Contributor Author

Yeah, this is kind of vague, isn't it? Do you have a preference between cutting this off:

    // 2. Backwards compatibility: prior versions of this type did not store
    //    this information at all.

or changing it to specify that we treat older versions as None:

    // 2. Backwards compatibility: prior versions of this type did not store
    //    this information at all. If the bootstore contains an earlier
    //    `SystemNetworkingConfig` that we need to convert to the latest
    //    version, `blueprint_external_networking_config` will be `None`.

Contributor

I like the second version.
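
The behavior the second wording describes could look roughly like the sketch below. All type shapes here are assumptions for illustration: `OldSystemNetworkingConfig`, the `String` placeholder fields, and the contents of `BlueprintExternalNetworkingConfig` are hypothetical. Only the `SystemNetworkingConfig` name and the `blueprint_external_networking_config` field name come from the thread.

```rust
/// Hypothetical earlier version of the bootstore type, with no
/// blueprint-derived networking information.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OldSystemNetworkingConfig {
    /// Placeholder for the real rack network config.
    pub rack_network_config: String,
}

/// Hypothetical shape of the blueprint-derived data: the generation plus
/// the service NAT entries it applies to.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlueprintExternalNetworkingConfig {
    pub generation: u64,
    pub service_nat_entries: Vec<String>,
}

/// Sketch of the latest version of the type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SystemNetworkingConfig {
    pub rack_network_config: String,
    /// `None` when converted from an earlier version that never stored it.
    pub blueprint_external_networking_config: Option<BlueprintExternalNetworkingConfig>,
}

impl From<OldSystemNetworkingConfig> for SystemNetworkingConfig {
    fn from(old: OldSystemNetworkingConfig) -> Self {
        Self {
            rack_network_config: old.rack_network_config,
            // Earlier versions carried no generation or NAT entries.
            blueprint_external_networking_config: None,
        }
    }
}
```

Using `Option` rather than a default generation keeps "never recorded" distinguishable from "recorded at the initial generation".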

@sunshowers
Contributor

> Ehh, #10320 notes that it's specifically about the blueprint issue. I should open a separate issue for the network config side.

sure, that works :)

Labels

networking Related to the networking.

Development

Successfully merging this pull request may close these issues.

Nexuses updating the bootstore-replicated network config can duel

3 participants