Introduce sled-agent-scrimlet-reconcilers crate #10313

Open
jgallagher wants to merge 10 commits into main from john/scrimlet-reconcilers-1

@jgallagher jgallagher commented Apr 23, 2026

This is groundwork for #10167, and introduces the skeleton of network config reconcilers for use within sled-agent. None of this is wired up yet and all the service-specific reconcilers are placeholders, but it does have the real setup for how these tasks get started and how they report status.

The PR is pretty big but hopefully not too bad to review; more than half the code falls into either "tests", "status type definitions", or "placeholder/dummy reconcilers". A tentative suggestion for review order is:

  1. The crate-level docs in lib.rs. These are written assuming #10167 (Tracking issue: Moving system-level networking reconciliation from Nexus to sled-agent) is complete; they do not describe the current state of the crate.
  2. handle.rs, particularly ScrimletReconcilers. This is the entry point for sled-agent, which will hold a ScrimletReconcilers in its set of long-running tasks.
  3. reconciler_task.rs. This implements the common control flow for all of the service-specific reconcilers in the crate: periodic reactivation, activation when the config changes, transitioning to inert if we stop being a scrimlet because the sidecar goes away at runtime, and transitioning out of inert if it comes back.
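The control flow described for reconciler_task.rs can be read as a small state machine. The following is a minimal sketch of that reading, not the code in this PR: every name (`Wakeup`, `TaskState`, `Action`, `step`) is hypothetical, and two behaviors are assumptions, namely that an inert task ignores timer ticks and config changes, and that a returning sidecar triggers an immediate reconciliation pass.

```rust
/// Why the reconciler task woke up. Hypothetical names throughout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Wakeup {
    /// The periodic reactivation timer fired.
    Timer,
    /// The desired config changed.
    ConfigChanged,
    /// The sidecar went away at runtime: we stopped being a scrimlet.
    SidecarGone,
    /// The sidecar came back: we are a scrimlet again.
    SidecarReturned,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Active,
    Inert,
}

/// What the task does after handling a wakeup.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Reconcile,
    Idle,
}

/// One step of the shared control flow: given the current state and the
/// reason for waking up, return the next state and what to do.
pub fn step(state: TaskState, wakeup: Wakeup) -> (TaskState, Action) {
    match (state, wakeup) {
        // Losing the sidecar always makes the task inert.
        (_, Wakeup::SidecarGone) => (TaskState::Inert, Action::Idle),
        // Assumption: leaving inert triggers an immediate pass.
        (TaskState::Inert, Wakeup::SidecarReturned) => {
            (TaskState::Active, Action::Reconcile)
        }
        // Assumption: an inert task ignores timers and config changes.
        (TaskState::Inert, _) => (TaskState::Inert, Action::Idle),
        // An active task reconciles on every timer tick or config change.
        (TaskState::Active, Wakeup::Timer | Wakeup::ConfigChanged) => {
            (TaskState::Active, Action::Reconcile)
        }
        // Already active; a redundant "returned" signal changes nothing.
        (TaskState::Active, Wakeup::SidecarReturned) => {
            (TaskState::Active, Action::Idle)
        }
    }
}
```

Keeping the transition logic in a pure function like this makes it testable without timers or channels; the real task would presumably drive it from an async select loop.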

The only production-affecting change here is that the ThisSledSwitchZoneUnderlayIpAddr type moved out of sled-agent and into this crate, so sled-agent depends on this crate just for that type. Edit: As of #10340, ThisSledSwitchZoneUnderlayIpAddr has moved to sled-agent-types, so now this PR uses it from there and makes no changes to sled-agent proper.

@rcgoodfellow rcgoodfellow added the networking label Apr 30, 2026
@internet-diglett

Working on getting through this review today

}

async fn determine_switch_slot(
    running_reconcilers: Arc<OnceLock<RunningReconcilers>>,

TIL about OnceLock
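For anyone else new to it: `std::sync::OnceLock` is a thread-safe cell that can be written at most once and read any number of times. The sketch below is a generic illustration of the `Arc<OnceLock<T>>` pattern, not the logic of `determine_switch_slot`; `SwitchSlot` and `set_once_from_background` are hypothetical names.

```rust
use std::sync::{Arc, OnceLock};
use std::thread;

/// Hypothetical stand-in for a value that only becomes known later.
#[derive(Debug, PartialEq, Eq)]
pub struct SwitchSlot(pub u8);

/// Share an empty cell with a background thread, let that thread fill it,
/// and report what readers saw before and after.
pub fn set_once_from_background() -> (bool, Option<u8>) {
    let cell: Arc<OnceLock<SwitchSlot>> = Arc::new(OnceLock::new());

    // Before initialization, `get` returns None instead of blocking.
    let empty_before = cell.get().is_none();

    let writer = Arc::clone(&cell);
    thread::spawn(move || {
        // The first `set` wins and stores the value.
        let _ = writer.set(SwitchSlot(0));
        // A later `set` returns Err with the rejected value and leaves
        // the stored value untouched.
        let _ = writer.set(SwitchSlot(1));
    })
    .join()
    .expect("writer thread panicked");

    // After the writer finishes, every holder of the Arc sees the value.
    (empty_before, cell.get().map(|slot| slot.0))
}
```

Calling `set_once_from_background()` returns `(true, Some(0))`: the cell starts empty and the first write is the one that sticks.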

@internet-diglett internet-diglett left a comment

LGTM. I understand this is just a skeleton, but I wanted to ask about one of the config bits just in case:

I noticed the initial implementation relies on hard-coded port numbers. If we're running these reconcilers inside of integration tests we might need a way to inject the correct port numbers. This will also be necessary if the reconcilers are going to work in #9533 (we will be expecting that any config changes made via the operator APIs will eventually show up on mgd, dpd, etc.)

Is this something you are already accounting for with this design?

@jgallagher

> LGTM. I understand this is just a skeleton, but I wanted to ask about one of the config bits just in case:
>
> I noticed the initial implementation relies on hard-coded port numbers. If we're running these reconcilers inside of integration tests we might need a way to inject the correct port numbers. This will also be necessary if the reconcilers are going to work in #9533 (we will be expecting that any config changes made via the operator APIs will eventually show up on mgd, dpd, etc.)
>
> Is this something you are already accounting for with this design?

We chatted about this live. Tentative plan:

  • Change SledAgentNetworkingInfo from a struct to an enum that allows us to say "use these socket addresses for services instead of assuming their hard-coded ports inside the switch zone". I think in this mode we won't run the SMF-based reconcilers at all (for now), since they expect to connect to a real SMF inside an oxz_switch zone, neither of which exists in many of our test envs.
  • Figure out how to wire this up for #[nexus_test] based tests. Probably this means having one or more sim-sled-agent run these reconcilers in the test mode added in the previous bullet.
  • Consider using real services instead of httpmock for testing this crate itself. (E.g., in this PR I'm using mocks to sit in for MGS; if we have this test config mode, can I spin up real MGS in its test mode and point to that instead?)
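The first bullet (struct to enum) might look roughly like the sketch below. This is an illustration under stated assumptions, not the actual SledAgentNetworkingInfo: the type, variant, field, and method names are hypothetical, and the port constants are placeholder values rather than the real service ports.

```rust
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6};

// Placeholder values only; these are NOT the real service ports.
pub const PLACEHOLDER_MGS_PORT: u16 = 10001;
pub const PLACEHOLDER_DPD_PORT: u16 = 10002;
pub const PLACEHOLDER_MGD_PORT: u16 = 10003;

/// Where to reach each service. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ServiceAddrs {
    pub mgs: SocketAddr,
    pub dpd: SocketAddr,
    pub mgd: SocketAddr,
}

/// Hypothetical enum form of the networking info.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkingInfo {
    /// Production: services live in the switch zone on hard-coded ports.
    SwitchZone { underlay_ip: Ipv6Addr },
    /// Tests: the caller supplies a socket address for each service.
    ExplicitAddrs(ServiceAddrs),
}

impl NetworkingInfo {
    /// Resolve the address of every service for this mode.
    pub fn service_addrs(&self) -> ServiceAddrs {
        match *self {
            NetworkingInfo::SwitchZone { underlay_ip } => {
                // Combine the switch zone IP with each fixed port.
                let at = |port| {
                    SocketAddr::V6(SocketAddrV6::new(underlay_ip, port, 0, 0))
                };
                ServiceAddrs {
                    mgs: at(PLACEHOLDER_MGS_PORT),
                    dpd: at(PLACEHOLDER_DPD_PORT),
                    mgd: at(PLACEHOLDER_MGD_PORT),
                }
            }
            NetworkingInfo::ExplicitAddrs(addrs) => addrs,
        }
    }

    /// SMF-based reconcilers need a real SMF inside a real switch zone,
    /// so in this sketch they only run in the production mode.
    pub fn run_smf_reconcilers(&self) -> bool {
        matches!(self, NetworkingInfo::SwitchZone { .. })
    }
}
```

With this shape, a test harness would construct `ExplicitAddrs` from whatever ports its services bound to, and the reconcilers never need to know which mode they are in beyond calling `service_addrs()`.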

@jgallagher

> We chatted about this live. Tentative plan:
>
>   • Change SledAgentNetworkingInfo from a struct to an enum that allows us to say "use these socket addresses for services instead of assuming their hard-coded ports inside the switch zone". I think in this mode we won't run the SMF-based reconcilers at all (for now), since they expect to connect to a real SMF inside an oxz_switch zone, neither of which exists in many of our test envs.
>   • Figure out how to wire this up for #[nexus_test] based tests. Probably this means having one or more sim-sled-agent run these reconcilers in the test mode added in the previous bullet.
>   • Consider using real services instead of httpmock for testing this crate itself. (E.g., in this PR I'm using mocks to sit in for MGS; if we have this test config mode, can I spin up real MGS in its test mode and point to that instead?)

Bullet one is done in 7eeaf1d; instead of always taking a switch zone underlay IP, tests can now pass a set of SocketAddrs for each of the relevant services.

Bullet two is future work once we start integrating this, but hopefully we now have the tools to handle it.

Bullet three is (mostly) done in 885b6c5 - I kept httpmock for some of the failure path tests, but the happy path tests now spin up a real MGS.
