Skip to content

WIP: PMBus Status Registers via MGS#2538

Draft
jamesmunns wants to merge 5 commits into
masterfrom
james/pmbus-please
Draft

WIP: PMBus Status Registers via MGS#2538
jamesmunns wants to merge 5 commits into
masterfrom
james/pmbus-please

Conversation

@jamesmunns
Copy link
Copy Markdown
Contributor

This is a (still fairly early) PR towards #2463. Opening as draft PR to solicit some feedback.

This provides rail to function mapping to obtain I2cDevice and rail indexes.
Comment thread drv/i2c-devices/src/lib.rs Outdated
Comment on lines +314 to +315
status_fans_1_2: pmbus_rail_read!(dev, rail_idx, STATUS_FANS_1_2).map_err(drop)?.0,
status_fans_3_4: pmbus_rail_read!(dev, rail_idx, STATUS_FANS_3_4).map_err(drop)?.0,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm...I'd be interested to know what happens if we try to read either STATUS_FANS register from a device that doesn't have fans, or if we try to read STATUS_FANS_3_4 from a device that only has 1 or 2 fans. Do we get back a read that's all zeroes, or does the device return a NAK or something?

§17.10 and §17.10 in the PMBus Specification, Part II, Version 1.3.1 doesn't explicitly say what a device without fans is supposed to do, so, if I had to guess, it's probably left up to the manufacturer...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attempting to find out what happens when one does this experimentally, and I have discovered that, using the humility pmbus command to ask a RAA229620A for its STATUS_FANS_1_2, it just...prints nothing:

eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r VDDCR_CPU0_A0 -C STATUS_FANS_1_2
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
eliza@alfred ~ $

I'm not actually sure what "nothing" means here!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verbose mode also makes it say "nothing":

eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r VDDCR_CPU0_A0 -C STATUS_FANS_1_2 --verbose
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
eliza@alfred ~ $

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH, you have to pass the --errors flag to humility pmbus to make it show errors. So the RAA229620A returns Noregister when you ask for its STATUS_FANS_1_2 or STATUS_FANS_3_4:

eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r VDDCR_CPU0_A0 -C STATUS_FANS_1_2,STATUS_FANS_3_4 --verbose --errors
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
0x81 STATUS_FANS_1_2           Err(NoRegister)
0x82 STATUS_FANS_3_4           Err(NoRegister)
eliza@alfred ~ $

The ISL68224, ADM127X, TPS546B24A also do the same when asked for their STATUS_FANS registers, while the BMR491 appears to...still do something that results in Humility printing nothing?

eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r VDDCR_CPU0_A0 -C STATUS_FANS_1_2 --verbose --errors
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
0x81 STATUS_FANS_1_2           Err(NoRegister)
eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r VDDCR_CPU0_A0 -C STATUS_FANS_1_2,STATUS_FANS_3_4 --verbose --errors
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
0x81 STATUS_FANS_1_2           Err(NoRegister)
0x82 STATUS_FANS_3_4           Err(NoRegister)
eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r V1P8_SP5_A0 -C STATUS_FANS_1_2,STATUS_FANS_3_4 --verbose --errors
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
humility pmbus failed: rail V1P8_SP5_A0 not found
eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r V1P8_SP5_A1 -C STATUS_FANS_1_2,STATUS_FANS_3_4 --verbose --errors
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
0x81 STATUS_FANS_1_2           Err(NoRegister)
0x82 STATUS_FANS_3_4           Err(NoRegister)
eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r V54P5_IBC_A3 -C STATUS_FANS_1_2,STATUS_FANS_3_4 --verbose --errors
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
0x81 STATUS_FANS_1_2           Err(NoRegister)
0x82 STATUS_FANS_3_4           Err(NoRegister)
eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r V3P3_SP_A2 -C STATUS_FANS_1_2,STATUS_FANS_3_4 --verbose --errors
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
0x81 STATUS_FANS_1_2           Err(NoRegister)
0x82 STATUS_FANS_3_4           Err(NoRegister)
eliza@alfred ~ $ pfexec /staff/eliza/humility -t cosmo-hubris-sp pmbus -r V12_SYS_A2 -C STATUS_FANS_1_2,STATUS_FANS_3_4 --verbose --errors
humility: attached to 1fc9:0143:XJACFXSEKTQJS via CMSIS-DAP
eliza@alfred ~ $

This is sort of what I had figured, and suggests that we need to either have some additional configuration option that says whether or not to read the fans, and/or (and this would be my preference) change PmbusStatus::read_from to not bail out if any read returns an error, as I suggested in #2538 (comment).

// the same for all the items above, and save ourselved a little indirection, but for now
// just hole-punch the minimum amount necessary
status_mfr_specific: pmbus_rail_read!(@raw => dev, rail_idx, CommandCode::STATUS_MFR_SPECIFIC as u8, 1).map_err(drop)?[0],
})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love that this will bail out if any read returns an error. I think it would be much better to change the PmbusStatus type so that each register's field was a Result<u8, i2c::ResponseCode> or something. That way, we can always communicate whatever registers we were able to successfully read to upstack software even if we encounter some weird I2C bus weather or something. Also, if (as I mentioned in my other comment) a device without fans does choose to NAK one of the STATUS_FANS registers, that doesn't result in us being permanently unable to read its status.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do! I wondered whether to make these Options or Results, and I should definitely do one of those.

I wonder if we could capture "this device supports these statuses" in the pmbus crate and maybe codegen based on this so we don't spend time doing I2C we know is going to be NAK'd, but this could also be a good follow-up.

Since we're not using device-specific methods here, I would imagine we would want to pass in some kind of bit-flag of which ones to check or not, and just have a None or Err(Unsupported).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could capture "this device supports these statuses" in the pmbus crate and maybe codegen based on this so we don't spend time doing I2C we know is going to be NAK'd, but this could also be a good follow-up.

Yeah, we could either do this in the pmbus crate or in the codegen itself. Since we have a string saying which device it is, the codegen could have its own understanding of which registers to skip for certain devices. It probably should fall back to querying all registers for device strings it doesn't recognize, maybe along with a build warning.

I do worry slightly about a device firmware revision changing its behavior to something that doesn't match the codegen's understanding, but that feels pretty unlikely. For instance, an RAA229620A VRM is not going to suddenly grow fans if its firmware is updated!


Incidentally, there is another option, which is the PMBus QUERY command (§11.13 PMBus Specification, Part II, Version 1.3.1). This allows you to ask a device whether it supports a particular PMBus command, such as STATUS_FANS_1_2 and get back a response that tells you whether the device supports it. I'm not sure how useful this is to us here, though --- if the goal is mostly to avoid I2C traffic for unsupported registers, using QUERY to ask "hey, do you have STATUS_FANS_1_2? okay, do you have STATUS_FANS_3_4?" results in just as much I2C traffic if the device doesn't have those registers, and more if it does. We could probably avoid that by QUERYing once and caching the result somewhere, but the current code is not structured in a way that gives us a place to stash that, so it would be a bit more complex.

At the end of the day, I just don't really think that's particularly worthwhile: since we know all the devices on the board at build time, we can just decide which registers to read when we do the code generation, and we don't have to dynamically detect capabilities. I think that QUERY is more intended for the use case of commodity motherboards which might end up talking to any arbitrary commodity PSUs, where the firmware doesn't know what PMBus commands are supported until the system boots up, and therefore must dynamically detect them. But, I felt like it was worth mentioning, if only to write down why we probably don't need to use it.

Comment thread task/validate-api/build.rs Outdated
Comment on lines +8 to +11
if let Err(e) = build_i2c::codegen(build_i2c::Disposition::Devices) {
println!("cargo::error=failed to generate I2C devices: {e}");
std::process::exit(1);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soooo, I realize that in my VPD PR, I added all the code generation to task-validate-api, but I'm not sure if we need to do that here. In the VPD branch (which I need to resurrect), it was necessary to add the VPD-reading code to task-validate-api's codegen, because validate generates the list of DeviceDescriptions, and I wanted to add device VPD to the component_details MGS API, which is driven by the list of DeviceDescriptions from task-validate-api.

Since we're looking up devices for reading PMBus status using a new lookup table of rail names to devices, I think there isn't actually any reason that that needs to go in task-validate-api. I think it might be nicer to just put that in control-plane-agent's build script, or make it a Disposition for build_i2c::codegen or something. That way, the validate API crate need not actually inflict all the I2C codegen on every task that depends on it. What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that sounds reasonable! I put it in task-validate-api because I was copying your VPD PR, but I admit to wondering "why does it need to be here/like this?".

I'll move it over to control plane agent, or as a new codegen path.

Comment thread task/validate-api/build.rs Outdated
Comment on lines +135 to +138
// TODO: do we need fixed-length bstrings? If we want to binary search, we probably want them to
// be truncated to some maximum length, so we either need to define the max length at the
// protocol level so mgs knows how long of a string to send us, or we could instead trim the
// trailing nulls and search by that instead.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suspect the lookup code is going to end up looking something like the code in control-plane-agent for looking up a component by ID:

impl TryFrom<&'_ SpComponent> for Index {
type Error = SpError;
fn try_from(component: &'_ SpComponent) -> Result<Self, Self::Error> {
if let Ok(entry_idx) = task_validate_api::DEVICE_INDICES_BY_SORTED_ID
.binary_search_by_key(&component.id, |&(id, _)| id)
{
let &(_, index) = task_validate_api::DEVICE_INDICES_BY_SORTED_ID
.get(entry_idx)
.unwrap_lite();
return Ok(Self::ValidateDevice(index));
}
for (i, d) in OUR_DEVICES.iter().enumerate() {
if *component == d.component {
return Ok(Self::OurDevice(i));
}
}
Err(SpError::RequestUnsupportedForComponent)
}
}

as for maximum length, i think that the wire message will certainly have to have one, a la SpComponent in gateway-messages: https://github.com/oxidecomputer/management-gateway-service/blob/6c0aca2545a73fd75536e149d29faa7108be5862/gateway-messages/src/lib.rs#L284-L294

I really don't love the null-padding, but even if we were to adopt an encoding that let us not do that part, we would still need to have some max length somewhere so we can limit the buffer size...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think the distinction that I wanted to make (or at least still need to shake out) is that just because it needs to be zero-padded on the wire, that doesn't necessarily mean that internally we ALSO need to zero-pad it.

When receiving the command over the wire, we could first trim any trailing nulls, and then use the slice locally instead. That being said, if I bump into anywhere else where we need fixed length arrays, I'll probably revert back to zero-padding in the codegen.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think avoiding zero-padding here is probably nicer if you can manage it --- it should save us a bit of flash in the lookup table even if it doesn't save us the stack space when we're passing around the wire message.

@jamesmunns jamesmunns force-pushed the james/pmbus-please branch from c0943ba to c806bcb Compare May 28, 2026 13:11
Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good --- I'm looking forward to seeing the control-plane-agent code that actually consumes it. I had some smallish suggestions on the codegen.

Comment on lines +306 to +309
/// This function returns an error if the initial attempt to obtain `STATUS_WORD` from the
/// device fails, otherwise returnining successfully even if "leaf" status bytes were unable
/// to be read, either due to ephemeral hiccups, or that status byte being unsupported by
/// the device queried.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely convinced this behavior for STATUS_WORD is correct. If you look at the bits in STATUS_WORD, you'll see that it's basically a summary of which other registers have bits set --- there's a bit in STATUS_WORD that is set if any fault bit in STATUS_IOUT, one for faults in STATUS_VOUT, and so on. So, we actually can reasonably interpret all the other registers even if we're missing STATUS_WORD.

That said, I think it's also reasonable-ish to fail fast if we get an error that indicates there is no device to talk to at all; if we get a NAK that indicates its not present, we don't need to spend more time doing all the other I2C reads. I dunno.

}
writeln!(file, "];")?;

assert_eq!(pmbus_rail_dupes, 0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: it would be nice if this had a panic message that said something to the effect of "your config is invalid"...also, it's maybe worth checking whether panicking or returning an anyhow::Error here results in a nicer UX for error reporting in build scripts?

Comment on lines +68 to +69
// pmbus devices must have a refdes
assert!(dev.device_id.is_some());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually care about this here?

// `BTreeMap::insert().is_some()`!
if !pmbus_rail_names.insert(rail.name.clone()) {
pmbus_rail_dupes += 1;
println!("cargo::error=Duplicate Rail name: {:?}", rail.name);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this error message is what the programmer will see if they have made an error in the app.toml. i think we can probably improve the user experience here a bit, at least by making the following changes:

  • downcasing the first letter to match other cargo error messages
  • including the name of the PMBus device that has the duplicate rail.
  • it would probably be nice if we could also say the refdes of the other device that has that rail, so we could say "both U420 and U69 define a power rail called V69P7_TURBOENCABULATOR_A2, your config file is invalid", but that would require a bigger change to also track which device a rail name in the map came from...
Suggested change
println!("cargo::error=Duplicate Rail name: {:?}", rail.name);
println!("cargo::error=PMBus device {} defines a power rail {:?} which already exists in the manifest", dev.device_id, rail.name);

or something like that?

// Create a mapping between rail names and generated accessor functions for obtaining
// the device handle and rail index
writeln!(file)?;
writeln!(file, "pub const PMBUS_RAIL_TO_I2C_DEVICE_MAP: [(&[u8], fn(userlib::TaskId) -> (drv_i2c_api::I2cDevice, u8)); {}] = [", pmbus_rail_names.len())?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, take it or leave it: part of me wants to suggest that this be a struct rather than a tuple, in the hopes that the non-generated code consuming the LUT is a bit more readable. but i'm not sure if this is worth the ffort or not. up to you.

Comment on lines +88 to +89
// build_i2c *also* only to-lowercases the rail names to make functions
write!(file, "crate::i2c_config::pmbus::{}", rail.to_lowercase())?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually, if we wanted to be fancy, we could probably change up build-i2c so that it actually provides the name of the i2c_config::pmbus function for looking up that device, instead of relying on that code not changing. but also, that code is not going to change without breaking a bunch of other stuff, so 🤷‍♀️

could be worth adding to the backlog of "non-important non-blockers"...

Comment on lines +51 to +55
// Generate the necessary rail names
if let Err(e) = build_i2c::codegen(build_i2c::Disposition::Devices) {
println!("cargo::error=failed to generate I2C devices: {e}");
std::process::exit(1);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicky: i might put this part in main rather than in the pmbus-specific codegen? we will want it if other stuff also starts wanting i2c config in control-plane-agent...

Comment on lines +57 to +62
let devices = build_i2c::device_descriptions()
.collect::<Vec<_>>();

let mut pmbus_rail_names = std::collections::BTreeSet::new();
let mut pmbus_rail_dupes = 0;
for dev in devices {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since all we do with devices is iterate over it, and build_i2c::device_descriptions() returns an iterator, i think the collect is unnecessary:

Suggested change
let devices = build_i2c::device_descriptions()
.collect::<Vec<_>>();
let mut pmbus_rail_names = std::collections::BTreeSet::new();
let mut pmbus_rail_dupes = 0;
for dev in devices {
let devices = build_i2c::device_descriptions();
let mut pmbus_rail_names = std::collections::BTreeSet::new();
let mut pmbus_rail_dupes = 0;
for dev in devices {

or even just:

Suggested change
let devices = build_i2c::device_descriptions()
.collect::<Vec<_>>();
let mut pmbus_rail_names = std::collections::BTreeSet::new();
let mut pmbus_rail_dupes = 0;
for dev in devices {
let mut pmbus_rail_names = std::collections::BTreeSet::new();
let mut pmbus_rail_dupes = 0;
for dev in build_i2c::device_descriptions() {

i doubt the build time hit for iterating twice and allocating here is all that noticeable, but...every nanosecond counts, i guess?

fn do_pmbus() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
let out_dir = std::env::var("OUT_DIR")?;
let dest_path =
std::path::Path::new(&out_dir).join("pmbus_mapping.rs");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unimportant turbo-nit: std::path::PathBuf is imported at the top of the file, so perhaps we should just change that import to use std::path::{PathBuf, Path} and change this to

Suggested change
std::path::Path::new(&out_dir).join("pmbus_mapping.rs");
Path::new(&out_dir).join("pmbus_mapping.rs");

Comment thread build/i2c/src/lib.rs Outdated

for (rail, (device, index)) in &all {
// if we update this code to be more clever than just to-lowercase'ing the rail names,
// you might need to go update the mapping in `validate-api`!
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is now in the control-plane-agent build script...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants