Skip to content

[fm] Filter ereports during preparation by 'known-to-Nexus'#10349

Open
smklein wants to merge 18 commits intomainfrom
fm-known-ereports
Open

[fm] Filter ereports during preparation by 'known-to-Nexus'#10349
smklein wants to merge 18 commits intomainfrom
fm-known-ereports

Conversation

@smklein
Copy link
Copy Markdown
Collaborator

@smklein smklein commented Apr 30, 2026

Before this PR, the preparation phase of FM loaded "all unmarked ereports". Unfortunately, if this set contains any ereports not-known to Nexus (which is, today, all of them) this set will grow in an unbounded fashion. This is problematic, as the only good FM subsystem is a stable FM subsystem.

To mitigate: this PR adds a set of "known ereport classes", as a big static array. Once we begin parsing / using these ereport classes, we can (and should!) add to this list. Thus: we'll only pull in ereports we can actually use.

Although this leaves the FM machinery up and running (and continuing to scan) it should also mean that we can continue to work on FM incrementally: if we even add a single "ereport class" which we want to query, we should be able to start parsing them, marking them "seen", and we'll have appropriate backpressure on the loading side of things.

Additionally, this PR adds omdb db ereport classes, which will help us analyze the set of known/unknown ereports. This has output like the following (though my sample contents here are, admittedly, totally made up):

$ omdb db ereport classes
  KNOWN-TO-NEXUS CLASS                                TOTAL UNMARKED
  no             ereport.cpu.amd.l3_cache_correctable 2681  2681
  no             ereport.fan.over_temp                2674  2674
  no             ereport.psu.under_voltage            2323  2323
  no             ereport.dimm.ce_threshold            1947  1947
  no             ereport.nic.link_flap                1937  1937
  no             ereport.thermal.fan_stuck            1832  1832
  no             ereport.smbus.timeout                1729  1729
  no             ereport.data_loss.possible           1466  1466
  no             ereport.data_loss.certain            1439  1439
  yes            ereport.example.handled              50    0
  excluded       (NULL)                               12    12

  Known classes with no rows in the database:
    ereport.example.never_seen

Fixes #10348

@hawkw
Copy link
Copy Markdown
Member

hawkw commented Apr 30, 2026

Additionally, this PR adds omdb db ereport classes, which will help us analyze the set of known/unknown ereports. This has output like the following (though my sample contents here are, admittedly, totally made up):

neat!

@smklein smklein marked this pull request as ready for review April 30, 2026 22:02
@smklein smklein requested review from hawkw and mergeconflict and removed request for hawkw April 30, 2026 22:02
Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall i like this approach a lot! i had a bunch of annoying but ultimately unimportant nitpicks.

Comment thread dev-tools/omdb/src/bin/omdb/db/ereport.rs Outdated
Comment thread dev-tools/omdb/src/bin/omdb/db/ereport.rs Outdated
Comment thread dev-tools/omdb/src/bin/omdb/db/ereport.rs Outdated
Comment thread dev-tools/omdb/src/bin/omdb/db/ereport.rs Outdated
Comment thread dev-tools/omdb/tests/usage_errors.out Outdated
Comment thread nexus/db-queries/src/db/datastore/ereport.rs Outdated
Comment thread nexus/src/app/background/tasks/fm_analysis.rs Outdated
Comment thread nexus/src/app/background/tasks/fm_analysis.rs Outdated
Comment thread nexus/src/app/background/tasks/fm_analysis.rs Outdated
Comment thread nexus/types/src/fm/ereport.rs Outdated
/// a few hundred entries; past ~1000 entries, prefer either prefix matching
/// (`class LIKE 'ereport.cpu.amd.%'`) or a `known_ereport_class` lookup
/// table joined into the query. Revisit this if the list grows that large.
pub fn known_ereport_classes() -> &'static [&'static str] {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels a bit weird to me that this is in nexus_types::fm rather than in nexus_fm, which is where the diagnosis engine implementations will end up living? i would kind of expect this list to eventually be composed by concatenating lists of ereport classes from each DE module or something...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally put it here so that omdb could depend on it without adding a new dependency on nexus_fm

How strong is your preference to pull it out? Would you prefer omdb to have that direct dependency?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, hmm. well, i feel in the future this list is probably gonna end up coming from nexus-fm, unless we're happy just having one big list that any future change has to remember to update. i would really much rather break it out by DE, but 🤷‍♀️

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a thought: what if, rather than relying on omdb and Nexus having been built from the same commit, we made this a lockstep API that omdb could use to ask Nexus for the list of which ereports its DEs care about? That breaks the build-time dependency issue and also avoids lying when the omdb and Nexus versions are different. It's more work, though.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c5c53c2, with a fall-back to just showing "?" if we don't know.

It seems pretty straightforward to do this, and the benefit of "not getting omdb/nexus versions mixed up" seems worth it.

f.write_str(match self {
Self::Yes => "yes",
Self::No => "no",
Self::NullClass => "NullClass",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ew. i feel like this could just be

Suggested change
Self::NullClass => "NullClass",
Self::NullClass => "-",

or

Suggested change
Self::NullClass => "NullClass",
Self::NullClass => "n/a",

especially 'cause the column ought to be only 3 characters wide otherwise

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5c5d500

Comment on lines +538 to +539
// The variable-length `class` column goes last so that wrapping on a
// narrow terminal doesn't disrupt the fixed-width numeric columns.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like it ought to be a comment on the field...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, moved in 5c5d500

#[derive(Tabled)]
#[tabled(rename_all = "SCREAMING_SNAKE_CASE")]
struct ClassRow<'a> {
#[tabled(rename = "KNOWN-TO-OMDB")]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything else is SCREAMING_SNAKE_CASE rather than SCREAMING-KEBAB-CASE...and i might just say "KNOWN" to save characters honestly

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +587 to +588
"Note: KNOWN-TO-OMDB reflects which classes have a diagnosis engine \
in THIS omdb\nbinary; the currently-deployed Nexus may differ if it \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird phrasing here, the omdb binary does not technically have any diagnosis engines in it. it would be correct to say "have a diagnosis engine in Nexus as of the control plane build that produced this omdb binary" or something. i would also like it if this said what version this OMDB is, but i wont stress too hard about that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also normally i think "note:" is lowercase in other omdb commands

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5c5d500 , though I also added a lockstep API in c5c53c2

(hopefully this removes the need to print the omdb version?)

Comment on lines +748 to +749
"ereport.positronic-brain.example",
"ereport.spinning-blades.example",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it

.select(db::model::Ereport::as_select())
.load_async(&*conn)
.await
.expect("failed to list unseen ereports")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't end up running afoul of the full scan rule, does it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so? I'm pretty sure we'd be blowing up with the "FULL SCAN NOT ALLOWED" issue if it was

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, FYI, this is test-only)

Comment thread nexus/types/src/fm/ereport.rs Outdated
/// a few hundred entries; past ~1000 entries, prefer either prefix matching
/// (`class LIKE 'ereport.cpu.amd.%'`) or a `known_ereport_class` lookup
/// table joined into the query. Revisit this if the list grows that large.
pub fn known_ereport_classes() -> &'static [&'static str] {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, hmm. well, i feel in the future this list is probably gonna end up coming from nexus-fm, unless we're happy just having one big list that any future change has to remember to update. i would really much rather break it out by DE, but 🤷‍♀️

@mergeconflict
Copy link
Copy Markdown
Contributor

I'm thinking about this a bit... I wonder if, in the immediate term, we should just do the simplest thing possible to avoid blowing up the logs -- you mentioned "disable the sitrep planner altogether" in #10348, which I'm guessing would be a super easy PR to land and cherry-pick into 19.x if needed.

Then, that would buy us a bit of time to think about the allowlist/filter design a bit more. My initial thought on that is that, rather than having a static array of ereport types that we know how to handle, each diagnosis engine present in the system should provide the set of ereports that it knows how to consume. We'd take these sets and union them together (on startup I guess) to create the full allowlist. That places the burden of proof of "no unhandled ereports" locally with the thing that handles them, rather than being something off to the side that we'd have to remember to update. What do you think?

Having said that, this PR totally looks good to merge (modulo expectorate) if it's the direction we want to go. Thanks for jumping on this so quickly!

@hawkw
Copy link
Copy Markdown
Member

hawkw commented May 1, 2026

I'm thinking about this a bit... I wonder if, in the immediate term, we should just do the simplest thing possible to avoid blowing up the logs -- you mentioned "disable the sitrep planner altogether" in #10348, which I'm guessing would be a super easy PR to land and cherry-pick into 19.x if needed.

My preference is to move forwards with this change rather than disabling, because that creates a separate "un disabling" step we have to take once things are fixed, and I've become wary of using the config file in cases where it isn't actually necessary. And, if we disable the fm_analysis task entirely, we still have the issue that, if we turn it on once we add a DE that cares about a small set of ereports, we're still potentially loading a bunch of other ones we don't know what to do with. Adding the list of ereports that DEs care about is something we would probably end up doing anyway even if we did reach for the blunt instrument first.

Then, that would buy us a bit of time to think about the allowlist/filter design a bit more. My initial thought on that is that, rather than having a static array of ereport types that we know how to handle, each diagnosis engine present in the system should provide the set of ereports that it knows how to consume. We'd take these sets and union them together (on startup I guess) to create the full allowlist. That places the burden of proof of "no unhandled ereports" locally with the thing that handles them, rather than being something off to the side that we'd have to remember to update. What do you think?

This is certainly my preference. But I think it makes sense to first land the changes to the queries and such that allow them to take a list of desired ereports, and then, once we have more of a framework for implementing individual DEs, we would change how the list of desired ereports classes is constructed by looking at the enabled DEs. But, we would still use the same queries and so forth in that world, too; we'd just be changing where the list of classes comes from.

I would encourage us to think of the static array in this PR as a placeholder.

@smklein
Copy link
Copy Markdown
Collaborator Author

smklein commented May 1, 2026

Then, that would buy us a bit of time to think about the allowlist/filter design a bit more. My initial thought on that is that, rather than having a static array of ereport types that we know how to handle, each diagnosis engine present in the system should provide the set of ereports that it knows how to consume. We'd take these sets and union them together (on startup I guess) to create the full allowlist. That places the burden of proof of "no unhandled ereports" locally with the thing that handles them, rather than being something off to the side that we'd have to remember to update. What do you think?

So, I'm down with this direction - and I've moved the set of "known ereports" into nexus-fm -- but I kinda want to punt on something this clever until we actually have a few ereports that we've started processing. (Don't want to generalize too early here).

I've used https://docs.rs/linkme/latest/linkme/ before to achieve something similar

I would encourage us to think of the static array in this PR as a placeholder.

Agreed; though I think it's probably fine to use it for N < ~5

@smklein
Copy link
Copy Markdown
Collaborator Author

smklein commented May 1, 2026

Also, for completeness: this can't be cherry-picked because it contains a schema change.

#10360 is the preferred (temporary) solution, which I'll revert as a part of getting this PR (#10349) merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FMA subsystem loading unbounded ereports

3 participants