
[fm] resource deletion: tombstones, version guard, tracker-based GC #10368

Open

mergeconflict wants to merge 2 commits into main from mergeconflict/delete-fm-created-resources

Conversation

@mergeconflict
Contributor

This PR lets us delete resources that were created by executing a FM sitrep. This turns out to be tricky, so there are a few parts to it. First, two new bits of data we persist in CRDB:

  • We introduce a new singleton fm_rendezvous_progress table that identifies the latest sitrep version that was successfully, completely executed.
  • We also introduce a time_deleted tombstone on alerts and support bundles. When an alert or bundle that was created by FM is deleted (out of band, say, by an operator), we tombstone it rather than hard-deleting.

We use these new tools together as follows:

  • First, the progress tracker, used as a guard in SitrepVersionGuardedInsert, prevents us from attempting to insert resources requested by stale sitreps that have already been executed.
  • Next, the tombstone, along with our use of ON CONFLICT DO NOTHING in the create transaction, prevents us from inadvertently resurrecting a deleted resource.
  • Finally, the two work together: we know that we can hard-delete a tombstoned resource when no sitrep version greater than or equal to the progress tracker contains an AlertRequest or SupportBundleRequest for that resource.

I thought at first that these two mechanisms must be redundant, but they aren't. Without the progress tracker, it would never be safe to hard-delete a tombstoned resource, because any Nexus instance can have an arbitrarily stale sitrep in memory. And without the tombstones, newer sitrep versions that have copied forward a previously executed AlertRequest or SupportBundleRequest would resurrect deleted resources.
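
To make the hard-delete rule concrete, here's a rough SQL sketch of the alert sweep (the request table and several names here are assumed for illustration, not the PR's actual schema):

    -- Illustrative only: hard-delete tombstoned alerts that no sufficiently
    -- new sitrep still requests.
    DELETE FROM alert
    WHERE alert.time_deleted IS NOT NULL
      AND NOT EXISTS (
        SELECT 1
        FROM fm_sitrep_history sitrep
        JOIN fm_sitrep_alert_request req ON req.sitrep_id = sitrep.id
        WHERE req.alert_id = alert.id
          AND sitrep.version >= (
            SELECT latest_processed_sitrep_version FROM fm_rendezvous_progress
          )
      );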

Closes #10248.

@mergeconflict mergeconflict self-assigned this May 4, 2026
@mergeconflict mergeconflict added the fault-management Everything related to the fault-management initiative (RFD480 and others) label May 4, 2026
Comment on lines +175 to +176
conn.batch_execute_async(ALLOW_FULL_TABLE_SCAN_SQL).await?;

Collaborator

Production code really should not be doing full table scans - this is a lever we've permitted for tests, but it's something we really need to avoid in queries a customer might issue.

It violates the principle from RFD 192:

Database queries should be scalable, by which we mean the latency should be pretty consistent even as the database grows.

Comment on lines +146 to +147
// without one. Trouble is, it looks for all tombstoned rows on the target
// table, which -- no matter what tricks I tried with indexes and shuffling
Collaborator

Yeah, this is a problem! I don't think we should merge this as-is.

This should be doable if we have sufficient indices on the target table - only scanning through the "deleted rows", or a fraction of the deleted rows (e.g., ordered by time), with appropriate indices SHOULD NOT require a full-table scan.

We have examples of several tests which use explain_async, from nexus/db-queries/src/db/explain.rs, to EXPLAIN the query, and validate that it does not contain FULL SCAN. For example: explain_rx_list_resendable_events - that's a pretty quick feedback loop for fixing/testing this.
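
Something like this hedged sketch, adapted from that pattern (the query constructor and connection setup here are placeholders, not code from this PR):

    // Assumes the ExplainableAsync trait from nexus/db-queries/src/db/explain.rs;
    // `tombstone_sweep_query()` and `conn` are hypothetical stand-ins.
    use nexus_db_queries::db::explain::ExplainableAsync;

    let explanation = tombstone_sweep_query()
        .explain_async(&conn)
        .await
        .expect("EXPLAIN should succeed");
    assert!(
        !explanation.contains("FULL SCAN"),
        "tombstone sweep must not require a full table scan:\n{explanation}"
    );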

Comment thread schema/crdb/dbinit.sql
);

-- Singleton tracker recording the highest sitrep version that the FM rendezvous
-- background task has fully processed on any Nexus. This has two related uses:
Collaborator

Just to confirm, this means that if any of the rendezvous/execution tasks in a sitrep encounters an error, we'll be unable to mark it as "fully processed"? This seems like a pretty significant downside

Comment on lines +234 to +236
/// A cute byproduct of this design is that passing a version of `0` allows
/// callers to read the current tracker value without modifying it. This is
/// used in some tests.
Collaborator

Nit: I kinda think this is a smell - it's requiring modify access to the db, and it couuuuuld be increasing contention depending on when CRDB is locking?

I agree that it's "probably fine" for tests but I would not recommend doing this if anything outside tests wanted to read this value
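
If something outside tests ever does need the value, a plain read keeps it off the write path entirely, e.g.:

    -- Hedged sketch: read the tracker without going through the
    -- version-advance write path (column name as referenced elsewhere in
    -- this PR).
    SELECT latest_processed_sitrep_version
    FROM fm_rendezvous_progress;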

Comment on lines +123 to +127
// requested alerts or support bundles would potentially be missed. The
// consequence is that, if there is a persistent failure here and the
// tracker never advances, the tombstone sweeps below may never be able
// to clean up resources.
let any_errors = !alerts.details.errors.is_empty()
Collaborator

I believe that this rendezvous task will grow significantly in complexity over time.

I understand the inclination to have "one generation representing the full completion of a sitrep", but I think the qualifier for "has a sitrep successfully executed" will get messier and more complex. There may be chunks of it, on an otherwise functioning system, that don't complete successfully for a long time!

In #10248, when I described generation numbers, I mentioned the following:

When updating a database row - like "fm_alert" - create an auxiliary table which is called something like "fm_alert_metadata". This table should contain the generation number of the sitrep (or an equivalent "only increasing" generation for the set of alerts emitted by that sitrep).

I understand that this PR is trying to do something similar, with a bit of a shortcut: instead of one-per-table generation numbers, it's trying to mash "everything in the sitrep" together and represent that via fm_rendezvous_progress. But I want to offer a warning -- doing so introduces a fair bit of unnecessary coupling.

Here's an example: a customer has wired up the notification system for webhooks, telling them when something has gone wrong on the rack. If this is configured incorrectly - e.g., hits an endpoint we can't reach - the error means we'll be unable to do any garbage collection across any rows, because latest_processed_sitrep_version will be stuck and unable to move.
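
For illustration, the per-table shape from that quote might look roughly like this (table and column names are assumptions, not a concrete proposal):

    -- Hedged sketch of the #10248 idea: per-row metadata recording which
    -- sitrep version last requested each alert, so GC decisions don't hinge
    -- on one global "everything in the sitrep completed" version.
    CREATE TABLE omicron.public.fm_alert_metadata (
        alert_id UUID NOT NULL PRIMARY KEY,
        -- Version of the most recent sitrep that requested this alert.
        sitrep_version INT8 NOT NULL
    );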

Collaborator

Also, because this table is generic, it seems really important to me that it fully encapsulates every step the sitrep rendezvous is doing. I think it's "technically correct" today, but adding a new step and forgetting to check for !new_step.details.errors.is_empty() would be a big problem - especially because the latest_processed_sitrep_version is generic enough to imply "everything completed".

Comment thread schema/crdb/dbinit.sql
Comment on lines +7763 to +7764
-- enforced as such: the referenced row may be GC'd out of history
-- (#9384) while this tracker is still meaningful, and the bootstrap
Collaborator

"is still meaningful" in the case of GC'-ing the sitrep_history row has a bunch of asterisks here

If we remove the row from fm_sitrep_history, any tombstoned rows introduced in that sitrep would be hard-deleted immediately, because the NOT(EXISTS(...)) call in the sweep would be empty

Comment thread schema/crdb/dbinit.sql
Comment on lines +3161 to +3166
-- Tombstone timestamp: set when a bundle is soft-deleted. We currently
-- soft-delete only support bundles that were requested by fault management
-- (that is, when fm_case_id is set) so that they aren't accidentally
-- resurrected after deletion. Bundles created by operators (without
-- fm_case_id) can be hard-deleted.
time_deleted TIMESTAMPTZ
Collaborator

Without this column, the old code assumed that "if a support_bundle row exists, it is not deleted". This change breaks that contract - and will require all "readers of support bundles" to check the new time_deleted column.

I found (at least) the following spots where we are reading support_bundles, but not using this field, which could result in "using support_bundles that should have been deleted":

  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_list
  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_list_assigned_to_nexus
  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_fail_expunged (when reading bundles_with_bad_nexuses)
  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_update
  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_update_user_comment
  • nexus/db-lookup/src/lookup.rs: Uses lookup_resource!, but says soft_deletes = false for Support Bundles, which is wrong.
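
Each of those call sites will need something along these lines (hedged sketch; the module paths are assumptions):

    // Not code from this PR: every support-bundle read path now has to
    // exclude tombstoned rows explicitly.
    use diesel::prelude::*;
    use nexus_db_model::SupportBundle;
    use nexus_db_schema::schema::support_bundle::dsl;

    let query = dsl::support_bundle
        // ...whatever filters the call site already applies, plus:
        .filter(dsl::time_deleted.is_null())
        .select(SupportBundle::as_select());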

|| !marking.details.errors.is_empty();
if any_subtask_errors {
println!(
"(i) tracker advance skipped: subtasks recorded \
Collaborator

I mentioned this below, but I find the term "tracker" vague, and hard to parse here without knowing the precise context of this PR: Is it tracking the most recent sitrep? The sitrep we tried to execute? The sitrep which fully executed? That nuance matters, and "tracker" omits it

.map_err(|e| {
public_error_from_diesel(e, ErrorHandler::Server)
})?;
}
Collaborator

WDYT about making this deletion instead do:

  • Read the row (return Ok() if it doesn't exist, or has been soft-deleted)
  • Decide if it's FM managed
  • Either UPDATE or DELETE, but not both?

I've tried to read through this current implementation as it stands, and I think it's correct (as long as case_id is immutable), but it just seemed odd to me to have the "two mutations" back-to-back. But maybe I'm over-indexing on this.
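
Roughly what I had in mind (hand-wavy sketch; the helper functions are placeholders, not things that exist in this PR):

    let Some(bundle) = fetch_support_bundle(&conn, id).await? else {
        return Ok(()); // never existed, nothing to do
    };
    if bundle.time_deleted.is_some() {
        return Ok(()); // already soft-deleted, nothing to do
    }
    if bundle.fm_case_id.is_some() {
        // FM-managed: tombstone it so a later sitrep can't resurrect it.
        soft_delete_support_bundle(&conn, id).await?;
    } else {
        // Operator-created: safe to hard-delete outright.
        hard_delete_support_bundle(&conn, id).await?;
    }
    Ok(())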

latest_processed_sitrep_version,
};

// Sweep eligible tombstoned alerts and support bundles.
Collaborator

probably worth acknowledging somewhere: successful sweeping of tombstones is not actually necessary to advance latest_processed_sitrep_version, even though it does happen as a part of rendezvous?

Member

@hawkw hawkw left a comment

Overall, I think @smklein's review already hit the most important issues:

  1. We should be paginating the deletion queries to avoid full table scans
  2. We should probably be tracking execution progress on the basis of individual resource types (alerts, support bundles, etc), rather than across all tables that FM rendezvous might ever touch
  3. I think we should definitely be using the CAST sentinel for failing the CTE, rather than the divide by zero sentinel, so that we can have some more confidence that we are not failing the CTE because we actually divided by zero or something.

Besides that, I left a handful of smaller suggestions, but I think Sean got all the most important ones.
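
To illustrate item 3, here's a hedged sketch of the two sentinel styles (not the PR's actual CTE; sitrep_version_is_current stands in for whatever the real guard condition is):

    -- Divide-by-zero sentinel: a guard failure surfaces as "division by zero",
    -- indistinguishable from an actual division by zero elsewhere in the query.
    SELECT 1 / (CASE WHEN sitrep_version_is_current THEN 1 ELSE 0 END);

    -- CAST sentinel: a guard failure surfaces as something like
    --   could not parse "sitrep-version-stale" as type bool
    -- which the error-matching code can recognize with much more confidence.
    SELECT CAST(
        CASE WHEN sitrep_version_is_current THEN 'TRUE' ELSE 'sitrep-version-stale' END
        AS BOOL
    );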

Comment on lines +3742 to +3749
Some(id) => {
println!(" current sitrep: {id}");
match latest_processed_sitrep_version {
Some(v) => {
// We show this line when `fm_rendezvous_progress_advance`
// succeeded, meaning that either the sitrep successfully
// executed, or it was stale and the work was skipped.
println!(" latest fully-processed sitrep version: {v}",)
Member

small note here: i notice that here we are printing the current sitrep's ID, and the version of the last fully executed sitrep...this is a bit unhelpful, as the reader cannot tell whether or not those refer to the same sitrep without having to run more commands to figure out which sitrep the latest executed version is. what do you think about maybe changing the FmRendezvousStatus type to include both the current sitrep's ID and version, so that we can print something like:

    current sitrep: f5391283-aa51-4abc-bee3-b4181a056ebc
      version: 69
      latest fully-processed sitrep version: 42

const REQUESTED_THIS_SITREP: &str = " requested in this sitrep:";
const CREATED: &str = " created in this activation:";
const ALREADY_CREATED: &str = " already created:";
const STALE_SITREP: &str = " skipped (stale sitrep):";
Member

i'm not totally sure what "skipped (stale sitrep)" means --- is this referring to the count of requested entities which were not created because they were already deleted in a future sitrep? it would be nice if this was a bit clearer about its meaning...

Comment on lines +54 to 57
/// Only meaningful when `case_id` is `Some` (FM-created alerts).
/// Non-FM alerts continue to hard-delete and never have this set.
pub time_deleted: Option<DateTime<Utc>>,
}
Member

I'm not convinced that this is true (or necessary?), since we don't really have any other code paths that delete alerts at this time. We might end up also having those code paths soft-delete alert records as well. I don't think it's necessarily worth designing for both soft- and hard-deletes depending on where the alert came from at this point?

&self,
opctx: &OpContext,
alert: Alert,
fm_sitrep_version: Option<i64>,
Member

nit: I think this should probably be a Generation rather than an i64 --- as an i64, a caller could pass 0 or -69420, neither of which is a valid sitrep version?

opctx: &OpContext,
id: AlertUuid,
) -> Result<(), Error> {
let conn = self.pool_connection_authorized(opctx).await?;
Member

this is probably my fault for not doing an authz check in alert_create 😬

It's worth noting that alerts are authz resources:

    authz_resource! {
        name = "Alert",
        parent = "Fleet",
        primary_key = { uuid_kind = AlertKind },
        roles_allowed = false,
        polar_snippet = FleetChild,
    }

so we should be able to do authz checks for them. for instance, alert_delete() should probably take an authz::Alert rather than an AlertUuid...
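
e.g., something like this (sketch only; the body past the authz check is elided):

    // Take the authz resource instead of a bare ID, and check permissions
    // before touching the database.
    pub async fn alert_delete(
        &self,
        opctx: &OpContext,
        authz_alert: &authz::Alert,
    ) -> Result<(), Error> {
        opctx.authorize(authz::Action::Delete, authz_alert).await?;
        let conn = self.pool_connection_authorized(opctx).await?;
        // ...existing tombstone / hard-delete logic, using `conn` and
        // `authz_alert.id()`...
        Ok(())
    }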
