
[fm] resource deletion: tombstones, version guard, tracker-based GC #10368

Open

mergeconflict wants to merge 2 commits into main from mergeconflict/delete-fm-created-resources

Conversation

@mergeconflict
Contributor

This PR lets us delete resources that were created by executing a FM sitrep. This turns out to be tricky, so there are a few parts to it. First, two new bits of data we persist in CRDB:

  • We introduce a new singleton fm_rendezvous_progress table that identifies the latest sitrep version that was successfully, completely executed.
  • We also introduce a time_deleted tombstone on alerts and support bundles. When an alert or bundle that was created by FM is deleted (out of band, say, by an operator), we tombstone it rather than hard-deleting.

We use these new tools together as follows:

  • First, the progress tracker, used as a guard in SitrepVersionGuardedInsert, prevents us from attempting to insert resources requested by stale sitreps that have already been executed.
  • Next, the tombstone, along with our use of ON CONFLICT DO NOTHING in the create transaction, prevents us from inadvertently resurrecting a deleted resource.
  • Finally, the two work together: we know that we can hard-delete a tombstoned resource when no sitrep version greater than or equal to the progress tracker contains an AlertRequest or SupportBundleRequest for that resource.

I thought at first that these two mechanisms must be redundant, but they aren't. Without the progress tracker, it would never be safe to hard-delete a tombstoned resource, because any Nexus instance can have an arbitrarily stale sitrep in memory. And without the tombstones, newer sitrep versions that have copied forward a previously executed AlertRequest or SupportBundleRequest would resurrect deleted resources.
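
To make the hard-delete rule concrete, here's a rough SQL sketch of the alert sweep (the request table and several names here are assumed for illustration, not the PR's actual schema):

    -- Illustrative only: hard-delete tombstoned alerts that no sufficiently
    -- new sitrep still requests.
    DELETE FROM alert
    WHERE alert.time_deleted IS NOT NULL
      AND NOT EXISTS (
        SELECT 1
        FROM fm_sitrep_history sitrep
        JOIN fm_sitrep_alert_request req ON req.sitrep_id = sitrep.id
        WHERE req.alert_id = alert.id
          AND sitrep.version >= (
            SELECT latest_processed_sitrep_version FROM fm_rendezvous_progress
          )
      );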

Closes #10248.

@mergeconflict mergeconflict self-assigned this May 4, 2026
@mergeconflict mergeconflict added the fault-management Everything related to the fault-management initiative (RFD480 and others) label May 4, 2026
Comment on lines +175 to +176
conn.batch_execute_async(ALLOW_FULL_TABLE_SCAN_SQL).await?;

Collaborator

Production code really should not be doing full table scans - this is a lever we've permitted for tests, but it's something we really need to avoid in queries a customer might issue.

It violates the principle from RFD 192:

Database queries should be scalable, by which we mean the latency should be pretty consistent even as the database grows.

Comment on lines +146 to +147
// without one. Trouble is, it looks for all tombstoned rows on the target
// table, which -- no matter what tricks I tried with indexes and shuffling
Collaborator

Yeah, this is a problem! I don't think we should merge this as-is.

This should be doable if we have sufficient indices on the target table - only scanning through the "deleted rows", or a fraction of the deleted rows (e.g., ordered by time), with appropriate indices SHOULD NOT require a full-table scan.

We have examples of several tests which use explain_async, from nexus/db-queries/src/db/explain.rs, to EXPLAIN the query, and validate that it does not contain FULL SCAN. For example: explain_rx_list_resendable_events - that's a pretty quick feedback loop for fixing/testing this.
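
Something like this hedged sketch, adapted from that pattern (the query constructor and connection setup here are placeholders, not code from this PR):

    // Assumes the ExplainableAsync trait from nexus/db-queries/src/db/explain.rs;
    // `tombstone_sweep_query()` and `conn` are hypothetical stand-ins.
    use nexus_db_queries::db::explain::ExplainableAsync;

    let explanation = tombstone_sweep_query()
        .explain_async(&conn)
        .await
        .expect("EXPLAIN should succeed");
    assert!(
        !explanation.contains("FULL SCAN"),
        "tombstone sweep must not require a full table scan:\n{explanation}"
    );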

Comment thread schema/crdb/dbinit.sql
);

-- Singleton tracker recording the highest sitrep version that the FM rendezvous
-- background task has fully processed on any Nexus. This has two related uses:
Collaborator

Just to confirm, this means that if any of the rendezvous/execution tasks in a sitrep encounters an error, we'll be unable to mark it as "fully processed"? This seems like a pretty significant downside

Comment on lines +234 to +236
/// A cute byproduct of this design is that passing a version of `0` allows
/// callers to read the current tracker value without modifying it. This is
/// used in some tests.
Collaborator

Nit: I kinda think this is a smell - it's requiring modify access to the db, and it couuuuuld be increasing contention depending on when CRDB is locking?

I agree that it's "probably fine" for tests but I would not recommend doing this if anything outside tests wanted to read this value
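
If something outside tests ever does need the value, a plain read keeps it off the write path entirely, e.g.:

    -- Hedged sketch: read the tracker without going through the
    -- version-advance write path (column name as referenced elsewhere in
    -- this PR).
    SELECT latest_processed_sitrep_version
    FROM fm_rendezvous_progress;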

Comment on lines +123 to +127
// requested alerts or support bundles would potentially be missed. The
// consequence is that, if there is a persistent failure here and the
// tracker never advances, the tombstone sweeps below may never be able
// to clean up resources.
let any_errors = !alerts.details.errors.is_empty()
Collaborator

I believe that this rendezvous task will grow significantly in complexity over time.

I understand the inclination to have "one generation representing the full completion of a sitrep", but I think the qualifier for "has a sitrep successfully executed" will get messier and more complex. There may be chunks of it, on an otherwise functioning system, that don't complete successfully for a long time!

In #10248, when I described generation numbers, I mentioned the following:

When updating a database row - like "fm_alert" - create an auxiliary table which is called something like "fm_alert_metadata". This table should contain the generation number of the sitrep (or an equivalent "only increasing" generation for the set of alerts emitted by that sitrep).

I understand that this PR is trying to do something similar, with a bit of a shortcut: instead of one-per-table generation numbers, it's trying to mash "everything in the sitrep" together and represent that via fm_rendezvous_progress. But I want to offer a warning -- doing so introduces a fair bit of unnecessary coupling.

Here's an example: a customer has wired up the notification system for webhooks, telling them when something has gone wrong on the rack. If this is configured incorrectly - e.g., hits an endpoint we can't reach - the error means we'll be unable to do any garbage collection across any rows, because latest_processed_sitrep_version will be stuck and unable to move.
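
For illustration, the per-table shape from that quote might look roughly like this (table and column names are assumptions, not a concrete proposal):

    -- Hedged sketch of the #10248 idea: per-row metadata recording which
    -- sitrep version last requested each alert, so GC decisions don't hinge
    -- on one global "everything in the sitrep completed" version.
    CREATE TABLE omicron.public.fm_alert_metadata (
        alert_id UUID NOT NULL PRIMARY KEY,
        -- Version of the most recent sitrep that requested this alert.
        sitrep_version INT8 NOT NULL
    );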

Collaborator

Also, because this table is generic, it seems really important to me that it fully encapsulates every step the sitrep rendezvous is doing. I think it's "technically correct" today, but adding a new step and forgetting to check for !new_step.details.errors.is_empty() would be a big problem - especially because the latest_processed_sitrep_version is generic enough to imply "everything completed".

Comment thread schema/crdb/dbinit.sql
Comment on lines +7763 to +7764
-- enforced as such: the referenced row may be GC'd out of history
-- (#9384) while this tracker is still meaningful, and the bootstrap
Collaborator

"is still meaningful" in the case of GC'-ing the sitrep_history row has a bunch of asterisks here

If we remove the row from fm_sitrep_history, any tombstoned rows introduced in that sitrep would be hard-deleted immediately, because the NOT(EXISTS(...)) call in the sweep would be empty

Comment thread schema/crdb/dbinit.sql
Comment on lines +3161 to +3166
-- Tombstone timestamp: set when a bundle is soft-deleted. We currently
-- soft-delete only support bundles that were requested by fault management
-- (that is, when fm_case_id is set) so that they aren't accidentally
-- resurrected after deletion. Bundles created by operators (without
-- fm_case_id) can be hard-deleted.
time_deleted TIMESTAMPTZ
Collaborator

Without this column, the old code assumed that "if a support_bundle row exists, it is not deleted". This change breaks that contract - and will require all "readers of support bundles" to check the new time_deleted column.

I found (at least) the following spots where we are reading support_bundles, but not using this field, which could result in "using support_bundles that should have been deleted":

  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_list
  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_list_assigned_to_nexus
  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_fail_expunged (when reading bundles_with_bad_nexuses)
  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_update
  • nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_update_user_comment
  • nexus/db-lookup/src/lookup.rs: Uses lookup_resource!, but says soft_deletes = false for Support Bundles, which is wrong.
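
Each of those call sites will need something along these lines (hedged sketch; the module paths are assumptions):

    // Not code from this PR: every support-bundle read path now has to
    // exclude tombstoned rows explicitly.
    use diesel::prelude::*;
    use nexus_db_model::SupportBundle;
    use nexus_db_schema::schema::support_bundle::dsl;

    let query = dsl::support_bundle
        // ...whatever filters the call site already applies, plus:
        .filter(dsl::time_deleted.is_null())
        .select(SupportBundle::as_select());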

|| !marking.details.errors.is_empty();
if any_subtask_errors {
println!(
"(i) tracker advance skipped: subtasks recorded \
Collaborator

I mentioned this below, but I find the term "tracker" vague, and hard to parse here without knowing the precise context of this PR: Is it tracking the most recent sitrep? The sitrep we tried to execute? The sitrep which fully executed? That nuance matters, and "tracker" omits it

.map_err(|e| {
public_error_from_diesel(e, ErrorHandler::Server)
})?;
}
Collaborator

WDYT about making this deletion instead do:

  • Read the row (return Ok() if it doesn't exist, or has been soft-deleted)
  • Decide if it's FM managed
  • Either UPDATE or DELETE, but not both?

I've tried to read through this current implementation as it stands, and I think it's correct (as long as case_id is immutable), but it just seemed odd to me to have the "two mutations" back-to-back. But maybe I'm over-indexing on this.
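
Roughly what I had in mind (hand-wavy sketch; the helper functions are placeholders, not things that exist in this PR):

    let Some(bundle) = fetch_support_bundle(&conn, id).await? else {
        return Ok(()); // never existed, nothing to do
    };
    if bundle.time_deleted.is_some() {
        return Ok(()); // already soft-deleted, nothing to do
    }
    if bundle.fm_case_id.is_some() {
        // FM-managed: tombstone it so a later sitrep can't resurrect it.
        soft_delete_support_bundle(&conn, id).await?;
    } else {
        // Operator-created: safe to hard-delete outright.
        hard_delete_support_bundle(&conn, id).await?;
    }
    Ok(())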

latest_processed_sitrep_version,
};

// Sweep eligible tombstoned alerts and support bundles.
Collaborator

probably worth acknowledging somewhere: successful sweeping of tombstones is not actually necessary to advance latest_processed_sitrep_version, even though it does happen as a part of rendezvous?

Member

@hawkw hawkw left a comment

Overall, I think @smklein's review already hit the most important issues:

  1. We should be paginating the deletion queries to avoid full table scans
  2. We should probably be tracking execution progress on the basis of individual resource types (alerts, support bundles, etc), rather than across all tables that FM rendezvous might ever touch
  3. I think we should definitely be using the CAST sentinel for failing the CTE, rather than the divide by zero sentinel, so that we can have some more confidence that we are not failing the CTE because we actually divided by zero or something.

Besides that, I left a handful of smaller suggestions, but I think Sean got all the most important ones.
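
To illustrate item 3, here's a hedged sketch of the two sentinel styles (not the PR's actual CTE; sitrep_version_is_current stands in for whatever the real guard condition is):

    -- Divide-by-zero sentinel: a guard failure surfaces as "division by zero",
    -- indistinguishable from an actual division by zero elsewhere in the query.
    SELECT 1 / (CASE WHEN sitrep_version_is_current THEN 1 ELSE 0 END);

    -- CAST sentinel: a guard failure surfaces as something like
    --   could not parse "sitrep-version-stale" as type bool
    -- which the error-matching code can recognize with much more confidence.
    SELECT CAST(
        CASE WHEN sitrep_version_is_current THEN 'TRUE' ELSE 'sitrep-version-stale' END
        AS BOOL
    );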

Comment on lines +3742 to +3749
Some(id) => {
println!(" current sitrep: {id}");
match latest_processed_sitrep_version {
Some(v) => {
// We show this line when `fm_rendezvous_progress_advance`
// succeeded, meaning that either the sitrep successfully
// executed, or it was stale and the work was skipped.
println!(" latest fully-processed sitrep version: {v}",)
Member

small note here: i notice that here we are printing the current sitrep's ID, and the version of the last fully executed sitrep...this is a bit unhelpful, as the reader cannot tell whether or not those refer to the same sitrep without having to run more commands to figure out which sitrep the latest executed version is. what do you think about maybe changing the FmRendezvousStatus type to include both the current sitrep's ID and version, so that we can print something like:

    current sitrep: f5391283-aa51-4abc-bee3-b4181a056ebc
      version: 69
      latest fully-processed sitrep version: 42

const REQUESTED_THIS_SITREP: &str = " requested in this sitrep:";
const CREATED: &str = " created in this activation:";
const ALREADY_CREATED: &str = " already created:";
const STALE_SITREP: &str = " skipped (stale sitrep):";
Member

i'm not totally sure what "skipped (stale sitrep)" means --- is this referring to the count of requested entities which were not created because they were already deleted in a future sitrep? it would be nice if this was a bit clearer about its meaning...

Comment on lines +54 to 57
/// Only meaningful when `case_id` is `Some` (FM-created alerts).
/// Non-FM alerts continue to hard-delete and never have this set.
pub time_deleted: Option<DateTime<Utc>>,
}
Member

I'm not convinced that this is true (or necessary?), since we don't really have any other code paths that delete alerts at this time. We might end up also having those code paths soft-delete alert records as well. I don't think it's necessarily worth designing for both soft- and hard-deletes depending on where the alert came from at this point?

&self,
opctx: &OpContext,
alert: Alert,
fm_sitrep_version: Option<i64>,
Member

nit: I think this should probably be a Generation rather than an i64 --- as an i64, a caller could pass 0 or -69420, neither of which is a valid sitrep version?

opctx: &OpContext,
id: AlertUuid,
) -> Result<(), Error> {
let conn = self.pool_connection_authorized(opctx).await?;
Member

this is probably my fault for not doing an authz check in alert_create 😬

It's worth noting that alerts are authz resources:

    authz_resource! {
        name = "Alert",
        parent = "Fleet",
        primary_key = { uuid_kind = AlertKind },
        roles_allowed = false,
        polar_snippet = FleetChild,
    }

so we should be able to do authz checks for them. for instance, alert_delete() should probably take an authz::Alert rather than an AlertUuid...
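
e.g., something like this (sketch only; the body past the authz check is elided):

    // Take the authz resource instead of a bare ID, and check permissions
    // before touching the database.
    pub async fn alert_delete(
        &self,
        opctx: &OpContext,
        authz_alert: &authz::Alert,
    ) -> Result<(), Error> {
        opctx.authorize(authz::Action::Delete, authz_alert).await?;
        let conn = self.pool_connection_authorized(opctx).await?;
        // ...existing tombstone / hard-delete logic, using `conn` and
        // `authz_alert.id()`...
        Ok(())
    }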
