[fm] resource deletion: tombstones, version guard, tracker-based GC#10368
[fm] resource deletion: tombstones, version guard, tracker-based GC#10368mergeconflict wants to merge 2 commits intomainfrom
Conversation
| conn.batch_execute_async(ALLOW_FULL_TABLE_SCAN_SQL).await?; | ||
|
|
There was a problem hiding this comment.
Production code really should not be doing full table scans - this is a lever we've permitted for tests, but it's something we really need to avoid in queries a customer might issue.
It violates the principle from RFD 192:
Database queries should be scalable, by which we mean the latency should be pretty consistent even as the database grows.
| // without one. Trouble is, it looks for all tombstoned rows on the target | ||
| // table, which -- no matter what tricks I tried with indexes and shuffling |
There was a problem hiding this comment.
Yeah, this is a problem! I don't think we should merge this as-is.
This should be doable if we have sufficient indices on the target table - only scanning through the "deleted rows", or a fraction of the deleted rows (e.g., ordered by time), with appropriate indices SHOULD NOT require a full-table scan.
We have examples of several tests which use explain_async, from nexus/db-queries/src/db/explain.rs, to EXPLAIN the query, and validate that it does not contain FULL SCAN. For example: explain_rx_list_resendable_events - that's a pretty quick feedback loop for fixing/testing this.
| ); | ||
|
|
||
| -- Singleton tracker recording the highest sitrep version that the FM rendezvous | ||
| -- background task has fully processed on any Nexus. This has two related uses: |
There was a problem hiding this comment.
Just to confirm, this means that if any of the rendezvous/execution tasks in a sitrep encounter an error, we'll be unable to mark it as "fully processed"? this seems like a pretty significant downside
| /// A cute byproduct of this design is that passing a version of `0` allows | ||
| /// callers to read the current tracker value without modifying it. This is | ||
| /// used in some tests. |
There was a problem hiding this comment.
Nit: I kinda think this is a smell - it's requiring modify access to the db, and it couuuuuld be increasing contention depending on when CRDB is locking?
I agree that it's "probably fine" for tests but I would not recommend doing this if anything outside tests wanted to read this value
| // requested alerts or support bundles would potentially be missed. The | ||
| // consequence is that, if there is a persistent failure here and the | ||
| // tracker never advances, the tombstone sweeps below may never be able | ||
| // to clean up resources. | ||
| let any_errors = !alerts.details.errors.is_empty() |
There was a problem hiding this comment.
I believe that this rendezvous task will grow significantly in complexity over time.
I understand the inclination to have "one generation representing the full completion of a sitrep", but I think the qualifier for "has a sitrep successfully executed" will get messier and more complex. There may be chunks of it, on an otherwise functioning system, that don't complete successfully for a long time!
In #10248, when I described generation numbers, I mentioned the following:
When updating a database row - like "fm_alert" - create an auxiliary table which is called something like "fm_alert_metadata". This table should contain the generation number of the sitrep (or an equivalent "only increasing" generation for the set of alerts emitted by that sitrep).
I understand that this PR is trying to do something similar, with a bit of a shortcut: instead of one-per-table generation numbers, it's trying to mash "everything in the sitrep together" and represent that via fm_rendezvous_progress. But I want to provide warning -- doing so introduces a fair bit of unnecessary coupling.
Here's an example: a customer has wired up the notification system for webhooks, telling them when something has gone wrong on the rack. If this is configured incorrectly - e.g., hits an endpoint we can't reach - the error means we'll be unable to do any garbage collection across any rows, because latest_processed_sitrep_version will be stuck and unable to move.
There was a problem hiding this comment.
Also, because this table is generic, it seems really important to me that it fully encapsulates every step the sitrep rendezvous is doing. I think it's "technically correct" today, but adding a new step and forgetting to check for !new_step.details.errors.is_empty() would be a big problem - especially because the latest_processed_sitrep_version is generic enough to imply "everything completed".
| -- enforced as such: the referenced row may be GC'd out of history | ||
| -- (#9384) while this tracker is still meaningful, and the bootstrap |
There was a problem hiding this comment.
"is still meaningful" in the case of GC'-ing the sitrep_history row has a bunch of asterisks here
If we remove the row from fm_sitrep_history, any tombstoned rows introduced in that sitrep would be hard-deleted immediately, because the NOT(EXISTS(...)) call in the sweep would be empty
| -- Tombstone timestamp: set when a bundle is soft-deleted. We currently | ||
| -- soft-delete only support bundles that were requested by fault management | ||
| -- (that is, when fm_case_id is set) so that they aren't accidentally | ||
| -- resurrected after deletion. Bundles created by operators (without | ||
| -- fm_case_id) can be hard-deleted. | ||
| time_deleted TIMESTAMPTZ |
There was a problem hiding this comment.
Without this column, the old code assumed that "if a support_bundle row exists, it is not deleted". This change breaks that contract - and will require all "readers of support bundles" to check the new time_deleted column.
I found (at least) the following spots where we are reading support_bundles, but not using this field, which could result in "using support_bundles that should have been deleted":
- nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_list
- nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_list_assigned_to_nexus
- nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_fail_expunged (when reading
bundles_with_bad_nexuses) - nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_update
- nexus/db-queries/src/db/datastore/support_bundle.rs: support_bundle_update_user_comment
- nexus/db-lookup/src/lookup.rs: Uses
lookup_resource!, but sayssoft_deletes = falsefor Support Bundles, which is wrong.
| || !marking.details.errors.is_empty(); | ||
| if any_subtask_errors { | ||
| println!( | ||
| "(i) tracker advance skipped: subtasks recorded \ |
There was a problem hiding this comment.
I mentioned this below, but I find the term "tracker" vague, and hard to parse here without knowing the precise context of this PR: Is it tracking the most recent sitrep? The sitrep we tried to execute? The sitrep which fully executed? That nuance matters, and "tracker" omits it
| .map_err(|e| { | ||
| public_error_from_diesel(e, ErrorHandler::Server) | ||
| })?; | ||
| } |
There was a problem hiding this comment.
WDYT about making this deletion instead do:
- Read the row (return
Ok()if it doesn't exist, or has been soft-deleted) - Decide if it's FM managed
- Either UPDATE or DELETE, but not both?
I've tried to read through this current implementation as it stands, and I think it's correct (as long as case_id is immutable), but it just seemed odd to me to have the "two mutations" back-to-back. But maybe I'm over-indexing on this.
| latest_processed_sitrep_version, | ||
| }; | ||
|
|
||
| // Sweep eligible tombstoned alerts and support bundles. |
There was a problem hiding this comment.
probably worth acknowledging somewhere: successful sweeping of tombstones is not actually necessary to advance latest_processed_sitrep_version, even though it does happen as a part of rendezvous?
hawkw
left a comment
There was a problem hiding this comment.
Overall, I think @smklein's review already hit the most important issues:
- We should be paginating the deletion queries to avoid full table scans
- We should probably be tracking execution progress on the basis of individual resource types (alerts, support bundles, etc), rather than across all tables that FM rendezvous might ever touch
- I think we should definitely be using the CAST sentinel for failing the CTE, rather than the divide by zero sentinel, so that we can have some more confidence that we are not failing the CTE because we actually divided by zero or something.
Besides that, I left a handful of smaller suggestions, but I think Sean got all the most important ones.
| Some(id) => { | ||
| println!(" current sitrep: {id}"); | ||
| match latest_processed_sitrep_version { | ||
| Some(v) => { | ||
| // We show this line when `fm_rendezvous_progress_advance` | ||
| // succeeded, meaning that either the sitrep successfully | ||
| // executed, or it was stale and the work was skipped. | ||
| println!(" latest fully-processed sitrep version: {v}",) |
There was a problem hiding this comment.
small note here: i notice that here we are printing the current sitrep's ID, and the version of the last fully executed sitrep...this is a bit unhelpful, as the reader cannot tell whether or not those refer to the same sitrep without having to run more commands to figure out which sitrep the latest executed version is. what do you think about maybe changing the FmRendezvousStatus type to include both the current sitrep's ID and version, so that we can print something like:
current sitrep: f5391283-aa51-4abc-bee3-b4181a056ebc
version: 69
latest fully-processed sitrep version: 42
| const REQUESTED_THIS_SITREP: &str = " requested in this sitrep:"; | ||
| const CREATED: &str = " created in this activation:"; | ||
| const ALREADY_CREATED: &str = " already created:"; | ||
| const STALE_SITREP: &str = " skipped (stale sitrep):"; |
There was a problem hiding this comment.
i'm not totally sure what "skipped (stale sitrep)" means --- is this referring to the count of requested entities which were not created because they were already deleted in a future sitrep? it would be nice if this was a bit clearer about its meaning...
| /// Only meaningful when `case_id` is `Some` (FM-created alerts). | ||
| /// Non-FM alerts continue to hard-delete and never have this set. | ||
| pub time_deleted: Option<DateTime<Utc>>, | ||
| } |
There was a problem hiding this comment.
I'm not convinced that this is true (or necessary?), since we don't really have any other code paths that delete alerts at this time. We might end up also having those code paths soft-delete alert records as well. I don't think it's necessarily worth designing for both soft- and hard-deletes depending on where the alert came from at this point?
| &self, | ||
| opctx: &OpContext, | ||
| alert: Alert, | ||
| fm_sitrep_version: Option<i64>, |
There was a problem hiding this comment.
nit: I think this should probably be a Generation rather than an i64 --- this would allow the user to pass 0 or -69420, neither of which are valid sitrep versions?
| opctx: &OpContext, | ||
| id: AlertUuid, | ||
| ) -> Result<(), Error> { | ||
| let conn = self.pool_connection_authorized(opctx).await?; |
There was a problem hiding this comment.
this is probably my fault for not doing an authz check in alert_create 😬
It's worth noting that alerts are authz resources:
omicron/nexus/auth/src/authz/api_resources.rs
Lines 1795 to 1801 in 61a6d60
so we should be able to do authz checks for them. for instance,
alert_delete() should probably take an authz::Alert rather than an AlertUuid...
This PR lets us delete resources that were created by executing a FM sitrep. This turns out to be tricky, so there are a few parts to it. First, two new bits of data we persist in CRDB:
fm_rendezvous_progresstable that identifies the latest sitrep version that was successfully, completely executed.time_deletedtombstone on alerts and support bundles. When an alert or bundle that was created by FM is deleted (out of band, say, by an operator), we tombstone it rather than hard-deleting.We use these new tools together as follows:
SitrepVersionGuardedInsert, prevents us from attempting to insert resources in stale sitreps that were already previously executed.ON CONFLICT DO NOTHINGin the create transaction, prevents us from inadvertently resurrecting a deleted resource.AlertRequestorSupportBundleRequestfor that resource.I thought at first that these two mechanisms must be redundant, but they aren't. Without the progress tracker, it would never be safe to hard-delete a tombstoned resource, because any Nexus instance can have an arbitrarily stale sitrep in memory. And without the tombstones, newer sitrep versions that have copied forward a previously executed
AlertRequestorSupportBundleRequestwould resurrect deleted resources.Closes #10248.