Background task: delete audit log entries older than retention period #10042
david-crespo merged 17 commits into main from
Conversation
```rust
.execute_async(&*datastore.pool_connection_for_tests().await.unwrap())
.await
.unwrap();
}
```
This is test-only code.
On dogfood, we have 200k rows and it takes up 89 MB. So let's say it's about 45 MB per 100k rows.
```rust
// Diesel's DeleteStatement doesn't support LIMIT, so we use raw
// SQL rather than the subquery workaround needed for the DSL.
diesel::sql_query(
    "DELETE FROM omicron.public.audit_log \
     WHERE time_completed IS NOT NULL \
     AND time_completed < $1 \
     ORDER BY time_completed, id \
     LIMIT $2",
)
.bind::<diesel::sql_types::Timestamptz, _>(cutoff)
.bind::<diesel::sql_types::BigInt, _>(i64::from(limit))
```
it would be nice if, when writing raw SQL queries, we used the various test utils expectorate helpers so that we could at least get a unit test that asserts that the query is valid SQL. see, for example:
omicron/nexus/db-queries/src/db/queries/regions_hard_delete.rs
Lines 81 to 100 in d3b6e92
I'm reluctant on this one — the integration tests cover that it's valid SQL better than a unit test since in the latter there's no guarantee the query is being used in the spot where it's supposed to be.
That's fair, I just kinda think these are nice to have, since the unit test fails much faster than the integration tests and the error is very obviously "you have a syntax error in your SQL". But it's not important to me.
```rust
/// The cutoff time used: completed entries older than this were eligible.
pub cutoff: DateTime<Utc>,
/// Configured max rows to delete in this activation.
pub max_delete_per_activation: u32,
```
i feel like this should be "max_deleted", but maybe that's just me?
Perhaps "max_to_delete" or "max_deletions" ? This just based on the description above. "max_deleted" sounds like something that happened in the past no?
Could even be something like batch_size, though that leaves open the possibility of the activation doing multiple batches.
> Perhaps "max_to_delete" or "max_deletions"? This just based on the description above. "max_deleted" sounds like something that happened in the past no?

I think if it's max_deleted_per_activation it makes sense --- I wasn't suggesting dropping the _per_activation part.

> Could even be something like batch_size, though that leaves open the possibility of the activation doing multiple batches.

Hm, I think that --- for the reason you mentioned --- I'd rather avoid batch_size; it's the maximum number of records to delete during the activation, not the size of the individual batch in which the rows are deleted.
Thanks for the thorough review! Here is what I did in e587600:
In f58acf9 I added more explicit checking for overflow when we subtract the days from now and more detailed assertions about which entries get cleaned up and which remain after each activation.
hawkw left a comment:
The new tests are much nicer, thanks! I had some admittedly pedantic complaints about the date overflow checking stuff. Sorry.
```rust
let cutoff = match cutoff {
    Some(c) => c,
    None => {
        let msg = format!(
            "retention_days {} overflows date arithmetic",
            self.retention_days,
        );
        slog::error!(&opctx.log, "{msg}");
        return AuditLogCleanupStatus {
            rows_deleted: 0,
            cutoff: Utc::now(),
            max_deleted_per_activation: self.max_deleted_per_activation,
            error: Some(msg),
        };
    }
};
```
Hm, I kind of wonder if what we would actually want to do here is to saturate to the longest representable retention period, instead of just giving up. But, this would lead to us deleting rows that are actually younger than the configured retention period, which also feels a bit bad. So, maybe it's correct to just give up, in the hopes that eventually the cutoff period becomes representable as time moves forwards.
```rust
TimeDelta::try_days(i64::from(self.retention_days.get()))
    .expect("retention_days fits in TimeDelta");
let cutoff = Utc::now() - retention_delta;
let cutoff = TimeDelta::try_days(i64::from(self.retention_days.get()))
```
so, I don't love that the conversion of self.retention_days to a TimeDelta is part of the same fallible conversion that we attempt every time the task activates. it seems like ideally, the calculation involving the fixed value from the config file failed once when we parsed the config and was treated as an invalid config file, and only the subtraction of the current date from the time delta was fallible every time the task activates. if self.retention_days is not representable as a TimeDelta of days, this will never succeed, which should be treated as an invalid config.
I see the retention period is now 30 days, so "short" from this original comment is a little out of date, but I wanted to mention that I think the longer retention is important for giving our customers the room to notice and fix audit log fetch automation they may have. Or even set it up in the first place after initial rack installation. Unless we're worried about the size of the audit log taking up too much space, I recommend extending that even further, such as 90 days, 180 days, or even a year, considering this isn't tunable.
Bumped to 90 after talking offline, and we'll plan to make this configurable.
```rust
let retention = TimeDelta::try_days(i64::from(retention_days.get()))
    .expect("retention_days must be representable as a TimeDelta");
```
nitpicky: it would be nice to include the invalid value in the panic message, something like:
```diff
-let retention = TimeDelta::try_days(i64::from(retention_days.get()))
-    .expect("retention_days must be representable as a TimeDelta");
+let Some(retention) = TimeDelta::try_days(i64::from(retention_days.get())) else {
+    panic!("invalid retention_days {retention_days} (must be representable as a TimeDelta)");
+};
```
or something like that --- it would maybe be nicer to also include the actual max value but i was too lazy to calculate what it is.
i also felt a bit sketched out about the i64::from and worried that we might be accidentally making it into a negative TimeDelta, but then realized this couldn't happen since `retention_days` is a u32, so...it's fine, but maybe a comment about that would be nice?
```rust
let msg = format!("audit log cleanup failed: {err:#}");
slog::error!(&opctx.log, "{msg}");
```
nitpicky, you're welcome to ignore this: i feel like audit log cleanup failed: is useful to have in the Nexus logs (even though they're also tagged with the name of the background task), but is maybe a bit less useful in the omdb db background-tasks status output, since...you asked for the status of the audit log cleanup task, you know that if there's an error, it means that was the task that failed, so it just kinda makes the line longer?
```diff
-let msg = format!("audit log cleanup failed: {err:#}");
-slog::error!(&opctx.log, "{msg}");
+let msg = format!("{err:#}");
+slog::error!(&opctx.log, "audit log cleanup failed: {msg}");
```
also, i think generally when we log errors in a slog log, we try to log them as structured "err" => ... fields and use the slog_error_chain crate?
Good tips, I did em. Regarding the error chain, I don't think it currently does anything because nothing in the error we get from the datastore function uses #[source] to nest other errors inside it. But it still seemed smart to use InlineErrorChain so that it automatically gets logged correctly in the future. slog::error! with the structured thing apparently already does this.
hawkw left a comment:
thanks for the changes to the retention period validation, i think this is basically good. i had a couple minor nits.
Closes #8818
The audit log table grows unbounded. Here we prevent that by deleting old entries. We expect customers who rely on the audit log to be fetching the log and putting the result in some external monitoring system on a very frequent interval, so the retention period could be quite short. Here I am proposing 7 days, but it could probably be 3. It's fine as long as we make clear that if you want to rely on it you have to fetch it and store it yourself.
Tunables

- `audit_log_cleanup.period_secs` = 600
- `audit_log_cleanup.retention_days` = 90
- `audit_log_cleanup.max_delete_per_activation` = 10000

Table size on colo
5M total, 385k in the past week. At 10000 per activation running every 10 minutes, it would take 83 hours to clean up the whole table. That seems ok to me. After that it will be much easier. 10k entries per ten minute period would be a lot.
Adding up the ranges, it's about 1336 MB, or about 26MB per 100k rows. On dogfood it's more like 45MB per 100k rows (see comment below). I would expect the latter to be more typical — on colo we have a ton of rows due to automation that does a ton of login requests with no silo or actor ID.