
Background task: complete never-completed audit log entries#10026

Open
david-crespo wants to merge 9 commits into main from audit-log-timeout-task

Conversation


@david-crespo david-crespo commented Mar 11, 2026

Closes #8817

The most important thing about an audit log is that every event you care about is in there. This PR adds a job to make sure requests that never complete (e.g., because Nexus crashed in the middle) appear in the audit log after some timeout.

The audit log gets written in two steps. At the beginning of a request we initialize the log entry row. If this fails, the request fails. This guarantees we never miss an initialization. Then, at the end of the request, we "complete" the row with the result of the operation and a time_completed. The audit log list API response is ordered by time_completed. This lets us guarantee that if you retrieve a chunk of the audit log that is fully in the past, it will never change — you never have to fetch that chunk of the log again. But this requires that entries do not appear in the log until they are completed.

Completion can fail for a variety of hopefully very rare reasons. It seemed unreasonably complicated to guarantee completion by rolling back whatever operation happened on a failed completion. Instead, the idea was that we would run a job that cleans up incomplete rows after a long enough amount of time that it's very unlikely the request will complete. When we clean up such rows, we don't know what actually happened to the request, so we have to give it a special result_kind of timeout (as opposed to success or error).

/// This doesn't mean the operation itself timed out (which would be an
/// error, and I don't think we even have API timeouts) but rather that the
/// attempts to complete the log entry failed (or were never even attempted
/// because, e.g., Nexus crashed during the operation), and this entry had
/// to be cleaned up later by a background job (which doesn't exist yet)
/// after a timeout. Note we represent this result status as "Unknown" in
/// the external API because timeout is an implementation detail and makes
/// it sound like the operation timed out.
Timeout,

In the API response this becomes AuditLogEntryResult::Unknown because I feel like "timeout" is an implementation detail and makes it sound like the operation itself timed out.

AuditLogResultKind::Timeout => {
audit::AuditLogEntryResult::Unknown
}

How often are log entries left incomplete?

Fortunately, this should be a very rare event — I checked on dogfood and there were around 200k rows in the table and there was exactly 1 incomplete disk_create call from October 2025. Need to check on the colo rack.

Config options

All three live in [background_tasks.audit_log_timeout_incomplete] in
the Nexus config (AuditLogTimeoutIncompleteConfig in nexus_config.rs):

  • period_secs — how often the background task activates (currently 600 = 10 min)
  • timeout_secs — how old an incomplete entry must be before it gets timed out (default 14400 = 4 hours)
  • max_update_per_activation — max rows updated per activation (currently 1000), following the naming convention from session_cleanup.max_delete_per_activation

On the timeout, I think 4 hours is pretty conservative. My worry is that we could have a legitimate API call that takes an hour or two because it's a post of a giant file or something. I don't know how reasonable that worry is. This timeout could probably go down to two hours or an hour.

On the interval, 10 minutes seems maybe overkill for something that happens once every 5 months? But it's just one query, basically.

Claude skill for adding background tasks

After #10009 I had Claude write the add-background-task skill and used it to make this one. It was pretty good, so I think it's probably worth checking in.

-- Supports "find stale incomplete rows ordered by time_started".
CREATE INDEX IF NOT EXISTS audit_log_incomplete_by_time_started
ON omicron.public.audit_log (time_started, id)
WHERE time_completed IS NULL;
david-crespo (author):

The robots in my employ pointed out that this index does have a cost — every single request that comes through causes this index to be updated because the audit log entry for it has time_completed null while the operation is in progress. That's probably worth it — otherwise I think the job would not work as written because it would have to do a table scan to find the incomplete rows.

entries (may indicate Nexus crashes or bugs)";
"cutoff" => %cutoff,
"limit" => self.max_update_per_activation,
);
david-crespo (author):

I figured this should warn! since things showing up here should be quite unusual.

-        diesel::update(audit_log::table)
+        let rows = diesel::update(audit_log::table)
             .filter(audit_log::id.eq(entry.id))
             .filter(audit_log::time_completed.is_null())
david-crespo (author):

This is really important, otherwise in the bizarro world situation where the request actually does finish after the row has been timed out, completing it would change its time_completed, violating the Rules of Audit Log. See the comment above.
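
The guard being discussed can be sketched in memory (a std-only illustration; the field and function names are made up here, and the real guard is the `time_completed.is_null()` filter on the diesel UPDATE in the quoted diff):

```rust
use std::time::SystemTime;

/// Minimal in-memory stand-in for an audit log row.
struct Entry {
    time_completed: Option<SystemTime>,
    result_kind: Option<&'static str>, // "success" | "error" | "timeout"
}

/// Complete an entry only if it is still incomplete. Returns whether the
/// row was updated, mirroring the "0 rows affected" outcome when the
/// timeout task already stamped a time_completed.
fn complete(entry: &mut Entry, result: &'static str) -> bool {
    if entry.time_completed.is_some() {
        return false; // already completed (possibly timed out); never overwrite
    }
    entry.time_completed = Some(SystemTime::now());
    entry.result_kind = Some(result);
    true
}

fn main() {
    let mut e = Entry { time_completed: None, result_kind: None };
    assert!(complete(&mut e, "timeout")); // timeout task gets there first
    assert!(!complete(&mut e, "success")); // late completion is a no-op
    assert_eq!(e.result_kind, Some("timeout")); // original result preserved
}
```

The point is that whichever writer completes the row first wins, and a late writer cannot change time_completed afterward.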

),
}
}

david-crespo (author):

This task isn't changed here but it was causing the tests to be flaky for me locally depending on whether I ran one test or the whole module.

.execute_async(&*self.pool_connection_authorized(opctx).await?)
.await
.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))
}
david-crespo (author):

This is the core of the task: for any audit log entries older than cutoff with null time_completed, mark those as timed out. In practice the ordering and limit should not matter because the number of timed out requests should be small.
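
The sweep described here can be sketched in memory (a std-only illustration; in reality this is a single diesel UPDATE against CockroachDB, and all names below are invented for the sketch):

```rust
use std::time::{Duration, SystemTime};

/// Minimal in-memory stand-in for an audit log row.
struct Entry {
    time_started: SystemTime,
    time_completed: Option<SystemTime>,
    result_kind: Option<&'static str>,
}

/// Mark entries started before `cutoff` that are still incomplete as timed
/// out, updating at most `limit` rows (max_update_per_activation).
fn timeout_stale(entries: &mut [Entry], cutoff: SystemTime, limit: usize) -> usize {
    let mut updated = 0;
    for e in entries.iter_mut() {
        if updated >= limit {
            break;
        }
        if e.time_completed.is_none() && e.time_started < cutoff {
            e.time_completed = Some(SystemTime::now());
            e.result_kind = Some("timeout");
            updated += 1;
        }
    }
    updated
}

fn main() {
    let now = SystemTime::now();
    let cutoff = now - Duration::from_secs(4 * 3600); // 4-hour timeout
    let mut entries = vec![
        // started 5h ago, never completed: gets timed out
        Entry { time_started: now - Duration::from_secs(5 * 3600), time_completed: None, result_kind: None },
        // started just now, still in flight: left alone
        Entry { time_started: now, time_completed: None, result_kind: None },
        // completed normally: left alone
        Entry { time_started: now - Duration::from_secs(5 * 3600), time_completed: Some(now), result_kind: Some("success") },
    ];
    assert_eq!(timeout_stale(&mut entries, cutoff, 1000), 1);
    assert_eq!(entries[0].result_kind, Some("timeout"));
}
```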

@david-crespo david-crespo force-pushed the audit-log-timeout-task branch from 8639848 to b4bf111 Compare March 11, 2026 06:05
@david-crespo
Copy link
Contributor Author

The numbers from colo: 5M rows in the table, only two (!!!!!) incomplete rows.

root@[fd00:1122:3344:116::3]:32221/omicron> select count(*) from omicron.public.audit_log;
   count
-----------
  4898486
(1 row)

root@[fd00:1122:3344:116::3]:32221/omicron> select count(*) from omicron.public.audit_log where time_started > now() - interval '7days';
  count
----------
  385581
(1 row)

root@[fd00:1122:3344:116::3]:32221/omicron> select count(*) from omicron.public.audit_log where time_completed is null;
  count
---------
      2
(1 row)

root@[fd00:1122:3344:116::3]:32221/omicron> select time_started, operation_id from omicron.public.audit_log where time_completed is null;
          time_started          | operation_id
--------------------------------+---------------
  2026-03-02 23:16:32.159074+00 | login_saml
  2025-10-18 15:01:21.671874+00 | login_saml
(2 rows)

@david-crespo david-crespo requested a review from hawkw March 11, 2026 18:48
session_cleanup.max_delete_per_activation = 10000
audit_log_timeout_incomplete.period_secs = 600
audit_log_timeout_incomplete.timeout_secs = 14400
audit_log_timeout_incomplete.max_update_per_activation = 1000
david-crespo (author):

I wonder if it's even necessary to make these tunable in the config as opposed to hard coded in the task definition. I guess it's nice to be able to change them without making a code change, but will we really have to?

hawkw (reviewer):

🤷‍♀️ i don't really care about being able to tune these in the config, but it's fine to leave them since you've already done it?

/// of the immutable record. Overwriting it would change its timestamp
/// and violate the guarantee that the past doesn't change.
#[tokio::test]
async fn test_audit_log_timeout_then_late_completion() {
david-crespo (author):

I think this test belongs here rather than in the timeout task's tests because it's really about what datastore.audit_log_entry_complete() does with an already-timed-out row. I got rid of the other tests that were redundant with the one in the task file, though.

@david-crespo david-crespo force-pushed the audit-log-timeout-task branch from 9ed1955 to c5b7f89 Compare March 13, 2026 05:40
};

-    let cutoff = Utc::now() - timeout_delta;
+    let cutoff = Utc::now() - self.timeout;
hawkw (reviewer):

should this fail if it's out of range?

david-crespo (author):

By out of range, do you mean overflow? In a247d54 I used checked_sub_signed and expect(), but it seems like an overflow would have panicked anyway, just with a slightly less direct error message.

hawkw (reviewer):

I was thinking more that if the subtraction overflows, maybe we should return an error and just not do anything, instead of a sploding. 🤷‍♀️

-    let cutoff = Utc::now() - self.timeout;
+    let cutoff = Utc::now()
+        .checked_sub_signed(self.timeout)
+        .expect("now - timeout overflowed (timeout_secs is too large)");
hawkw (reviewer):

hm, should we panic the whole nexus here, or just fail the activation? is the implication that the assert when we construct the background task should stop this from happening, because i don't think it will (if NTP is extremely un-synced i guess)
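
The "fail the activation instead of panicking" alternative could look roughly like this (a std-only sketch using SystemTime::checked_sub rather than chrono's checked_sub_signed; compute_cutoff is a made-up name):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Compute the sweep cutoff, returning Err (which would fail just this
/// activation) instead of panicking the whole Nexus when `now - timeout`
/// is unrepresentable.
fn compute_cutoff(now: SystemTime, timeout: Duration) -> Result<SystemTime, String> {
    now.checked_sub(timeout)
        .ok_or_else(|| format!("now - timeout ({timeout:?}) overflowed"))
}

fn main() {
    // Normal case: a 4-hour timeout is fine.
    let ok = compute_cutoff(SystemTime::now(), Duration::from_secs(14_400));
    assert!(ok.is_ok());

    // Pathological case: an absurd timeout yields Err rather than a panic.
    let err = compute_cutoff(UNIX_EPOCH, Duration::from_secs(u64::MAX));
    assert!(err.is_err());
}
```

The activation would log the error and report it in the task status, leaving the rows alone until a sane clock/config comes back.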

hawkw (reviewer):

for the record i think this is highly unlikely regardless and i think i'm probably just being annoying about it at this point


Create the task module. The struct holds whatever state it needs (typically `Arc<DataStore>` plus config). Implement `BackgroundTask::activate` by delegating to an `async fn actually_activate(&mut self, opctx) -> YourStatus` method, then serialize the status to `serde_json::Value`. The `actually_activate` pattern makes unit testing easy without going through the trait.

Logging conventions: `debug` when there's nothing to do, `info` when routine work was done, `warn` when the work done indicates something is wrong (e.g., cleaning up after a crash), `error` on failure. Log errors as structured fields with the `; &err` slog syntax (which uses the `SlogInlineError` trait), not by interpolating into the message string. For the error string in the status struct, use `InlineErrorChain::new(&err).to_string()` (from `slog_error_chain`) to capture the full cause chain. Status error strings should not repeat the task name — omdb already shows which task you're looking at.
hawkw (reviewer):

Some of this (the stuff about slog error chains) feels like maybe something slog should always be instructed to do, not just in bg tasks? If there's another place to put that advice that will cause claude to use it any time it writes code in this repo, that could be a good idea.

david-crespo (author):

A checked-in CLAUDE.md for the repo would be a good spot for that. I've held off making one because the work people do in here is so varied that it's hard to write down truly general guidelines.

hawkw (reviewer):

yeah. i feel like there is a lot of stuff that everyone would want it to know, such as "don't use the logger incorrectly" and "remember to run cargo xtask clippy instead of some other thing" and "if you run the tests using cargo test rather than cargo nextest, most of them will not work"


### 2. Task implementation (`nexus/src/app/background/tasks/<name>.rs`)

Create the task module. The struct holds whatever state it needs (typically `Arc<DataStore>` plus config). Implement `BackgroundTask::activate` by delegating to an `async fn actually_activate(&mut self, opctx) -> YourStatus` method, then serialize the status to `serde_json::Value`. The `actually_activate` pattern makes unit testing easy without going through the trait.
hawkw (reviewer):

i absolutely adore the fact that the stupid eliza-ism of actually_activate() is now being blasted directly into Claude's brain for all of eternity :)

hawkw (reviewer):

Also, perhaps we should tell it that I've found it's sometimes better to pass an &mut WhateverStatus into actually_activate() --- in particular, I like to do this when I'm doing something that can complete partially and then return an error. This way, you can record whatever has completed into the status, and then bail out using ? later, but the status has still recorded whatever you already did.
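
The pattern described in that comment might look like this sketch (std-only; Status, do_first_step, and do_second_step are all invented stand-ins):

```rust
#[derive(Default, Debug)]
struct Status {
    rows_updated: usize,
    error: Option<String>,
}

fn do_first_step() -> Result<usize, String> {
    Ok(3)
}

fn do_second_step() -> Result<usize, String> {
    Err("db went away".into())
}

/// Takes `&mut Status` so partial progress is recorded even when a later
/// step bails out with `?`.
fn actually_activate(status: &mut Status) -> Result<(), String> {
    status.rows_updated += do_first_step()?; // recorded immediately
    status.rows_updated += do_second_step()?; // may fail after partial work
    Ok(())
}

fn main() {
    let mut status = Status::default();
    if let Err(e) = actually_activate(&mut status) {
        status.error = Some(e);
    }
    // the first step's work survives in the status despite the later failure
    assert_eq!(status.rows_updated, 3);
    assert!(status.error.is_some());
}
```

Returning the status by value instead would throw away the partial progress whenever an early `?` fires.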


### 10. omdb output (`dev-tools/omdb/src/bin/omdb/nexus.rs`)

Add a `print_task_<name>` function and wire it into the match in `print_task_details` (alphabetical order). Import the status type. Use the `const_max_len` + `WIDTH` pattern to align columns:
hawkw (reviewer):

Wow another thing I did that one time has now been determined to be a Best Practice that the robot should do! This almost makes me feel okay about being replaced! :D

-    let cutoff = Utc::now() - self.timeout;
+    let cutoff = Utc::now()
+        .checked_sub_signed(self.timeout)
+        .expect("now - timeout overflowed (timeout_secs is too large)");
hawkw (reviewer):

for the record i think this is highly unlikely regardless and i think i'm probably just being annoying about it at this point

@hawkw hawkw left a comment

i think this is basically good, by the way, you can decide what to do about my last set of annoying comments

@david-crespo david-crespo enabled auto-merge (squash) March 19, 2026 03:49


Development

Successfully merging this pull request may close these issues.

Audit log: background job to clean up entries that never completed

2 participants