
perf(service): Write HV tombstone before LT data to reduce orphan risk #365

Draft
jan-auer wants to merge 4 commits into main from worktree-invert-tombstones

Conversation


@jan-auer jan-auer commented Mar 10, 2026

Previously, large-object inserts followed LT-first ordering: write data to long-term storage, then write the redirect tombstone to high-volume. Concurrent inserts or pod kills between those two steps left an orphaned long-term object — data in LT with no tombstone in HV — permanently unreachable with no recovery path.

This flips the ordering to HV-first: write the tombstone first, then write the data. A failure between the two steps now produces a headless tombstone (tombstone in HV, no data in LT) instead. Headless tombstones are safe and self-healing: reads return None, deletes remove them, and re-inserts overwrite them.
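The HV-first ordering and the self-healing read semantics can be sketched with in-memory maps standing in for the two backends. All names here (`HvRecord`, `Service`, `insert_large`) are illustrative, not the service's actual API; the `crash_after_tombstone` flag simulates a pod kill between the two writes:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the high-volume (HV) and long-term (LT) stores.
#[derive(Clone, PartialEq, Debug)]
enum HvRecord {
    Tombstone,       // redirect marker: "the payload lives in LT"
    Inline(Vec<u8>), // small payload stored directly in HV
}

struct Service {
    hv: HashMap<String, HvRecord>,
    lt: HashMap<String, Vec<u8>>,
}

impl Service {
    // HV-first: the tombstone is durable before the LT write starts, so a
    // failure in between leaves a headless tombstone, not an LT orphan.
    fn insert_large(&mut self, key: &str, data: Vec<u8>, crash_after_tombstone: bool) {
        self.hv.insert(key.to_string(), HvRecord::Tombstone); // step 1
        if crash_after_tombstone {
            return; // simulated pod kill between the two writes
        }
        self.lt.insert(key.to_string(), data); // step 2
    }

    // A headless tombstone reads as absent: the redirect points at
    // nothing in LT, so the caller sees None.
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        match self.hv.get(key)? {
            HvRecord::Inline(data) => Some(data.clone()),
            HvRecord::Tombstone => self.lt.get(key).cloned(),
        }
    }
}
```

A re-insert of the same key simply repeats both steps, overwriting the headless tombstone, which is what makes the failure state self-healing.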

The tradeoff is deliberate: we lower the risk of orphans in long-term storage — which are silent data leaks with no recovery — and instead accept headless tombstones, which are a well-defined recoverable state.

For small objects, a new put_non_tombstone trait method atomically rejects the write if a tombstone already exists at the key, routing the payload to long-term storage instead. BigTable implements this with CheckAndMutateRowRequest; other backends fall back to a non-atomic read-then-write.
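A minimal sketch of what such a trait could look like. The `ConditionalOutcome { Executed, Tombstone }` result type is named in this PR; the trait shape, the helper method names, and the in-memory backend are assumptions for illustration:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum ConditionalOutcome {
    Executed,  // no tombstone present; the row was written
    Tombstone, // a redirect tombstone exists; caller routes the payload to LT
}

trait HighVolume {
    fn has_tombstone(&self, key: &str) -> bool;
    fn write_row(&mut self, key: &str, value: Vec<u8>);

    // Default: non-atomic read-then-write, for backends without a
    // conditional-write primitive. BigTable would override this with a
    // CheckAndMutateRowRequest so the check and write happen atomically.
    fn put_non_tombstone(&mut self, key: &str, value: Vec<u8>) -> ConditionalOutcome {
        if self.has_tombstone(key) {
            return ConditionalOutcome::Tombstone;
        }
        self.write_row(key, value);
        ConditionalOutcome::Executed
    }
}

// Toy in-memory backend using the non-atomic default.
struct MemHv {
    tombstones: HashSet<String>,
    rows: HashMap<String, Vec<u8>>,
}

impl HighVolume for MemHv {
    fn has_tombstone(&self, key: &str) -> bool {
        self.tombstones.contains(key)
    }
    fn write_row(&mut self, key: &str, value: Vec<u8>) {
        self.rows.insert(key.to_string(), value);
    }
}
```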

One gap remains: a concurrent insert + delete can still race to produce an orphaned long-term object. Fixing that requires per-key serialization. This PR is an intermediate improvement; the full solution with no orphans at all is tracked separately.
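The remaining race can be made concrete with the same in-memory model (all names hypothetical). A delete that lands between the insert's two writes removes the tombstone before the data arrives, so the LT write re-creates exactly the orphan this PR otherwise eliminates:

```rust
use std::collections::HashMap;

// One interleaving of the remaining insert + delete race, modeled
// sequentially with in-memory stand-ins for HV and LT. Returns the
// final (hv, lt) state after the interleaving.
fn race_interleaving() -> (HashMap<String, &'static str>, HashMap<String, Vec<u8>>) {
    let mut hv: HashMap<String, &'static str> = HashMap::new();
    let mut lt: HashMap<String, Vec<u8>> = HashMap::new();

    // insert, step 1: HV-first, so the tombstone lands before the data
    hv.insert("k".to_string(), "tombstone");

    // concurrent delete: sees the tombstone, clears LT (still empty at
    // this point), then removes the tombstone from HV
    lt.remove("k");
    hv.remove("k");

    // insert, step 2: the data reaches LT, but its redirect is gone --
    // data in LT with no tombstone in HV, i.e. an orphan
    lt.insert("k".to_string(), vec![1, 2, 3]);

    (hv, lt)
}
```

Per-key serialization would force the delete to run entirely before or entirely after both insert steps, removing this interleaving.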

Ref FS-236

jan-auer force-pushed the worktree-invert-tombstones branch from 1ab7bfe to 9f5bb43 on March 10, 2026 14:46
jan-auer and others added 3 commits March 10, 2026 16:33
Deduplicates the expiration/mutation-building logic shared between
put_row and put_non_tombstone into a single write_mutations method.

Co-Authored-By: Claude <noreply@anthropic.com>
…tcome

Both types represented the same concept — an operation that either
executed or was blocked by a redirect tombstone — with different variant
names. Unify them into ConditionalOutcome { Executed, Tombstone }.

Co-Authored-By: Claude <noreply@anthropic.com>
…ibility

Keep the ConditionalOutcome rename from this branch while adopting
the pub visibility from main.