fix(releases): Prevent row-lock contention on last_seen bump #115443

Open
yuvmen wants to merge 4 commits into master from yuvmen/fix-rpe-last-seen-contention

yuvmen (Member) commented May 12, 2026

Summary

  • High-volume releases (e.g. popular mobile apps) cause concurrent ingest workers to pile up on UPDATEs of last_seen for the same ReleaseProjectEnvironment/ReleaseEnvironment row, triggering PostgreSQL statement_timeout cancellations (SENTRY-5HQZ, 6105 occurrences).
  • The unhandled OperationalError aborts the rest of the event save pipeline (nodestore persistence, release counts, group release records), causing silent data loss.
  • Adds a cache.add-based distributed lock so only one worker per row per 60s attempts the DB update, eliminating the thundering herd. Also catches OperationalError as a safety net so a failed best-effort last_seen bump never blocks event processing.
  • Applies the same fix to both ReleaseProjectEnvironment and ReleaseEnvironment, which shared the same vulnerable pattern.
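
The throttle-plus-safety-net pattern described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `InMemoryCache`, the local `OperationalError` class, the `bump_last_seen` helper, and the key format are all hypothetical stand-ins so the snippet runs without Django. The only behavior assumed from Django is that `cache.add` stores a value only when the key is absent and returns True when it did.

```python
import time


class InMemoryCache:
    """Hypothetical stand-in for a shared cache; mimics add() semantics only."""

    def __init__(self):
        self._store = {}

    def add(self, key, value, timeout):
        # Like Django's cache.add: set only if the key is absent (or expired),
        # returning True when the value was stored and False otherwise.
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._store[key] = (value, now + timeout)
        return True


class OperationalError(Exception):
    """Stand-in for django.db.OperationalError so the sketch runs without Django."""


def bump_last_seen(cache, row_id, update_row, timeout=60):
    """Return True if this caller attempted the update, False if it was skipped."""
    bump_key = f"last_seen_bump:{row_id}"  # hypothetical key format
    if not cache.add(bump_key, "1", timeout=timeout):
        return False  # another worker already claimed this row's slot
    try:
        update_row(row_id)
    except OperationalError:
        # Best-effort: a timed-out UPDATE must not abort event processing.
        pass
    return True
```

Note that the key is never released explicitly; it simply expires, so this behaves as a per-row throttle rather than a lock in the strict sense. If the claimed attempt fails, the next attempt for that row waits until the key times out.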

Fixes SENTRY-5HQZ

Test plan

  • Existing tests pass (3 for RPE, 1 for RE)
  • New test: test_bump_skipped_when_cache_lock_held — verifies second worker skips the DB update when the cache lock is held
  • New test: test_bump_survives_operational_error — verifies OperationalError is caught and instance is returned
  • mypy and ruff pass

…aseProjectEnvironment and ReleaseEnvironment

High-volume releases cause concurrent workers to pile up on the same row's
UPDATE for last_seen, hitting statement_timeout (SENTRY-5HQZ). This adds a
cache-based distributed lock so only one worker per row per 60s attempts the
DB update, and catches OperationalError so a failed bump doesn't abort the
rest of the event save pipeline.

Fixes SENTRY-5HQZ
yuvmen requested a review from a team as a code owner May 12, 2026 22:07
github-actions (bot) added the "Scope: Backend" label (automatically applied to PRs that change backend components) May 12, 2026
Comment thread on src/sentry/models/releaseenvironment.py (outdated)
    if cache.add(bump_key, "1", timeout=60):
        try:
            cls.objects.filter(
                id=instance.id, last_seen__lt=datetime - timedelta(seconds=60)
Reviewer (Member):

This could probably be updated to last_seen__lt=datetime now, since the lock is doing the work for us

yuvmen (Member, Author):

ah true

    if cache.add(bump_key, "1", timeout=60):
        try:
            cls.objects.filter(
                id=instance.id, last_seen__lt=datetime - timedelta(seconds=60)
Reviewer (Member):

Same comment as above

yuvmen added 2 commits May 12, 2026 15:20
The cache lock handles the 60s throttle now, so the SQL filter only
needs to prevent setting last_seen backwards.
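
The remaining job of the SQL filter (never moving last_seen backwards) can be illustrated with a small standalone sketch. It uses the standard-library sqlite3 module instead of the Django ORM, and the `release_env` table and the `make_db`, `bump`, and `current` helpers are hypothetical names chosen for the example. It also assumes the filtered queryset feeds an update of last_seen, which the excerpted diff does not show.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def make_db(initial_last_seen):
    # Hypothetical one-table schema standing in for the real model's table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE release_env (id INTEGER PRIMARY KEY, last_seen TEXT)")
    conn.execute("INSERT INTO release_env VALUES (1, ?)", (initial_last_seen.isoformat(),))
    return conn


def bump(conn, row_id, seen_at):
    # Roughly what filter(id=..., last_seen__lt=seen_at) followed by an update
    # would express: the WHERE clause lets the row move forward in time only.
    # All timestamps here share one ISO format and UTC offset, so comparing
    # them as text matches chronological order.
    cur = conn.execute(
        "UPDATE release_env SET last_seen = ? WHERE id = ? AND last_seen < ?",
        (seen_at.isoformat(), row_id, seen_at.isoformat()),
    )
    return cur.rowcount  # number of rows actually changed


def current(conn, row_id):
    row = conn.execute("SELECT last_seen FROM release_env WHERE id = ?", (row_id,)).fetchone()
    return datetime.fromisoformat(row[0])
```

With this guard, an event carrying an older timestamp matches zero rows and leaves the stored value untouched, while a newer timestamp updates it. The 60-second spacing between attempts is then left entirely to the cache key.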

2 participants