Skip to content

Add read-replica routing for Perfherder read-only endpoints#9528

Open
camd wants to merge 9 commits into
masterfrom
camd/read-replica-queries
Open

Add read-replica routing for Perfherder read-only endpoints#9528
camd wants to merge 9 commits into
masterfrom
camd/read-replica-queries

Conversation

@camd
Copy link
Copy Markdown
Collaborator

@camd camd commented May 18, 2026

Summary

Routes 11 read-only Perfherder GET endpoints through a separate read_replica PostgreSQL connection to relieve write-side contention on the primary database. Gated behind two env vars (READ_REPLICA_DATABASE_URL + READ_REPLICA_ENABLED) so it ships as a no-op until ops flips the switch.

  • New ReadReplicaRouter (Django DATABASE_ROUTERS) routes reads to the replica only when a thread-local flag is set and the model's app label is in an explicit allow-list ({"perf", "model"}).
  • New ReadReplicaMixin flips that flag for GET/HEAD/OPTIONS only, on a one-shot retry against primary if the replica raises OperationalError/InterfaceError, and emits a db_routing_fallback warning log. The mixin is a true no-op when the alias isn't configured.
  • Applied to 11 read-only viewsets in treeherder/webapp/api/performance_data.py. The 3 mutating viewsets (PerformanceAlertSummaryViewSet, PerformanceAlertViewSet, PerformanceTagViewSet) are intentionally left on primary so create→read flows stay consistent.

Design doc: .claude/plans/READ_REPLICA_DESIGN.md (local). Mechanism is reusable — applying it to Logviewer / Intermittent Failures View later is a one-line viewset change each.

Test plan

  • Unit tests for router contract (allow-list, read/write/relation/migrate behavior) — 7 tests.
  • Unit tests for mixin (safe-method flip, no-flip on POST, state cleared on exception, one-shot retry, fallback log, no-op when alias absent) — 6 tests.
  • Integration tests with CaptureQueriesContext asserting the replica alias actually serves opted-in endpoints and does NOT serve excluded ones — 4 tests.
  • Full backend suite: 934 passed, 2 skipped, 10 xfailed — no regressions vs master.
  • Ops: set READ_REPLICA_DATABASE_URL in stage; flip READ_REPLICA_ENABLED=true; soak; watch db_routing_fallback log volume + primary DB CPU; then prod.
  • Kill switch: set READ_REPLICA_ENABLED=false (one env-var flip) and verify all routed endpoints continue returning correct data via primary.

🤖 Generated with Claude Code

camd and others added 7 commits May 17, 2026 12:58
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Routes 10 read-only Perfherder viewsets to the read replica (PerformanceSignatureViewSet,
PerformancePlatformViewSet, PerformanceJobViewSet, PerformanceDatumViewSet,
PerformanceBugTemplateViewSet, PerformanceIssueTrackerViewSet, PerformanceSummary,
PerformanceAlertSummaryTasks, PerfCompareResults, TestSuiteHealthViewSet).
Adds integration tests for signature list and summary endpoints. Updates
test_perfcompare_api.py, test_performance_data_api.py, and
test_performance_bug_template_api.py to declare read_replica database access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two review fixes:

1. When the kill switch is off (READ_REPLICA_ENABLED=false or
   READ_REPLICA_DATABASE_URL unset), the mixin previously still wrapped
   dispatch, set the thread-local (a no-op without a router), and would
   emit a misleading "db_routing_fallback" log on any primary failure.
   The mixin now checks connections.databases for the alias and bypasses
   wrapping when absent.

2. The three test files that gained databases=["default","read_replica"]
   implicitly relied on fixtures pulling in transactional_db. Make the
   transactional requirement explicit with transaction=True so the
   behavior survives future fixture refactors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 18, 2026

Codecov Report

❌ Patch coverage is 96.96970% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.84%. Comparing base (3519a51) to head (fb7dc10).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
treeherder/config/db_routing.py 92.30% 4 Missing ⚠️
treeherder/config/settings.py 60.00% 2 Missing ⚠️
tests/config/test_db_routing.py 99.21% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9528      +/-   ##
==========================================
+ Coverage   82.75%   82.84%   +0.08%     
==========================================
  Files         604      607       +3     
  Lines       34973    35190     +217     
  Branches     3261     3276      +15     
==========================================
+ Hits        28943    29153     +210     
- Misses       5661     5668       +7     
  Partials      369      369              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

into reading from the ``read_replica`` database alias. The router only routes
models whose Django app label is in :data:`READ_REPLICA_APP_ALLOW_LIST`.

Design: .claude/plans/READ_REPLICA_DESIGN.md
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

Comment thread docs/accessing_data.md Outdated
The router and mixin are then active in dev. Routing is a no-op unless an
endpoint opts in via `ReadReplicaMixin`; see
`treeherder/config/db_routing.py` and the design at
`.claude/plans/READ_REPLICA_DESIGN.md`.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

_RecordingView.raise_on_call = 0
_RecordingView.call_count = 0
_RecordingView.saw_use_replica = []
yield
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of this?

camd and others added 2 commits May 18, 2026 11:26
The feature is a no-op locally with the default .env (both env vars
unset → no read_replica alias → mixin early-returns). Standard local
dev doesn't need to know about the routing. The design, code-level
docstring, and PR deploy notes cover the cases that do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Successful replica routing was previously silent — only the fallback path
emitted a log line. Adds a single DEBUG-level log per routed request so a
developer (or ops) can grep \`db_routing routed_to=read_replica\` to confirm
which requests hit the replica.

Prod-silent by default (logger threshold is INFO); enable locally with
LOGGING_LEVEL=DEBUG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants