Skip to content

Bug 2017561 - Use IN clauses instead of OR-chains in alert summary#9535

Open
florinbilt wants to merge 3 commits into
mozilla:masterfrom
florinbilt:audit
Open

Bug 2017561 - Use IN clauses instead of OR-chains in alert summary#9535
florinbilt wants to merge 3 commits into
mozilla:masterfrom
florinbilt:audit

Conversation

@florinbilt
Copy link
Copy Markdown
Collaborator

Summary

Refactor _build_tc_metadata_map and _build_duplicated_summaries_map in PerformanceAlertSummaryViewSet to use SQL IN clauses instead of chunked OR-chains. No behavior change —
payload is byte-for-byte identical (verified by JSON diff across 10 summary IDs of varying sizes).

Problem

For alert summaries with thousands of alerts, /api/performance/alertsummary/?id= was issuing up to 9 chunked queries against performance_datum, each with a 500-way OR over
(repository_id, signature_id, push_id) tuples — ~69 KB of SQL per query. On a real summary with 2103 alerts (id 43868), this alone accounted for ~5s of DB time.

Solution

Collapse the lookup into a single query that filters via three IN clauses (repository_id__in, signature_id__in, push_id__in) and discard the small amount of cartesian
over-fetch in Python. The same treatment is applied to _build_duplicated_summaries_map.

Benchmarks — DB time

Staging Postgres, 10 representative summary IDs, 7 runs each, mock-tc to isolate DB performance, median reported.

ID alerts Queries before->after DB ms before->after DB gain
43868 2103 21 -> 13 4515 -> 2234 -50.5%
44024 1430 18 -> 13 3678 -> 2217 -39.7%
43986 563 15 -> 13 2805 -> 2186 -22.1%
44920 470 14 -> 13 2584 -> 2176 -15.8%
45593 422 14 -> 13 2539 -> 2179 -14.2%
45169 303 14 -> 13 2537 -> 2188 -13.8%
45512 267 14 -> 13 2498 -> 2174 -13.0%
41781 211 13 -> 13 2284 -> 2162 -5.3%
43290 193 13 -> 13 2270 -> 2166 -4.6%
44042 133 13 -> 13 2254 -> 2165 -3.9%

Query count is now constant (13) regardless of summary size; previously it scaled with alert count due to OR-chain chunking.

Benchmarks — real endpoint

GET /api/performance/alertsummary/?id=X. cold = first request with Redis cleared; warm = median of 3 subsequent requests.

ID alerts regressions Opt cold Opt warm Base cold Base warm Cold gain
43868 2103 0 ( 0%) 7.16s 7.48s 17.13s 10.53s -58%
41867 120 56 ( 47%) 4.21s 3.71s 4.56s 3.85s -8%
45158 401 289 ( 72%) 19.70s 9.21s 19.44s 9.68s ~0%
43986 563 563 (100%) 41.29s 40.84s 43.16s 42.00s -4%
44024 1430 1430(100%) 101.06s 96.92s 103.63s 105.57s -2%

The DB savings are present everywhere; their visibility at endpoint level depends on the alert mix:

  • Improvement-heavy summaries (low regression count) are DB-bound — the optimization shows up in full (e.g., id 43868: 17s -> 7s cold, the original problem case from the bug).
  • Regression-heavy summaries are bottlenecked by external Taskcluster HTTP calls inside _fetch_profile_urls, which the DB optimization cannot offset.

@junngo
Copy link
Copy Markdown
Contributor

junngo commented May 21, 2026

Thanks for the detailed tests and the diff validation!
I also checked the query plans on more usual, smaller-scale data via Redash, where the query performance difference was pretty small. But the production scale benchmarks here look reasonable, especially with the query count no longer scaling with alert volume.
Happy with the change :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants