Description
Component
SQL collection scripts
Performance Monitor Version
PerformanceMonitor v2.0.0
SQL Server Version
SQL Server 2019 Standard Edition (15.0.2130.3)
Windows Version
Windows Server 2019
Describe the Bug
Three related bugs in collect.query_store_collector combine to cause
catastrophic resource consumption on instances with large Query Store databases.
Bug 1 — Wrong cutoff_time calculation (Critical)
The cutoff time is calculated as:
SELECT @last_collection_time = MAX(qsd.utc_last_execution_time)
FROM collect.query_store_data AS qsd;
This uses the last observed query execution time from collected data,
not the last time the collector actually ran. Since utc_last_execution_time
reflects when queries were executed in monitored databases rather than
when they were collected, the MAX value does not advance with each
collection cycle. The cutoff stays anchored in the past, causing full
re-collection of all historical data on every run.
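This is easy to confirm on an affected instance by comparing the cutoff the collector derives against the time it actually last ran. The diagnostic below is a sketch using only the table and column names shown in this report:

```sql
-- Compare the cutoff used by the collector (MAX of observed execution times)
-- with the latest successful run recorded in the log.
SELECT
    (SELECT MAX(qsd.utc_last_execution_time)
     FROM collect.query_store_data AS qsd)    AS cutoff_used_by_collector,
    (SELECT MAX(cl.collection_time)
     FROM config.collection_log AS cl
     WHERE cl.collector_name = N'query_store_collector'
       AND cl.collection_status = N'SUCCESS') AS last_successful_run;
-- On an affected instance the first value lags far behind the second
-- and does not advance between collection cycles.
```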
Bug 2 — First run collects 3 days of history regardless of Query Store size (High)
The hardcoded 3-day lookback on first run is dangerous in environments
with large Query Store databases. With 5.6 GB of Query Store data spread
across 9 databases, this means decompressing thousands of XML execution
plans in a single run.
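The per-database Query Store footprint that triggers this can be measured with the standard sys.database_query_store_options catalog view (run in the context of each database):

```sql
-- Query Store size for the current database (SQL Server 2016+)
SELECT
    DB_NAME()               AS database_name,
    actual_state_desc,
    current_storage_size_mb,
    max_storage_size_mb
FROM sys.database_query_store_options;
```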
Bug 3 — Default frequency too aggressive for heavy collectors (Medium)
query_store_collector defaults to every 2 minutes with a
max_duration_minutes of 10, while the Hung Job Monitor threshold
defaults to 5 minutes. In large environments the collector never finishes
before being killed.
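The shipped defaults can be verified with a query like the following, assuming the config.collection_schedule columns referenced later in this report:

```sql
-- Inspect the installed schedule for the heavy collectors
SELECT cs.collector_name, cs.frequency_minutes, cs.max_duration_minutes
FROM config.collection_schedule AS cs
WHERE cs.collector_name IN
(
    N'query_store_collector',
    N'query_stats_collector',
    N'procedure_stats_collector'
);
```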
Steps to Reproduce
- Install PerformanceMonitor v2.0.0 on a SQL Server instance with
multiple databases having Query Store enabled and significant history
(in our case: 9 databases, ~5.6 GB total Query Store data,
5 databases between 790 MB and 870 MB each)
- Let the Collection job run at its default schedule (every 2 minutes)
- Observe the job running indefinitely without completing
- Check sp_WhoIsActive — the query_store_collector session will show
hundreds of millions of reads and massive tempdb consumption
- After 5 minutes the Hung Job Monitor kills the job
- The job restarts and the cycle repeats
- Overnight: PerformanceMonitor database grows by 150+ GB,
tempdb exhausted, server out of disk space
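To capture the sp_WhoIsActive evidence above, a filtered call such as the one below narrows the output to Agent job sessions (sp_WhoIsActive must be installed separately; the filter pattern is illustrative):

```sql
-- Agent job sessions report a program name starting with 'SQLAgent'
EXEC sp_WhoIsActive
    @get_transaction_info = 1,  -- open_tran_count, tempdb and log usage
    @get_additional_info = 1,
    @filter_type = 'program',
    @filter = 'SQLAgent%';
```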
Expected Behavior
- First run should collect a reasonable window of data
(e.g. the frequency_minutes interval, not 3 days)
- Subsequent runs should collect only data since the last
successful collection, based on when the collector ran
rather than when queries were last executed
- Each collection cycle should complete well within the
configured max_duration_minutes
Actual Behavior
- Collector ran for 47+ minutes without completing
- 330 million reads, 28 million writes in a single run
- 150+ GB of tempdb allocated (spilling to disk)
- PAGEIOLATCH_SH / PAGEIOLATCH_EX waits on tempdb throughout
- PerformanceMonitor database grew by 150+ GB overnight
- Hung Job Monitor killed the job after 5 minutes on every run
- No data was ever successfully collected (all rolled back)
Error Messages / Log Output
sp_WhoIsActive output at peak:
duration: 00:47:21
wait_type: PAGEIOLATCH_SH / PAGEIOLATCH_EX (tempdb)
CPU: 1,112,600 ms
reads: 330,924,462
writes: 28,505,148
tempdb_allocations: 23,736,504 pages (~185 GB)
tempdb_current: 5,894,024 pages (~46 GB)
open_tran_count: 2
config.collection_log:
collector_name: hung_job_monitor
collection_status: JOB_HUNG
error_message: Job "PerformanceMonitor - Collection" hung for
7 minutes (threshold: 5 minutes). Stopping job.
Additional Context
Environment:
- SQL Server 2019 Standard Edition (15.0.2130.3)
- Windows Server 2019
- PerformanceMonitor v2.0.0
- 9 databases with Query Store enabled
- Query Store sizes: 12 MB to 867 MB per database,
5 databases between 790 MB and 870 MB
- Total Query Store data: ~5.6 GB
Proposed fixes:
Bug 1 — Replace MAX(utc_last_execution_time) with last successful
collection time from config.collection_log:
-- BEFORE:
SELECT @last_collection_time = MAX(qsd.utc_last_execution_time)
FROM collect.query_store_data AS qsd;
-- AFTER:
SELECT TOP (1)
@last_collection_time = TODATETIMEOFFSET(cl.collection_time, 0)
FROM config.collection_log AS cl
WHERE cl.collector_name = N'query_store_collector'
AND cl.collection_status = N'SUCCESS'
ORDER BY cl.collection_time DESC;
Bug 2 — Replace 3-day hardcoded lookback with frequency_minutes interval:
-- BEFORE:
SET @cutoff_time = TODATETIMEOFFSET(DATEADD(DAY, -3, SYSUTCDATETIME()), 0);
-- AFTER:
SET @cutoff_time = TODATETIMEOFFSET(
DATEADD(MINUTE, -ISNULL(@collection_interval_minutes, 30), SYSUTCDATETIME()),
0
);
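For completeness, @collection_interval_minutes in the snippet above could be sourced from the collector's own schedule row rather than hardcoded. This is a sketch assuming the config.collection_schedule columns used elsewhere in this report:

```sql
DECLARE @collection_interval_minutes int;

-- Read the collector's configured frequency; fall back to 30 minutes
-- (matching the ISNULL default above) if no schedule row exists.
SELECT @collection_interval_minutes = cs.frequency_minutes
FROM config.collection_schedule AS cs
WHERE cs.collector_name = N'query_store_collector';
```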
Bug 3 — Increase default frequency for heavy collectors in installation scripts:
UPDATE config.collection_schedule
SET frequency_minutes = 15, max_duration_minutes = 12
WHERE collector_name IN
(
N'query_store_collector',
N'query_stats_collector',
N'procedure_stats_collector'
);
-- And increase @normal_max_duration_minutes from 5 to 15
-- in 45_create_agent_jobs.sql for the Hung Job Monitor.
Workaround applied while waiting for fix:
Inserted a seed row in collect.query_store_data to anchor
MAX(utc_last_execution_time) to the current time, preventing
the 3-day first-run lookback from triggering on reinstallation.
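The seed row looked roughly like this; the column list is illustrative, since collect.query_store_data has additional required columns not shown in this report:

```sql
-- Illustrative only: anchor MAX(utc_last_execution_time) to "now" so the
-- first-run 3-day lookback never fires. Supply the table's other
-- required columns as appropriate for your install.
INSERT INTO collect.query_store_data (utc_last_execution_time)
VALUES (TODATETIMEOFFSET(SYSUTCDATETIME(), 0));
```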