Skip to content

[BUG] query_store_collector: cutoff_time bug causes full re-collection on every run, leading to tempdb exhaustion and database explosion #335

@ydeo-fcourtaud

Description

@ydeo-fcourtaud

Component

SQL collection scripts

Performance Monitor Version

PerformanceMonitor v2.0.0

SQL Server Version

SQL Server 2019 Standard Edition (15.0.2130.3)

Windows Version

Windows Server 2019

Describe the Bug

Three related bugs in collect.query_store_collector that combine to cause
catastrophic resource consumption on environments with large Query Store databases.

Bug 1 — Wrong cutoff_time calculation (Critical)
The cutoff time is calculated as:

SELECT @last_collection_time = MAX(qsd.utc_last_execution_time)
FROM collect.query_store_data AS qsd;

This uses the last observed query execution time from collected data,
not the last time the collector actually ran. Since utc_last_execution_time
reflects when queries were executed in monitored databases rather than
when they were collected, the MAX value does not advance with each
collection cycle. The cutoff stays anchored in the past, causing full
re-collection of all historical data on every run.

Bug 2 — First run collects 3 days of history regardless of Query Store size (High)
The hardcoded 3-day lookback on first run is dangerous on environments
with large Query Store databases. With 5.6 GB of Query Store data spread
across 9 databases, this means decompressing thousands of XML execution
plans in a single run.

Bug 3 — Default frequency too aggressive for heavy collectors (Medium)
query_store_collector defaults to every 2 minutes with a
max_duration_minutes of 10, while the Hung Job Monitor threshold
defaults to 5 minutes. On large environments the collector never finishes
before being killed.

Steps to Reproduce

  1. Install PerformanceMonitor v2.0.0 on a SQL Server instance with
    multiple databases having Query Store enabled and significant history
    (in our case: 9 databases, ~5.6 GB total Query Store data,
    5 databases between 790 MB and 870 MB each)
  2. Let the Collection job run at its default schedule (every 2 minutes)
  3. Observe the job running indefinitely without completing
  4. Check sp_WhoIsActive — the query_store_collector session will show
    hundreds of millions of reads and massive tempdb consumption
  5. After 5 minutes the Hung Job Monitor kills the job
  6. The job restarts and the cycle repeats
  7. Overnight: PerformanceMonitor database grows by 150+ GB,
    tempdb exhausted, server out of disk space

Expected Behavior

  • First run should collect a reasonable window of data
    (e.g. frequency_minutes interval, not 3 days)
  • Subsequent runs should collect only data since the last
    successful collection, based on when the collector ran
    rather than when queries were last executed
  • Each collection cycle should complete well within the
    configured max_duration_minutes

Actual Behavior

  • Collector ran for 47+ minutes without completing
  • 330 million reads, 28 million writes in a single run
  • 150+ GB of tempdb allocated (spilling to disk)
  • PAGEIOLATCH_SH / PAGEIOLATCH_EX waits on tempdb throughout
  • PerformanceMonitor database grew by 150+ GB overnight
  • Hung Job Monitor killed the job after 5 minutes on every run
  • No data was ever successfully collected (all rolled back)

Error Messages / Log Output

sp_WhoIsActive output at peak:

    duration:          00:47:21
    wait_type:         PAGEIOLATCH_SH / PAGEIOLATCH_EX (tempdb)
    CPU:               1,112,600 ms
    reads:             330,924,462
    writes:             28,505,148
    tempdb_allocations: 23,736,504 pages (~185 GB)
    tempdb_current:      5,894,024 pages (~46 GB)
    open_tran_count:     2

config.collection_log:

    collector_name:      hung_job_monitor
    collection_status:   JOB_HUNG
    error_message:       Job "PerformanceMonitor - Collection" hung for 
                         7 minutes (threshold: 5 minutes). Stopping job.

Screenshots

No response

Additional Context

Environment:

  • SQL Server 2019 Standard Edition (15.0.2130.3)
  • Windows Server 2019
  • PerformanceMonitor v2.0.0
  • 9 databases with Query Store enabled
  • Query Store sizes: 12 MB to 867 MB per database,
    5 databases between 790 MB and 870 MB
  • Total Query Store data: ~5.6 GB

Proposed fixes:

Bug 1 — Replace MAX(utc_last_execution_time) with last successful
collection time from config.collection_log:

-- BEFORE:
SELECT @last_collection_time = MAX(qsd.utc_last_execution_time)
FROM collect.query_store_data AS qsd;

-- AFTER:
SELECT TOP (1)
    @last_collection_time = TODATETIMEOFFSET(cl.collection_time, 0)
FROM config.collection_log AS cl
WHERE cl.collector_name = N'query_store_collector'
AND   cl.collection_status = N'SUCCESS'
ORDER BY cl.collection_time DESC;

Bug 2 — Replace 3-day hardcoded lookback with frequency_minutes interval:

-- BEFORE:
SET @cutoff_time = TODATETIMEOFFSET(DATEADD(DAY, -3, SYSUTCDATETIME()), 0);

-- AFTER:
SET @cutoff_time = TODATETIMEOFFSET(
    DATEADD(MINUTE, -ISNULL(@collection_interval_minutes, 30), SYSUTCDATETIME()),
    0
);

Bug 3 — Increase default frequency for heavy collectors in installation scripts:

UPDATE config.collection_schedule
SET frequency_minutes = 15, max_duration_minutes = 12
WHERE collector_name IN
(
    N'query_store_collector',
    N'query_stats_collector',
    N'procedure_stats_collector'
);

-- And increase @normal_max_duration_minutes from 5 to 15 
-- in 45_create_agent_jobs.sql for the Hung Job Monitor.

Workaround applied while waiting for fix:
Inserted a seed row in collect.query_store_data to anchor
MAX(utc_last_execution_time) to the current time, preventing
the 3-day first-run lookback from triggering on reinstallation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions