[BUG] query_store_collector: cutoff_time bug causes full re-collection on every run, leading to tempdb exhaustion and database explosion

### Component

SQL collection scripts

### Performance Monitor Version

PerformanceMonitor v2.0.0

### SQL Server Version

SQL Server 2019 Standard Edition (15.0.2130.3)

### Windows Version

Windows Server 2019

### Describe the Bug

Three related bugs in `collect.query_store_collector` that combine to cause 
catastrophic resource consumption on environments with large Query Store databases.

**Bug 1 — Wrong cutoff_time calculation (Critical)**
The cutoff time is calculated as:

    SELECT @last_collection_time = MAX(qsd.utc_last_execution_time)
    FROM collect.query_store_data AS qsd;

This uses the last *observed query execution time* from collected data, 
not the last time the collector actually ran. Since `utc_last_execution_time` 
reflects when queries were *executed in monitored databases* rather than 
when they were *collected*, the MAX value does not advance with each 
collection cycle. The cutoff stays anchored in the past, causing full 
re-collection of all historical data on every run.

**Bug 2 — First run collects 3 days of history regardless of Query Store size (High)**
The hardcoded 3-day lookback on first run is dangerous on environments 
with large Query Store databases. With 5.6 GB of Query Store data spread 
across 9 databases, this means decompressing thousands of XML execution 
plans in a single run.

**Bug 3 — Default frequency too aggressive for heavy collectors (Medium)**
`query_store_collector` defaults to every 2 minutes with a 
`max_duration_minutes` of 10, while the Hung Job Monitor threshold 
defaults to 5 minutes. On large environments the collector never finishes 
before being killed.

### Steps to Reproduce

1. Install PerformanceMonitor v2.0.0 on a SQL Server instance with 
   multiple databases having Query Store enabled and significant history
   (in our case: 9 databases, ~5.6 GB total Query Store data, 
   5 databases between 790 MB and 870 MB each)
2. Let the Collection job run at its default schedule (every 2 minutes)
3. Observe the job running indefinitely without completing
4. Check sp_WhoIsActive — the query_store_collector session will show 
   hundreds of millions of reads and massive tempdb consumption
5. After 5 minutes the Hung Job Monitor kills the job
6. The job restarts and the cycle repeats
7. Overnight: PerformanceMonitor database grows by 150+ GB, 
   tempdb exhausted, server out of disk space

### Expected Behavior

- First run should collect a reasonable window of data 
  (e.g. frequency_minutes interval, not 3 days)
- Subsequent runs should collect only data since the last 
  successful collection, based on when the collector ran 
  rather than when queries were last executed
- Each collection cycle should complete well within the 
  configured max_duration_minutes

### Actual Behavior

- Collector ran for 47+ minutes without completing
- 330 million reads, 28 million writes in a single run
- 150+ GB of tempdb allocated (spilling to disk)
- PAGEIOLATCH_SH / PAGEIOLATCH_EX waits on tempdb throughout
- PerformanceMonitor database grew by 150+ GB overnight
- Hung Job Monitor killed the job after 5 minutes on every run
- No data was ever successfully collected (all rolled back)

### Error Messages / Log Output

```text
sp_WhoIsActive output at peak:

    duration:          00:47:21
    wait_type:         PAGEIOLATCH_SH / PAGEIOLATCH_EX (tempdb)
    CPU:               1,112,600 ms
    reads:             330,924,462
    writes:             28,505,148
    tempdb_allocations: 23,736,504 pages (~185 GB)
    tempdb_current:      5,894,024 pages (~46 GB)
    open_tran_count:     2

config.collection_log:

    collector_name:      hung_job_monitor
    collection_status:   JOB_HUNG
    error_message:       Job "PerformanceMonitor - Collection" hung for 
                         7 minutes (threshold: 5 minutes). Stopping job.
```

### Screenshots

_No response_

### Additional Context

**Environment:**
- SQL Server 2019 Standard Edition (15.0.2130.3)
- Windows Server 2019
- PerformanceMonitor v2.0.0
- 9 databases with Query Store enabled
- Query Store sizes: 12 MB to 867 MB per database,
  5 databases between 790 MB and 870 MB
- Total Query Store data: ~5.6 GB

**Proposed fixes:**

Bug 1 — Replace MAX(utc_last_execution_time) with last successful 
collection time from config.collection_log:

    -- BEFORE:
    SELECT @last_collection_time = MAX(qsd.utc_last_execution_time)
    FROM collect.query_store_data AS qsd;

    -- AFTER:
    SELECT TOP (1)
        @last_collection_time = TODATETIMEOFFSET(cl.collection_time, 0)
    FROM config.collection_log AS cl
    WHERE cl.collector_name = N'query_store_collector'
    AND   cl.collection_status = N'SUCCESS'
    ORDER BY cl.collection_time DESC;

Bug 2 — Replace 3-day hardcoded lookback with frequency_minutes interval:

    -- BEFORE:
    SET @cutoff_time = TODATETIMEOFFSET(DATEADD(DAY, -3, SYSUTCDATETIME()), 0);

    -- AFTER:
    SET @cutoff_time = TODATETIMEOFFSET(
        DATEADD(MINUTE, -ISNULL(@collection_interval_minutes, 30), SYSUTCDATETIME()),
        0
    );

Bug 3 — Increase default frequency for heavy collectors in installation scripts:

    UPDATE config.collection_schedule
    SET frequency_minutes = 15, max_duration_minutes = 12
    WHERE collector_name IN
    (
        N'query_store_collector',
        N'query_stats_collector',
        N'procedure_stats_collector'
    );

    -- And increase @normal_max_duration_minutes from 5 to 15 
    -- in 45_create_agent_jobs.sql for the Hung Job Monitor.

**Workaround applied while waiting for fix:**
Inserted a seed row in collect.query_store_data to anchor 
MAX(utc_last_execution_time) to the current time, preventing 
the 3-day first-run lookback from triggering on reinstallation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] query_store_collector: cutoff_time bug causes full re-collection on every run, leading to tempdb exhaustion and database explosion #335

Component

Performance Monitor Version

SQL Server Version

Windows Version

Describe the Bug

Steps to Reproduce

Expected Behavior

Actual Behavior

Error Messages / Log Output

Screenshots

Additional Context

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[BUG] query_store_collector: cutoff_time bug causes full re-collection on every run, leading to tempdb exhaustion and database explosion #335

Description

Component

Performance Monitor Version

SQL Server Version

Windows Version

Describe the Bug

Steps to Reproduce

Expected Behavior

Actual Behavior

Error Messages / Log Output

Screenshots

Additional Context

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions