Conversation
…points, computes flap status in memory, and returns a HashSet<Guid>. The loop uses that pre-computed set — zero async calls inside the loop.
| # | Issue | Severity | Status |
| --- | --- | --- | --- |
| 5 | HistoryService full scan per request | High | ✅ Fixed |
PR Title
Fix SQLite lock contention, thread pool starvation, and unbounded raw data queries
PR Description
This PR resolves major SQLite performance and stability issues affecting live monitoring, status page responsiveness, rollup processing, and long-term database growth.
The primary root causes were:
- Concurrent SQLite writes from probe tasks
- Full-table scans against `check_result_raw`
- Sync-over-async thread pool blocking
- Excessive per-probe DB access
- Unbounded historical queries loading millions of rows into memory
- Missing automated pruning
The changes introduce WAL mode, a centralized async write queue, query optimizations, endpoint caching, scheduled pruning, and multiple background processing improvements.
Changes Included
1. Added `SqliteWalInterceptor`

Introduced a connection interceptor that applies SQLite PRAGMA optimizations on every new DB connection:

- `journal_mode=WAL`
- `busy_timeout=5000`
- `synchronous=NORMAL`
- `cache_size=-8000`

This enables concurrent reads/writes, reduces lock contention, and improves overall DB responsiveness.
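A minimal sketch of what such an interceptor can look like, assuming the standard EF Core `DbConnectionInterceptor` hook; the class name matches this PR, but the body below is illustrative rather than the merged code:

```csharp
using System.Data.Common;
using Microsoft.EntityFrameworkCore.Diagnostics;

// Sketch: apply the PRAGMAs listed above whenever EF Core opens a connection.
public sealed class SqliteWalInterceptor : DbConnectionInterceptor
{
    public override void ConnectionOpened(DbConnection connection, ConnectionEndEventData eventData)
    {
        using var cmd = connection.CreateCommand();
        cmd.CommandText =
            "PRAGMA journal_mode=WAL;" +
            "PRAGMA busy_timeout=5000;" +
            "PRAGMA synchronous=NORMAL;" +
            "PRAGMA cache_size=-8000;";
        cmd.ExecuteNonQuery();
    }
}

// Registration (assumed): optionsBuilder.AddInterceptors(new SqliteWalInterceptor());
```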
2. Added `CheckResultWriteQueue`

Implemented a centralized background write queue for probe results.

Previous behavior: each probe executed its own `SaveChangesAsync()` call, causing massive SQLite write lock contention when many probes completed simultaneously.

New behavior (sketched after the result list below):

- Probe results are queued via `Channel<T>`
- A single background writer batches up to 100 items per save
- Queue bounded at 10,000 entries with an oldest-item drop strategy
- RTT updates use `ExecuteUpdateAsync`
- Remaining items are flushed during shutdown

Result:

- Eliminates concurrent SQLite writers
- Reduces write amplification
- Improves monitoring stability under load
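A condensed sketch of this pattern, assuming hypothetical `CheckResult` and `AppDbContext` type names and .NET implicit usings; the capacity, batch size, and drop strategy mirror the figures above, while shutdown flushing and the `ExecuteUpdateAsync` RTT path are omitted for brevity:

```csharp
using System.Threading.Channels;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Stand-in for the real probe result entity.
public sealed record CheckResult(Guid EndpointId, bool IsUp, double RttMs, DateTime Timestamp);

public sealed class CheckResultWriteQueue : BackgroundService
{
    private readonly Channel<CheckResult> _channel = Channel.CreateBounded<CheckResult>(
        new BoundedChannelOptions(10_000)
        {
            FullMode = BoundedChannelFullMode.DropOldest, // drop the oldest item when full
            SingleReader = true
        });

    private readonly IServiceScopeFactory _scopeFactory;

    public CheckResultWriteQueue(IServiceScopeFactory scopeFactory) => _scopeFactory = scopeFactory;

    // Probes enqueue and return immediately; no DbContext on the probe path.
    public bool Enqueue(CheckResult result) => _channel.Writer.TryWrite(result);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var batch = new List<CheckResult>(100);
        while (await _channel.Reader.WaitToReadAsync(stoppingToken))
        {
            batch.Clear();
            while (batch.Count < 100 && _channel.Reader.TryRead(out var item))
                batch.Add(item);

            // Single writer => no concurrent SQLite write locks.
            using var scope = _scopeFactory.CreateScope();
            var db = scope.ServiceProvider.GetRequiredService<AppDbContext>();
            db.AddRange(batch);
            await db.SaveChangesAsync(stoppingToken);
        }
    }
}
```

Bounding the channel with `DropOldest` trades losing the oldest queued results under extreme backlog for keeping probe tasks non-blocking.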
3. Updated `OutageDetectionService`

Replaced inline async DB writes with fire-and-forget queue enqueueing (a sketch follows below):

- `SaveCheckResultAsync()` → `SaveCheckResult()`
- Actual persistence handled asynchronously by `CheckResultWriteQueue`

This removes DB latency from probe execution paths.
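A sketch of the new call shape, reusing the hypothetical `Enqueue` method from the queue sketch above; the method names come from this PR, the bodies are assumed:

```csharp
// Before (assumed shape): each probe awaited its own EF Core save.
// public async Task SaveCheckResultAsync(CheckResult result)
// {
//     _db.CheckResultsRaw.Add(result);
//     await _db.SaveChangesAsync();
// }

// After: fire-and-forget enqueue; CheckResultWriteQueue persists batches in the background.
public void SaveCheckResult(CheckResult result) => _writeQueue.Enqueue(result);
```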
4. Updated `MonitoringBackgroundService`

Fixed timer restart thundering herd

Previously all probe timers restarted every 15 seconds, causing probes to execute simultaneously.

Now:

- Timers restart only when intervals actually change
- Uses `ConcurrentDictionary<Guid, int>` to track intervals

Removed per-probe DB reads

Previously each probe executed `FindAsync(endpointId)`.

Now:

- Endpoints cached in-memory via `ConcurrentDictionary<Guid, Endpoint>`
- Refreshed periodically
- Probe execution performs zero DB reads

A sketch of both changes follows below.
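This is a sketch of the interval tracking and endpoint caching described above, assuming `Endpoint` exposes `Id` and `IntervalSeconds` and that `RestartTimer` is a hypothetical helper:

```csharp
using System.Collections.Concurrent;

private readonly ConcurrentDictionary<Guid, int> _intervals = new();
private readonly ConcurrentDictionary<Guid, Endpoint> _endpoints = new();

// Called on the periodic refresh instead of per-probe FindAsync(endpointId) reads.
private void SyncTimers(IEnumerable<Endpoint> endpoints)
{
    foreach (var ep in endpoints)
    {
        _endpoints[ep.Id] = ep; // probes read from this cache, never from the DB

        // Restart a probe timer only when its configured interval actually changed.
        if (!_intervals.TryGetValue(ep.Id, out var previous) || previous != ep.IntervalSeconds)
        {
            _intervals[ep.Id] = ep.IntervalSeconds;
            RestartTimer(ep.Id, TimeSpan.FromSeconds(ep.IntervalSeconds)); // hypothetical helper
        }
    }
}
```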
5. Optimized `StatusService`

Resolved multiple severe performance issues.

Removed expensive full-table scans

Old query pattern: `GroupBy + OrderByDescending + FirstOrDefault`, which triggered scans across millions of `check_result_raw` rows.

Status now uses:

- `endpoint.LastStatus`
- `endpoint.LastRttMs`

Fixed sync-over-async thread pool starvation

The previous implementation made blocking sync-over-async calls inside loops for every endpoint, tying up ASP.NET Core thread pool threads.

Replaced with (see the sketch after this list):

- `GetFlappingEndpointIdsAsync(endpointIds)`
- A single batched query
- `HashSet<Guid>.Contains()` lookups

Removed unused `CountAsync()`

Deleted an unnecessary DB query whose result was never used.
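The conversation excerpt above notes that the new method runs one query, computes flap status in memory, and returns a `HashSet<Guid>`. A sketch under those constraints; the DbSet and property names, the 30-minute window, and the three-transition flap threshold are assumptions:

```csharp
using Microsoft.EntityFrameworkCore;

public async Task<HashSet<Guid>> GetFlappingEndpointIdsAsync(
    IReadOnlyCollection<Guid> endpointIds, CancellationToken ct = default)
{
    var since = DateTime.UtcNow.AddMinutes(-30);

    // One batched, read-only query instead of a blocking call per endpoint.
    var recent = await _db.CheckResultsRaw
        .AsNoTracking()
        .Where(r => endpointIds.Contains(r.EndpointId) && r.Timestamp >= since)
        .Select(r => new { r.EndpointId, r.Timestamp, r.IsUp })
        .ToListAsync(ct);

    // Flap status is computed in memory; callers do HashSet.Contains() lookups.
    return recent
        .GroupBy(r => r.EndpointId)
        .Where(g => CountTransitions(g.OrderBy(r => r.Timestamp).Select(r => r.IsUp)) >= 3)
        .Select(g => g.Key)
        .ToHashSet();
}

private static int CountTransitions(IEnumerable<bool> states)
{
    var transitions = 0;
    bool? previous = null;
    foreach (var s in states)
    {
        if (previous is not null && previous != s) transitions++;
        previous = s;
    }
    return transitions;
}
```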
6. Optimized `RollupService`

Fixed unbounded memory usage in rollup processing.

Previous behavior: loaded entire tables into memory, then filtered in C#.

New behavior (sketched below):

- Date filtering moved into the SQL `WHERE` clause
- Added `AsNoTracking()`
- Removed blocking `.ToList()` calls
- Uses fully async queries

This dramatically reduces memory usage and DB pressure.
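To make the query shape concrete, a before/after sketch of the change described above; `CheckResultsRaw` and the property names are assumptions:

```csharp
using Microsoft.EntityFrameworkCore;

// Before: materialize the whole table, then filter in C#.
// var rows = _db.CheckResultsRaw.ToList()
//     .Where(r => r.Timestamp >= windowStart && r.Timestamp < windowEnd);

// After: filter in SQL, skip change tracking, stay async end to end.
var rows = await _db.CheckResultsRaw
    .AsNoTracking()
    .Where(r => r.Timestamp >= windowStart && r.Timestamp < windowEnd)
    .ToListAsync(cancellationToken);
```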
7. Optimized `HistoryService`

Fixed all historical query methods loading entire endpoint histories before filtering.

Updated methods:

- `GetRawDataAsync`
- `GetRollup15mDataAsync`
- `GetRollupDailyDataAsync`
- `GetOutagesAsync`

Changes:

- Date filtering moved into SQL
- Added `AsNoTracking()`
- Replaced synchronous list materialization with async equivalents
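The same pattern applied to one of the listed methods, as a sketch; the `CheckResultRaw` entity shape and the parameter list are assumptions:

```csharp
using Microsoft.EntityFrameworkCore;

public Task<List<CheckResultRaw>> GetRawDataAsync(
    Guid endpointId, DateTime from, DateTime to, CancellationToken ct = default)
    => _db.CheckResultsRaw
        .AsNoTracking()
        .Where(r => r.EndpointId == endpointId && r.Timestamp >= from && r.Timestamp < to) // filter in SQL
        .OrderBy(r => r.Timestamp)
        .ToListAsync(ct);
```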
8. Added `PruneBackgroundService`

Previously pruning logic existed but required manual invocation.

Added an automated scheduled pruning service (sketched below):

- Runs daily at 02:00
- Calls `IPruneService.PruneRawDataAsync()`
- Reads config from `Data:Pruning`
- Uses the scoped service pattern for DbContext access
This prevents uncontrolled growth of `check_result_raw`.
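A sketch of the scheduling loop, assuming the scoped-service pattern named above; `IPruneService.PruneRawDataAsync()` is from this PR, while the 02:00 computation and the omission of the `Data:Pruning` config read are illustrative simplifications:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

public sealed class PruneBackgroundService : BackgroundService
{
    private readonly IServiceScopeFactory _scopeFactory;

    public PruneBackgroundService(IServiceScopeFactory scopeFactory) => _scopeFactory = scopeFactory;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Sleep until the next 02:00, then prune.
            var now = DateTime.Now;
            var next = now.Date.AddHours(2);
            if (next <= now) next = next.AddDays(1);
            await Task.Delay(next - now, stoppingToken);

            // A singleton hosted service cannot inject a scoped DbContext directly,
            // so a scope is created per run and the prune service resolved from it.
            using var scope = _scopeFactory.CreateScope();
            var pruner = scope.ServiceProvider.GetRequiredService<IPruneService>();
            await pruner.PruneRawDataAsync();
        }
    }
}
```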
9. Updated `appsettings.json`

| Setting | Before | After | Reason |
| --- | --- | --- | --- |
| Data.Retention.RawDays | 60 | 30 | Rollups preserve long-term history; old raw data is redundant |
| Data.Pruning.IntervalHours | 168 | 24 | Continuous pruning prevents runaway table growth |
| EF Core SQL logging | Enabled | Removed | Reduced memory/log overhead |

These should ideally be separated into independent PRs.